Unprecedented progress in artificial intelligence (AI) depends on repeated testing against large swaths of data.
Owing to their ability to identify complex patterns in massive amounts of data, deep learning models have become important tools for complex data science tasks such as natural language processing (NLP) and image classification.
Benchmark Testing: What It Is
In an ever-changing world of technology, benchmark testing plays a critical role in gauging how capable AI technology is, detecting its weaknesses, and building stronger and smarter models.
From MNIST to GLUE to ImageNet, benchmarks have played a significant role in driving progress in AI research. They give the community a specific target to aim for, quantitative measures for comparing model performance, and a common objective around which to exchange ideas.
However, whenever a new benchmark is introduced, it saturates quickly. AI is advancing so fast that existing benchmarks are exhausted at a rapid pace, and each newly developed NLP model leaves them further behind.
Benchmarks keep saturating: seemingly every two months, a new NLP model is released. Yet for historical reasons, these benchmarks have remained static, because they were time-consuming and expensive to collect. And until recently, placing humans and models in the loop together was not seen as practical, since the models were difficult to work with.
Therefore, AI researchers need to invest more time in developing new benchmarks that can further improve AI performance.
Introducing Facebook Dynabench Benchmarking
It is time to rethink how we benchmark machine learning models.
To address this challenge, AI researchers at Facebook released Dynabench, a platform for data collection and benchmarking. The approach puts both humans and state-of-the-art (SOTA) AI models in a loop to build new datasets and to measure how often these models make mistakes when humans try to fool them.
This technique is called dynamic adversarial data collection. Dynabench can readily demonstrate how humans fool AI, and according to Facebook, this could be a better indicator of a model's quality than current benchmarks.
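To make "how often a model makes a mistake when humans try to fool it" concrete, here is a minimal Python sketch of such an error rate. The `Attempt` record, the `model_error_rate` function, and the sample reviews are all hypothetical and only for illustration; this is not Dynabench's actual code or API.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One human attempt to fool the model (hypothetical record layout)."""
    text: str
    human_label: str   # label the human annotator says is correct
    model_label: str   # label the model predicted


def model_error_rate(attempts):
    """Fraction of attempts on which the model disagreed with the human."""
    if not attempts:
        return 0.0
    fooled = sum(1 for a in attempts if a.model_label != a.human_label)
    return fooled / len(attempts)


# Made-up examples: two of the four attempts fool the model.
attempts = [
    Attempt("The plot was so good I fell asleep twice.", "negative", "positive"),
    Attempt("I loved every minute of it.", "positive", "positive"),
    Attempt("Great, another sequel nobody asked for.", "negative", "positive"),
    Attempt("Terrible acting and a dull script.", "negative", "negative"),
]

rate = model_error_rate(attempts)
print(f"Model error rate: {rate:.2f}")
```

A lower rate suggests a model that is harder to fool. In a real setting, the fooling examples would probably also need to be checked by other annotators before being counted.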
Rich analytical models are often difficult to develop because of the many interactions among job interference, application complexity, network topologies, and node components. In such situations, a machine learning-based performance model helps. Machine learning algorithms and methods are efficient at uncovering unknown interactions between the system and the application using data from application runs. This is where benchmarking plays a critical role in choosing the right model to bolster AI performance.
Douwe Kiela, a Facebook researcher, says that reliance on faulty benchmarks stunts AI growth: “You end up with a system that is better at the test than humans are but not better at the overall task. It’s very deceiving because it makes it look like we’re much further than we actually are.”
As a result, the Dynabench metric should better reflect how AI models perform in the situations that matter most: for instance, when interacting with people, who behave in complex ways that cannot be captured in a fixed set of data points.
Current static benchmarks pose several challenges:
👉 They force the AI community to focus entirely on one specific task. Researchers should worry less about a specific metric or task and more about how well AI systems function when people interact with them.
👉 A new benchmark must be neither too easy nor too hard when it is released; otherwise, it is likely to become outdated quickly.
👉 They contain annotation artifacts and inadvertent biases. Modern machine learning algorithms are very good at exploiting biases in benchmark datasets, so researchers must guard against overfitting to a specific dataset.
Now is the right time to improve the way AI researchers do benchmarking.
How Facebook Dynabench Benchmarking Improves AI Models
Dynabench lets AI researchers measure more precisely how good today's NLP models really are. In addition, the process yields data that can be used to train other models.
The core idea behind Dynabench is to leverage human creativity while challenging the models.
Machines are still far from comprehending language the way humans do. In Dynabench, for example, a language model can be asked to classify a review for sentiment analysis. Figures of speech such as hyperbole can easily fool the model. Human annotators therefore keep adding such adversarial examples until the model can no longer be fooled.
In this manner, humans stay in the loop through every step of progress the machine makes, unlike in traditional benchmarking.
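The human-and-model loop described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `base_model` is a toy keyword classifier, and "retraining" is faked by memorizing the examples that fooled the model. This is not how Dynabench or a real NLP model works internally.

```python
def base_model(text):
    """Toy sentiment classifier (hypothetical): flags obvious negative words."""
    negative_words = {"bad", "terrible", "awful", "boring"}
    words = set(text.lower().replace(".", "").replace(",", "").split())
    return "negative" if words & negative_words else "positive"


def predict(text, memory):
    """Use a memorized label if available, else fall back to the toy model."""
    return memory.get(text, base_model(text))


def collect_round(human_examples, memory):
    """Keep only the examples where the model disagrees with the human label."""
    return [(text, label) for text, label in human_examples
            if predict(text, memory) != label]


# Made-up reviews written by human annotators trying to fool the model.
human_examples = [
    ("I could watch this a million times.", "positive"),
    ("This film cured my insomnia in ten minutes.", "negative"),
    ("Best two hours of sleep I ever had.", "negative"),
    ("A terrible, boring mess.", "negative"),
]

memory = {}   # stand-in for retraining: fooling examples learned so far
rounds = []   # number of fooling examples found in each round
while True:
    fooled = collect_round(human_examples, memory)
    rounds.append(len(fooled))
    if not fooled:
        break                 # the model can no longer be fooled
    memory.update(fooled)     # "retrain" on the adversarial examples

print(rounds)
```

In practice, annotators would write fresh examples against each newly trained model rather than reuse one fixed list. The sketch only shows the shape of the loop: collect what fools the model, fold it back into training, and repeat.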
As AI engineers, researchers, and computational linguists start using Dynabench to improve the performance of their AI models, the platform will track which examples fool the models and lead to incorrect predictions.
Facebook Dynabench aims to improve current benchmarking practices, leading to models that make fewer mistakes and carry fewer harmful biases.