AI & LLMs: Llama 3 and the “Benchmark Saturation” principle explained
The recent launch of LLAMA 3 by Meta (formerly Facebook) has reignited the debate on measuring the performance of artificial intelligence (AI) models and the validity of current benchmarks. This new open-source model represents a significant step forward in the democratization of AI, but it also raises important questions about the objective evaluation of AI system capabilities.
Article originally prepared in Italian for my personal podcast Disruptive Talks (read it here). This content is also available as an audio podcast here and as a live-stream video here.
LLAMA 3, an acronym for “Large Language Model Meta AI,” has been made available in two main variants: one with 8 billion parameters and another with 70 billion. This differentiation allows for greater flexibility, adapting the model to various computational and performance needs. The model is freely accessible to developers through the Meta Llama Downloads portal, although it was not available in Italian at the time I published this review.
Llama 3: claims and reality
Like ChatGPT, LLAMA 3 is a large language model designed for natural language understanding and code generation. Its architecture, based on transformers, allows it to process and generate text. Meta claims that LLAMA 3 offers improved computational efficiency compared to some other models of similar size, potentially making it more accessible to researchers and developers with varying resource constraints.
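To make this concrete, here is a minimal sketch of how a developer might load and query the smaller 8-billion-parameter variant using the Hugging Face transformers library. The repository name, the gated-access step, and the hardware assumptions are my own illustrative choices, not instructions taken from Meta.

```python
# Minimal sketch: loading the 8B variant of LLAMA 3 with Hugging Face transformers.
# Assumptions: access to the gated "meta-llama" repository has been granted, you are
# logged in with `huggingface-cli login`, and a GPU with enough memory is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit the 8B weights on one GPU
    device_map="auto",           # requires the accelerate package to place layers
)

prompt = "Explain benchmark saturation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to the 70-billion-parameter variant, although it requires substantially more GPU memory.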
Meta’s adoption of an open-source approach for LLAMA 3 represents a significant turning point in the AI landscape. This approach offers numerous advantages:
- Transparency: open-source code allows for detailed examination of the model’s architecture and functioning.
- Collaboration: the global research community can contribute to improving the model.
- Accelerated innovation: free access stimulates the development of new applications and use cases.
- Reduction of barriers: it democratizes access to advanced AI technologies.
However, the open-source approach also presents challenges, particularly regarding performance standardization and comparability with closed proprietary models.
Regarding performance, and despite Meta’s claims, industry experts have expressed contrasting opinions on how LLAMA 3 compares to leading models such as OpenAI’s GPT-4 or Mistral AI’s offerings: some praise its results, while others suggest it may not match the capabilities of certain proprietary models in all tasks. These differing assessments highlight the ongoing challenges in comparing AI models and raise important questions about evaluation methodologies and the effectiveness of current benchmarks.
The Benchmark Dilemma: saturation and obsolescence
The discussion about LLAMA 3’s capabilities has brought to light a broader issue in the field of AI: the saturation and obsolescence of current benchmarks. This phenomenon is well illustrated by a graph from the “Exponential View” newsletter, showing the rapid progression of AI system performance compared to a human baseline.
The graph clearly shows how various AI benchmarks, including “ImageNet Top-5” for image recognition, “SQuAD” for text comprehension, and “SuperGLUE” for general language understanding, have reached or surpassed human performance in a relatively short period. This rapid progression highlights two fundamental problems:
- Benchmark saturation: this occurs when an AI model reaches the maximum score in a given test, rendering the benchmark ineffective in distinguishing further improvements.
- Accelerated obsolescence: benchmarks quickly become obsolete as AI models improve, requiring the continuous creation of new, more challenging tests.
Benchmark saturation is a complex phenomenon that deserves in-depth analysis. Technically, it manifests when an AI model achieves performance approaching 100% accuracy on a given test set. Several factors can contribute to it:
- Overfitting: models can “memorize” the correct answers for test datasets rather than developing true generalizable understanding (a minimal check for this kind of test-set contamination is sketched below).
- Intrinsic limitations of datasets: many benchmarks are based on finite datasets that may not adequately represent real-world complexity.
- Evolution of model architectures: with increasingly sophisticated architectures, such as the transformers used in LLAMA 3, models can exploit patterns in data that were not anticipated when the benchmark was created.
- Bias in training data: if training data and benchmarks share similar biases, models can perform well in tests without necessarily improving in real-world scenarios.
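To make the overfitting and contamination point concrete, here is a small, purely illustrative sketch of the kind of n-gram overlap check sometimes used to detect whether benchmark items leaked into a training corpus. The function names, the n-gram length, and the flagging rule are assumptions made for this example, not a reference to any specific published pipeline.

```python
# Illustrative sketch (not a specific published pipeline): flag benchmark items whose
# word n-grams also appear in the training corpus, a common heuristic for spotting
# test-set contamination that can inflate benchmark scores.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_items(benchmark_items: list[str], training_docs: list[str], n: int = 8) -> list[int]:
    """Indices of benchmark items sharing at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & train_grams]

# Any flagged item should be excluded (or at least reported) when scoring a model,
# otherwise the benchmark may reward memorization rather than generalization.
```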
To illustrate this concept, let’s consider the “ImageNet Top-5” benchmark. In this test, a model must correctly identify an image among the top 5 predictions. In 2015, AI models surpassed human performance on this benchmark, achieving 95.06% accuracy compared to 94.9% human accuracy. Since then, incremental improvements have brought accuracy close to 99%, making the benchmark less useful for distinguishing between high-quality models.
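For readers unfamiliar with the metric, the short sketch below shows how top-5 accuracy is computed from a model’s per-class scores; the arrays are made up for illustration, and the comment spells out why near-ceiling scores stop discriminating between models.

```python
import numpy as np

def top5_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of examples whose true label is among the 5 highest-scoring classes.

    scores: (num_examples, num_classes) array of model scores/logits.
    labels: (num_examples,) array of true class indices.
    """
    top5 = np.argsort(scores, axis=1)[:, -5:]      # indices of the 5 best classes per example
    hits = (top5 == labels[:, None]).any(axis=1)   # True where the label is in the top 5
    return float(hits.mean())

# Toy illustration with made-up numbers: on a 10,000-image test set, the gap between
# 99.0% and 98.8% top-5 accuracy is only 20 images, small enough to be swamped by
# labeling errors in the test set itself.
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 1000))   # 8 fake examples, 1000 fake classes
labels = rng.integers(0, 1000, size=8)
print(f"toy top-5 accuracy: {top5_accuracy(scores, labels):.2f}")
```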
Alongside benchmark saturation, the AI field is experiencing accelerated obsolescence of evaluation metrics. This phenomenon occurs when benchmarks become outdated at an increasingly rapid pace due to the swift advancements in AI capabilities. Key aspects of this accelerated obsolescence include:
- Rapid model improvements: as AI models like LLAMA 3 and others advance quickly, benchmarks that were challenging just months ago can become trivial, failing to capture the full extent of new capabilities.
- Shifting goalposts: the definition of what constitutes advanced AI performance is constantly evolving, necessitating frequent updates to evaluation criteria.
- Emergence of new capabilities: novel AI functionalities, such as advanced reasoning or multimodal understanding, may not be adequately assessed by existing benchmarks.
- Increased complexity of real-world tasks: as AI is applied to more complex and nuanced real-world scenarios, simple benchmarks become less representative of actual performance needs.
- Speed of research and development: the pace of AI research and development outstrips the rate at which new, meaningful benchmarks can be created and standardized.
This accelerated obsolescence challenges the AI community to continuously innovate in evaluation methodologies, ensuring that benchmarks remain relevant and indicative of genuine progress in the field. It also underscores the need for more dynamic and adaptable evaluation frameworks that can evolve alongside AI capabilities.
How to solve this issue?
Meta’s launch of LLAMA 3 perfectly illustrates the challenges related to AI benchmarks. Before release, Meta had announced that LLAMA 3 outperformed GPT-4 in almost all aspects. However, post-release evaluations showed less definitive results. This discrepancy raises several issues:
- Evaluation methodology: how were pre-release tests conducted, and how representative were they of the model’s actual performance?
- Rapid model evolution: is it possible that GPT-4 improved in the period between initial tests and LLAMA 3’s public release?
- Limitations of current benchmarks: the benchmarks used may not adequately capture all the nuances of advanced language models’ capabilities.
To address these challenges, the AI community is exploring new approaches to model evaluation. One promising direction is the development of dynamic benchmarks, which are systems designed to automatically evolve and remain challenging as model performance improves. This adaptive approach ensures that evaluations remain relevant even as AI capabilities rapidly advance. Another innovative strategy is the implementation of continuous evaluation. Instead of relying on point-in-time tests, this method proposes ongoing monitoring of performance in real-world scenarios, providing a more comprehensive and realistic assessment of AI capabilities.
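As a purely hypothetical illustration of what a dynamic benchmark could look like in code, the sketch below regenerates harder test items whenever a model’s score crosses a saturation threshold. The class name, the threshold, and the difficulty mechanism are inventions for the sake of the example, not an existing framework.

```python
# Hypothetical sketch of a "dynamic benchmark": when a model saturates the current
# test set, the harness raises the difficulty and regenerates items, so the
# evaluation keeps discriminating between models. Names and thresholds are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DynamicBenchmark:
    generate_items: Callable[[int], list[str]]   # difficulty level -> list of test prompts
    score_item: Callable[[str, str], bool]       # (prompt, model answer) -> correct?
    saturation_threshold: float = 0.95           # above this, the current set is "solved"
    difficulty: int = 1
    history: list[float] = field(default_factory=list)

    def evaluate(self, model_answer: Callable[[str], str]) -> float:
        """Score a model on the current item set; harden the benchmark if saturated."""
        items = self.generate_items(self.difficulty)
        score = sum(self.score_item(p, model_answer(p)) for p in items) / len(items)
        self.history.append(score)
        if score >= self.saturation_threshold:
            self.difficulty += 1                 # regenerate harder items next round
        return score
```

In practice, the hard part is the item generator itself: producing new test items that are genuinely harder, verifiable, and free of contamination remains an open research problem.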
Researchers are also focusing on creating generalization tests, which evaluate a model’s ability to apply learned knowledge to entirely new domains. This approach aims to measure not just memorization or pattern recognition, but true understanding and adaptability.
Lastly, there’s a growing emphasis on interdisciplinary collaboration in benchmark development. By involving experts from various fields, the AI community aims to create more comprehensive and representative benchmarks that can better capture the multifaceted nature of intelligence and its applications in diverse real-world contexts.
These new approaches collectively represent a shift towards more holistic, dynamic, and responsible methods of evaluating AI models, addressing the limitations of current benchmarks and paving the way for more meaningful assessments of AI progress.
The path to AGI (Artificial General Intelligence) remains uncertain, not only due to technical challenges but also due to the lack of a universally accepted definition of what constitutes artificial general intelligence. In this context, the development of more robust and meaningful benchmarks and evaluation methodologies becomes crucial not only for measuring progress but also for guiding responsible and ethical AI development.
The LLAMA 3 case serves as an important reminder that, in the rapid advancement of AI, our ability to measure and understand these advancements must evolve in tandem. Only through critical and continuous evaluation of our measurement methods can we hope to effectively navigate the complex waters of AI development, ensuring that technological progress translates into tangible and sustainable benefits for society.
For further inquiries or assistance with Artificial Intelligence, please feel free to reach out.
Notes
To further explore the topics discussed in this article and stay up-to-date with the rapidly evolving field of AI benchmarking and evaluation, consider the following resources:
Official LLAMA 3 Resources
- Meta AI LLAMA Page: official information about LLAMA 3, including downloads and documentation.
- LLAMA 3 Paper: scientific paper written by the Llama team.
AI Benchmarking Resources
- Papers With Code Benchmarks: a comprehensive list of AI benchmarks across various domains, updated regularly with state-of-the-art results.
- MLPerf: an industry-standard benchmark suite for machine learning.
Critical Discussions on AI Evaluation
- The Mythos of Model Interpretability: a seminal paper discussing the challenges in interpreting complex AI models.
- On the Dangers of Stochastic Parrots: a critical examination of large language models and their evaluation.
AI Ethics and Safety
- AI Ethics Guidelines Global Inventory: a collection of AI ethics guidelines from around the world.
- Future of Life Institute: resources on AI safety and beneficial AI development.
Newsletters and Blogs
- Exponential View: the newsletter mentioned in the article, offering insights into technological progress.
- AI Alignment Forum: discussions on aligning advanced AI systems with human values.
Academic Perspectives
- Stanford HAI: Stanford’s Human-Centered AI Institute, offering research and insights on AI development and evaluation.
- MIT Technology Review: regular coverage of AI advancements and their implications.
Interactive AI Experiences
- AI Test Kitchen: Google’s platform for experimenting with AI models and providing feedback.
- Hugging Face Model Hub: a platform to explore and compare various AI models, including some open-source alternatives to LLAMA 3.
Future Trends in AI Evaluation
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList: a paper proposing new methodologies for comprehensive AI model evaluation.