Date: 2024-12-08

Andrew Jackson, CEO of Inception

Inception, a G42 company, in collaboration with the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), announced the launch of the AraGen Leaderboard, a framework designed to redefine the evaluation of Arabic Large Language Models (LLMs).

The AraGen Leaderboard introduces an internally developed metric, 3C3H, which evaluates models across six key dimensions: correctness, completeness, conciseness, helpfulness, honesty, and harmlessness. We aim to offer an open platform that benchmarks AI models, balancing factual accuracy with usability and setting a new standard for Arabic Natural Language Processing (NLP). As we move into accelerated adoption of AI-enabled products within organizations and enterprises, it is increasingly important to have a relevant and fit-for-purpose method of measuring and benchmarking the quality of AI models. We recognized this need and wanted to contribute to the effort.

“Our long-standing partnership with MBZUAI, built on the development of JAIS and other underserved large language models (LLMs), has always focused on creating AI solutions for underserved communities. This collaboration has now expanded with the creation of the AraGen Leaderboard, a groundbreaking framework for evaluating Arabic LLMs,” said Andrew Jackson, CEO of Inception.

The AraGen Leaderboard is another step towards strengthening the AI ecosystem in the UAE and beyond, empowering researchers, developers, and organizations to build AI solutions that are culturally and linguistically relevant to the region, Jackson said.

Excerpts from an interview

Can you elaborate on the challenges in evaluating Arabic Large Language Models (LLMs) that AraGen addresses?

Evaluation is inherently complex, with two main approaches: automatic benchmarks and preference-based benchmarks. While automatic benchmarks are efficient, they often fail to evaluate real-world outputs and can be easily manipulated, intentionally or unintentionally. Preference-based benchmarks, on the other hand, struggle with reproducibility, an essential aspect of any scientific process, and are prone to biases, whether the preferences come from crowd-sourced human raters or from AI judges.

Arabic evaluation introduces an additional layer of complexity due to the unique linguistic and cultural nuances of the language. We were exposed to these challenges when developing JAIS, the Arabic LLM, and we know the inherent differences between Arabic and English when training AI models. These challenges amplify the limitations of traditional evaluation metrics, which are often designed for static, English-centric use cases and overlook key features needed for other languages.

The AraGen Leaderboard addresses these challenges by offering a robust framework that avoids benchmark leakage, ensures reproducibility, and integrates a holistic set of metrics. It evaluates both core knowledge and practical utility, pushing the boundaries of innovation in Arabic LLM development.

How does the dynamic nature of AraGen prevent benchmark leakage and ensure the reproducibility of results?

In AI evaluation, striking a balance between transparency, openness, and safeguarding your contributions is a significant challenge. Benchmarks often find their way into training datasets, either deliberately or as a by-product of data collection. Such leakage makes benchmark results either biased or obsolete.

To address this, we have implemented a dynamic framework that evolves with model capabilities. We will periodically update the benchmark set, releasing older test sets for community validation and reproducibility. This approach ensures the framework remains trustworthy while promoting transparency.

Moreover, the dynamic nature of AraGen mitigates issues like model performance saturation caused by benchmark hacking or contamination. By adapting to the evolving ecosystem, AraGen not only evaluates models effectively but also incentivizes developers to create more robust and optimized solutions over time. This dual focus on safeguarding fairness and driving innovation ensures that AraGen remains a catalyst for progress in Arabic NLP and, potentially, a model for leaderboards covering other languages and tasks.

This effort aligns with our mission at Inception and G42, which focuses on Responsible AI. We are proud to be stewards of this important topic that fosters inclusivity and preserves linguistic diversity.

Are there plans to expand AraGen’s approach to other underrepresented languages in the future?

We are dedicated to ensuring AI serves all communities, not just a select segment of the global population. This commitment aligns with our Group’s “Intelligence Grid” roadmap, emphasizing inclusivity and responsible AI development. While AraGen focuses on Arabic, its core framework is versatile and can be adapted to other languages or tasks. We plan to collaborate with developers from diverse regions, tailoring the framework to their linguistic and cultural needs.

However, implementing such frameworks requires significant resources, which is why they remain rare globally. To ensure sustainability and meaningful impact, we condition our collaborations on a demonstrated, long-term commitment from these communities to maintain and support their language’s ecosystem. This shared dedication is essential for building robust, dynamic frameworks that evolve with the needs of their respective linguistic and cultural landscapes.

Our efforts aim to empower underserved communities by addressing their unique linguistic challenges responsibly. Building on the success of AraGen, JAIS, and our recent Hindi LLM, “Nanda,” we envision fostering a global, inclusive AI ecosystem that supports diverse languages and cultures.

Can you share examples of potential use cases or industries that will benefit most from the AraGen Leaderboard?

The AraGen Leaderboard is designed as a versatile evaluation framework not tied to specific industries or use cases. It guides developers to select models that align with their application’s needs and resources. For instance, if a developer prioritizes honesty over conciseness, they can identify the best-performing model for that dimension, minimizing hallucinations and improving output reliability. Additionally, AraGen provides filters such as model size and precision, helping developers align their choices with resource constraints and deployment goals. The leaderboard’s task-specific insights further enable organizations to identify models optimized for their target applications, such as safety or conversational AI.

By centralizing evaluation efforts, AraGen eliminates the need for individual organizations to undertake resource-intensive evaluation processes. This promotes collaboration and accelerates progress in building impactful, culturally aligned AI solutions for the Arabic-speaking world.
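
To make the selection workflow described in the interview concrete, here is a minimal Python sketch of how a developer might shortlist models by one 3C3H dimension after filtering on size and precision. It is an illustration only, not the leaderboard's actual code or API: the `ModelEntry` fields, the `shortlist` function, the 0-to-1 score scale, and the three placeholder models with their scores are all assumptions made for this example.

```python
from dataclasses import dataclass

# The six 3C3H dimensions named in the article.
DIMENSIONS = (
    "correctness", "completeness", "conciseness",
    "helpfulness", "honesty", "harmlessness",
)


@dataclass
class ModelEntry:
    """One hypothetical leaderboard row (field names are illustrative)."""
    name: str
    params_b: float  # model size, in billions of parameters
    precision: str   # e.g. "bfloat16" or "float16"
    scores: dict     # dimension name -> score, assumed to lie in [0, 1]


def shortlist(entries, dimension, max_params_b=None, precision=None):
    """Filter by size and precision, then rank by one dimension (best first)."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown 3C3H dimension: {dimension!r}")
    kept = [
        e for e in entries
        if (max_params_b is None or e.params_b <= max_params_b)
        and (precision is None or e.precision == precision)
    ]
    return sorted(kept, key=lambda e: e.scores[dimension], reverse=True)


# Placeholder data, invented for this sketch; not real leaderboard results.
base = dict.fromkeys(DIMENSIONS, 0.5)
entries = [
    ModelEntry("model-a", 7.0, "bfloat16", {**base, "honesty": 0.81}),
    ModelEntry("model-b", 70.0, "bfloat16", {**base, "honesty": 0.90}),
    ModelEntry("model-c", 13.0, "float16", {**base, "honesty": 0.85}),
]

# A developer who prioritizes honesty and can only deploy models up to 15B.
for entry in shortlist(entries, "honesty", max_params_b=15):
    print(entry.name, entry.scores["honesty"])
```

With these placeholder values, the 70B model is excluded by the size filter even though it has the highest honesty score, which mirrors the trade-off between quality and resource constraints mentioned above.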