5 Ways AI Leaderboards Should Evolve, According to the Experts

Leaderboards once felt like a niche research tool; now they set the narrative for nearly every new model announcement. But as AI moves beyond labs and into sensitive, meaningful interactions with everyday users, these leaderboards look increasingly insufficient. Criticisms are mounting, from the gaming of scores and the underrepresentation of open-source models to serious misalignments between leaderboard metrics and real-world capabilities.
In our recent webinar, "Why AI leaderboards miss the mark," we gathered industry leaders to dissect these limitations. Their analysis highlighted an urgent need for benchmarks that go beyond purely technical measures and instead prioritize genuine user experience, ethical alignment, and real-world relevance. This resonates deeply with Prolific’s mission and informs our commitment to creating more human-centric evaluation frameworks.
Here are five expert-backed recommendations for how leaderboards must evolve to truly reflect AI's real-world impact.
1. Making Benchmarks Memorization-Resistant
One critical shortcoming highlighted by our panellists was the vulnerability of benchmarks to memorization. This is where models achieve high scores through rote recall rather than genuine reasoning or adaptability. Nora Petrova, an AI researcher at Prolific, pointed out how this undermines the fundamental purpose of evaluation, which is to measure how effectively models can generalize to new, real-world scenarios.
Nora specifically cited the ARC-AGI reasoning benchmark as an example designed to resist memorization by emphasizing scenarios that require authentic understanding and reasoning capabilities. By crafting tasks that are inherently resistant to memorization, benchmarks can better differentiate between superficial performance gains and genuine model intelligence.
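To make the idea concrete, here is a minimal sketch in Python (our illustration, not a description of ARC-AGI or of any benchmark discussed in the webinar) of one memorization-resistance tactic: generating fresh task instances at evaluation time, so a model cannot score well by recalling answers it has seen before. The `model_fn` callable is hypothetical and stands in for whatever model is being evaluated.

```python
import random

def make_sequence_task(rng: random.Random) -> tuple[str, int]:
    # Build a "continue the pattern" item with a freshly sampled rule,
    # so the exact question and answer cannot appear in any training set.
    start, step = rng.randint(1, 50), rng.randint(2, 9)
    sequence = [start + i * step for i in range(4)]
    prompt = f"What number comes next in the sequence {sequence}?"
    answer = start + 4 * step
    return prompt, answer

def evaluate(model_fn, n_items: int = 100, seed: int = 0) -> float:
    # Score the model on items generated for this run alone.
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        prompt, answer = make_sequence_task(rng)
        correct += int(model_fn(prompt).strip() == str(answer))
    return correct / n_items
```

Real memorization-resistant benchmarks are far richer than this, but the principle is the same: the evaluation items are novel by construction, so high scores have to come from reasoning rather than recall.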
2. Increasing Diversity in Tasks and Evaluations
Hua Shen, a postdoctoral scholar at the University of Washington who has led research on bidirectional human-AI alignment, emphasized the need to diversify the tasks and evaluations used in leaderboards. Hua pointed out that current evaluations often rely heavily on specific, narrowly defined tasks, limiting their applicability to broader, real-world scenarios. To achieve greater generalizability, benchmarks should incorporate diverse scenarios and involve evaluators with varied backgrounds and perspectives.
Building upon Hua’s perspective, Nora suggested that establishing standardized evaluation frameworks across different research labs would help ensure consistent comparisons. Standardization would clarify results, allowing researchers to confidently interpret performance metrics by ensuring that they're comparing similar tasks—“apples to apples”. Together, these improvements would help create benchmarks more reflective of actual AI performance across diverse real-world contexts.
3. Shifting from Marginal Gains to Real-World Applications
Hua also highlighted a critical issue: the excessive focus on marginal, incremental improvements in technical benchmarks rather than meaningful advances in real-world applications. Leaderboards often incentivize narrow optimization rather than genuine usefulness, diverting valuable resources away from developing AI models that significantly benefit end users.
She proposed a shift toward embedding AI systems within realistic scenarios, tasks, and contexts that reflect genuine user needs. By prioritizing real-world applicability and tangible benefits for users, AI benchmarks can encourage the development of more practical, impactful AI systems.
A shift like this would open the door to broader evaluations. Instead of chasing small performance gains, it would prompt developers and researchers to focus on what really matters. That includes improving how AI works in real-world situations and how people experience it.
4. Benchmarking the Benchmarks: Understanding Common Patterns
Oliver Nan, a Research Scholar at Cohere and author of "The Leaderboard Illusion," underscored the necessity of critically evaluating benchmarking practices themselves. Oliver observed that while new benchmarks frequently emerge, there remains insufficient comparative analysis of their methodologies and results.
He advocates for more meta-analytical research to identify common patterns and biases across benchmarks. By systematically reviewing how benchmarks are designed, implemented, and evaluated, researchers can uncover best practices and common pitfalls.
Nora complemented this perspective, mentioning initiatives like BetterBench that have begun surveying the benchmarking landscape. Such efforts to "benchmark the benchmarks" would significantly enhance the transparency, reliability, and effectiveness of AI evaluations, leading to more robust and informative leaderboard designs.
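To illustrate what this kind of meta-analysis can look like in practice, here is a minimal sketch in Python (our illustration, not the methodology of BetterBench or "The Leaderboard Illusion") that checks how strongly two leaderboards agree on model rankings; the model names and scores are hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same four models on two different benchmarks.
benchmark_a = {"model_1": 81.2, "model_2": 74.5, "model_3": 69.9, "model_4": 66.0}
benchmark_b = {"model_1": 55.1, "model_2": 58.3, "model_3": 47.2, "model_4": 31.8}

models = sorted(benchmark_a)  # fixed ordering over the shared model set
rho, p_value = spearmanr(
    [benchmark_a[m] for m in models],
    [benchmark_b[m] for m in models],
)
print(f"Rank agreement (Spearman rho): {rho:.2f}, p = {p_value:.2f}")
# A low rho suggests the two benchmarks are measuring different things,
# or that one of them is being gamed.
```

Run across many benchmark pairs, simple checks like this start to reveal which leaderboards move together, which are outliers, and where design choices rather than model quality may be driving the rankings.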
5. Expanding Beyond Technical Skills
Nora further argued for expanding the scope of leaderboards beyond purely technical metrics. According to her, true AI performance should also encompass alignment with human values, emotional intelligence, and ethical considerations: qualities not captured by traditional, technically focused benchmarks.
By broadening evaluations to include aspects such as trustworthiness, adaptability, and effective communication, leaderboards could better reflect an AI system's potential impact on human users.
This approach resonates with Prolific’s mission of incorporating diverse human perspectives into AI evaluations, ensuring that technology development remains closely aligned with the values, expectations, and real-world needs of everyday users.
Ultimately, embracing these broader, human-centric dimensions can drive the development of AI systems that are not only technically proficient but also genuinely beneficial and responsible in their interactions with people.
Shaping Human-Centric Leaderboards
AI is moving fast, and the way we evaluate it needs to keep up. At Prolific, we're focused on building benchmarks that reflect how AI affects real people, capturing what it’s like to use in the real world and whether it acts in ways people can rely on. Our goal is to move beyond scores for the sake of scores and instead highlight what really matters: how AI performs in the messy, unpredictable situations that make up everyday life.
For deeper insights from leading experts, watch our full webinar, "Why AI leaderboards miss the mark."