Confidential AI Benchmark (ARC-AGI-X): Little impact on the crypto market


Headline

Wharton School scholar Ethan Mollick proposes a “confidential” ARC-AGI-X benchmark to more fairly assess AI models

Summary

Ethan Mollick (Wharton associate professor, author of “Co-Intelligence,” 2024 TIME100 AI honoree) proposed an “ARC-AGI-X” benchmark on social media: a trusted third party would host the tests and keep the questions and formats confidential, so the leaderboard is public while the test content stays secret and models cannot train specifically on the test questions. His core idea is to improve evaluation methods so they genuinely measure progress in general intelligence, rather than continuing to reward scale stacking and “answer cramming.”

Analysis

The existing ARC-AGI benchmark was proposed by François Chollet in 2019 and tests “fluid intelligence” through novel grid puzzles. Human accuracy exceeds 85%, while AI systems still score below 50% (even on ARC-AGI-3 in 2026). The reasons for the gap include:

  • The public question pool leads to overfitting, causing models to “grind” questions rather than learn
  • Reliance on inefficient exhaustive search instead of efficient reasoning
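To make the contrast concrete, an ARC-style task gives a few input/output grid pairs and asks the solver to infer the transformation and apply it to a new grid. The sketch below uses an invented rule (horizontal mirroring) far simpler than real ARC tasks; it only illustrates why memorizing a public question pool does not transfer:

```python
# Minimal sketch of an ARC-style task. The rule (mirror each row) is a
# made-up stand-in; real ARC puzzles use varied, novel grid transformations.

def mirror(grid):
    """The hidden rule: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# Demonstration pairs: the solver sees these input -> output examples...
train_pairs = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]

# ...and must infer the rule well enough to solve a fresh test input.
# A model that merely memorized the training grids fails here: only the
# abstract transformation, not any stored answer, carries over.
test_input = [[0, 5], [5, 0]]
print(mirror(test_input))  # -> [[5, 0], [0, 5]]

for inp, out in train_pairs:
    assert mirror(inp) == out
```

A public question pool lets a model "solve" such tasks by recall; a confidential pool forces it to actually induce the rule.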

Mollick’s approach is to use a “confidential question pool + external expert validation” to prevent “teaching to the test,” forcing models to genuinely improve in reasoning and generalization. This addresses an old problem: public question pools make models “appear stronger,” but they may not have genuinely transferable abilities.
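The "confidential pool + public leaderboard" flow could work roughly as sketched below. All details here are assumptions for illustration (the pool contents, the `score_submission` API, and the hash commitment are hypothetical, not part of Mollick's proposal): the third party never publishes the questions, but publishes a hash so outsiders can later verify the pool was fixed in advance.

```python
import hashlib
import json

# Hypothetical question pool held only by the trusted third party;
# it is never published.
SECRET_POOL = [
    {"question": "task-1", "answer": "A"},
    {"question": "task-2", "answer": "C"},
]

# Public commitment: a hash of the pool, published up front, lets anyone
# verify later that the questions were not swapped after submissions.
COMMITMENT = hashlib.sha256(
    json.dumps(SECRET_POOL, sort_keys=True).encode()
).hexdigest()

def score_submission(predict):
    """Run a model's predict(question) against the hidden pool.

    Only the aggregate accuracy leaves the third party's side; the
    questions and reference answers stay confidential.
    """
    correct = sum(predict(t["question"]) == t["answer"] for t in SECRET_POOL)
    return correct / len(SECRET_POOL)

# Public leaderboard: scores only, no test content.
leaderboard = {"model-x": score_submission(lambda q: "A")}
print(COMMITMENT[:12], leaderboard)  # e.g. {'model-x': 0.5}
```

Because models only ever see their score, there is nothing to overfit to beyond the score itself, which is exactly the pressure toward genuine generalization the proposal aims for.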

The results of the 2025 ARC Prize also highlight this:

  • Scores improved through reinforced reasoning loops and adaptive testing
  • But efficiency still lags far behind humans
  • Therefore, benchmarks should focus more on “learning efficiency and generalization,” rather than “memory and fine-tuning gains”

Possible impacts include:

  • Experimental Design: It could prompt labs like OpenAI and Anthropic to adjust evaluation methods, reducing purely “score grinding” practices
  • Competitions and Open Source: If the confidentiality mechanism gains acceptance, it could enhance the comparative effectiveness of the open-source ecosystem, reducing misleading AGI milestone hype
  • Industry Communication: Mollick continues to bridge academia and industry, promoting “practically usable evaluation frameworks” to enter mainstream discussions

Key Information:

  • Core Judgment: The overfitting and “score grinding” issues of existing public benchmarks severely distort the assessment of models’ true reasoning abilities; confidential evaluations may help
  • Market Relevance: Recent impacts on crypto asset pricing and trading sentiment are weak, with discussions remaining at the level of AI evaluation methods
  • Observation Point: If the subsequent crypto AI sector begins adopting the notion of “confidential benchmarks/leaderboards,” it may draw short-term attention

Impact Assessment

  • Importance: High (affects AI evaluation methods and industry discourse)
  • Category: Technical Insight, AI Research, Industry Trends

Conclusion: For crypto traders and short-term funds, this topic is currently irrelevant; the real beneficiaries are researchers focused on AI evaluation and model capability verification. If you are an active trader in the crypto market, there is no need for immediate action; long-term investors can passively track and wait for signals indicating “AI evaluation mechanisms impacting the crypto AI sector.”
