Xbench: New AI Benchmark Fights Cheating with Dynamic Updates

HongShan Capital Group has introduced Xbench, an AI benchmark designed to expose models that merely memorize training data instead of reasoning. Unlike static evaluations, Xbench is updated regularly and draws on real-world tasks to assess genuine reasoning. It combines traditional tests with practical applications, and part of its question set is available as open source. OpenAI's o3 currently tops the leaderboard. The benchmark's components include Xbench-ScienceQA, which assesses STEM knowledge, and Xbench-DeepResearch, which tests a model's ability to navigate and research the Chinese-language web. Practical skills are evaluated through simulated recruitment and marketing workflows. Future versions of Xbench are planned to assess creativity, collaboration, and reliability. Zihan Zheng, a researcher involved in LiveCodeBench Pro, acknowledges that these qualitative traits are hard to quantify but views Xbench as a significant step forward in AI benchmarking.
