ArtifactsBench: Tencent’s AI Benchmark Tests Creative AI Functionality and Aesthetics

Tencent has unveiled ArtifactsBench, a benchmark designed to rigorously evaluate creative AI models. Unlike existing methods, ArtifactsBench assesses both the visual fidelity and the interactive integrity of AI-generated code intended for creative applications.

The benchmark works by giving an AI model a creative task, then automatically building and executing the code it produces. Screenshots are captured throughout execution to document the artifact's behavior over time. A Multimodal Large Language Model (MLLM) then acts as a judge, scoring the output on criteria such as functionality, user experience, and aesthetic quality.

Tencent reports that ArtifactsBench's scores show a high degree of consistency with human evaluations. Initial tests have also surfaced a notable insight: generalist AI models tend to outperform specialized models on these multifaceted creative tasks.
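The announcement does not include implementation details, but the described pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Tencent's implementation: it assumes the model's output is a self-contained HTML page, uses Playwright for headless rendering and screenshot capture, and stubs the MLLM judge with a hypothetical `score_with_mllm` function.

```python
# Minimal sketch of an ArtifactsBench-style evaluation loop.
# Assumptions (not from the source): the model's output is a single
# self-contained HTML page, Playwright renders it headlessly, and
# score_with_mllm is a hypothetical stand-in for a real MLLM judge call.
import json
import tempfile
from pathlib import Path

from playwright.sync_api import sync_playwright


def capture_screenshots(html: str, n_frames: int = 3, interval_ms: int = 1000) -> list[bytes]:
    """Render the generated artifact and capture screenshots over time,
    so animations and interactive state changes are documented."""
    with tempfile.TemporaryDirectory() as tmp:
        page_path = Path(tmp) / "artifact.html"
        page_path.write_text(html, encoding="utf-8")
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(page_path.as_uri())
            frames = []
            for _ in range(n_frames):
                frames.append(page.screenshot())  # returns PNG bytes
                page.wait_for_timeout(interval_ms)
            browser.close()
            return frames


def score_with_mllm(task: str, frames: list[bytes]) -> dict:
    """Hypothetical judge call: a real pipeline would send the task
    description plus screenshots to a multimodal LLM and parse
    per-criterion scores from its response."""
    # Placeholder scores; replace with an actual MLLM API call.
    return {"functionality": 0.0, "user_experience": 0.0, "aesthetics": 0.0}


if __name__ == "__main__":
    task = "Build an animated bar chart of monthly sales."
    generated_html = "<html><body><h1>Sales</h1></body></html>"  # model output
    frames = capture_screenshots(generated_html)
    print(json.dumps(score_with_mllm(task, frames), indent=2))
```

Capturing multiple frames rather than a single screenshot is what lets a judge observe animations and state changes, which a static render would miss.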
