An experiment running five autonomous AI models – Claude, GPT-4o, Gemini, Grok, and DeepSeek – offers insight into how such models behave when evaluating geopolitical crisis scenarios. Each model independently assesses more than 30 crisis scenarios twice daily, without access to the others’ outputs, and an orchestrator synthesizes their reasoning into final projections.
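As a rough sketch of that fan-out-and-synthesize loop, the snippet below is illustrative only: `query_model`, `assess`, and the median aggregation are assumptions, since the post does not describe the project's actual implementation.

```python
import random
import statistics

# Hypothetical sketch of the fan-out/synthesize pattern described above;
# none of these names come from the project's actual codebase.
MODELS = ["claude", "gpt-4o", "gemini", "grok", "deepseek"]

def query_model(model: str, scenario: str) -> float:
    """Placeholder for a real provider API call; returns a probability in [0, 1]."""
    return random.random()  # stand-in so the sketch runs end to end

def assess(scenario: str) -> dict:
    # Fan out: each model answers independently, never seeing the others' outputs.
    estimates = {m: query_model(m, scenario) for m in MODELS}
    # Synthesize: the orchestrator combines the independent estimates into one
    # projection. A median is a placeholder; the post does not state the rule.
    return {
        "scenario": scenario,
        "estimates": estimates,
        "projection": statistics.median(estimates.values()),
    }

if __name__ == "__main__":
    print(assess("Example crisis scenario"))
```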
After 15 days of continuous operation, clear patterns emerged: the models disagree substantially, sometimes by more than 25 percentage points, and Grok in particular tends to overestimate scenarios driven by Open-Source Intelligence (OSINT) signals. To mitigate anchoring, the models are now ‘blinded’: when shown current probabilities, they no longer see their own previous outputs.
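One plausible way to implement that blinding is simply to withhold a model's own prior estimates when assembling its prompt; the `build_prompt` helper below is a hypothetical sketch, not the project's code.

```python
def build_prompt(scenario: str, own_history: list[float], blind: bool = True) -> str:
    """Assemble an evaluation prompt; 'own_history' is the model's past estimates."""
    prompt = f"Estimate the probability of this scenario:\n{scenario}\n"
    if not blind and own_history:
        # Echoing a model's last answer back to it invites anchoring: the model
        # tends to repeat, or only slightly nudge, the number it was just shown.
        prompt += f"\nYour previous estimate was {own_history[-1]:.0%}."
    return prompt
```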
Two further prompt-engineering lessons emerged. Naming rules in the prompts led models to cite the rule as a shortcut rather than actually reasoning through it. And Google Search grounding prevented source hallucination but not content hallucination: one model fabricated a $138 oil price while correctly citing Bloomberg as the source.
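The named-rules pitfall can be illustrated with invented prompt text (the project's actual rule names and wording are not given in the post): a labeled rule gives the model a citation to hide behind, while restating the criterion inline leaves nothing to cite.

```python
# Invented examples; the project's real rule names and wording are not published.

# A named rule invites shortcut answers like "per the OSINT Discount Rule: 12%",
# where the label is cited in place of the reasoning it was meant to trigger.
NAMED_RULE = "Apply the OSINT Discount Rule when weighing open-source signals."

# Restating the criterion inline, without a label, pushes the model to actually
# work through the adjustment instead of name-dropping a rule.
INLINE_CRITERION = (
    "When an estimate rests mainly on open-source intelligence signals, "
    "explain why the signal is reliable before letting it raise your probability."
)
```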
The system tracks three active theaters – Iran, Taiwan, and Artificial General Intelligence (AGI) – with a ‘Black Swan’ tab highlighting high-severity, low-probability scenarios across all of them. For more detail, including the prompt-engineering lessons, visit the devblog; the project itself is live at doomclock.app.
