We Asked AI to Count Carbs 27,000 Times… Same Photo, Different Answers!? Shocking Truths About Carb Counting That Could Save Lives
📰 News Overview
- We sent the same 13 meal photos repeatedly to the latest AI models (OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro / 3.1 Pro Preview), totaling 26,904 submissions, to test how accurately they estimate carb content.
- Despite identical photos, identical prompts, and minimal randomness (temperature 0), outputs still varied for every model.
- With Gemini 2.5 Pro in particular, responses for a single paella photo ranged from 55g to 484g. That 429g spread corresponds to a 42.9-unit difference in insulin dose (which implies a ratio of 1 unit per 10g of carbs), a potentially lethal error.
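To make the paella figures concrete, here is a minimal sketch of the dosing arithmetic. The ratio of 1 unit per 10g is an assumption inferred from the numbers above (429g / 42.9 units); real ratios differ from person to person, and `bolus_units` is a hypothetical helper name, not something from the original study.

```python
# Assumed insulin-to-carb ratio: 1 unit covers 10 g of carbs.
# This value is inferred from the article's figures, not stated in it.
GRAMS_PER_UNIT = 10.0


def bolus_units(carbs_g: float, grams_per_unit: float = GRAMS_PER_UNIT) -> float:
    """Return the meal bolus (insulin units) for a given carb estimate."""
    if grams_per_unit <= 0:
        raise ValueError("grams_per_unit must be positive")
    return carbs_g / grams_per_unit


# Lowest and highest estimates reported for the same paella photo.
low_g, high_g = 55.0, 484.0

low_dose = bolus_units(low_g)
high_dose = bolus_units(high_g)
dose_gap = round(high_dose - low_dose, 1)

print(f"{low_dose:.1f} to {high_dose:.1f} units, gap {dose_gap} units")
```

Under that assumed ratio, the same photo could suggest anything from 5.5 to 48.4 units depending on which answer the model happened to return.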
💡 Key Points
- Variation in Model Consistency: Claude Sonnet 4.6 had the lowest coefficient of variation (CV) at 2.4%, making it the most stable, while Gemini 2.5 Pro showed a high CV of 11.0% and lacked consistency.
- The Risk of “Consistent Errors”: Claude Sonnet 4.6 was consistent, yet it underestimated a cheese sandwich (actual value 40g) as “28g” in all 510 trials, a reminder that consistency and accuracy are not the same thing.
- Occurrence of Hallucinations: Gemini 3.1 Pro reported nonexistent “deli meat” inside a cheese sandwich in 17.4% of trials, showing how visual misinterpretation leads directly to calculation errors.
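The difference between consistency and accuracy can be shown with a short sketch. It computes the CV (standard deviation divided by the mean) and the average signed error for two samples. The study's exact CV definition is not given here, so the population standard deviation is an assumption, and the `scattered` values are made up purely for contrast.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV in percent: population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m * 100


def mean_error(values, actual):
    """Average signed error in grams (negative means underestimation)."""
    return mean(values) - actual


# From the article: 28 g reported on all 510 trials, actual value 40 g.
sandwich = [28.0] * 510
# Made-up values for illustration only: scattered, but centered on the truth.
scattered = [30.0, 50.0, 35.0, 45.0, 40.0]

print(coefficient_of_variation(sandwich), mean_error(sandwich, 40.0))
print(coefficient_of_variation(scattered), mean_error(scattered, 40.0))
```

The sandwich sample has a CV of zero and is still 12g low on every trial, while the made-up scattered sample has a CV of roughly 17.7% and no average error. A low CV alone says nothing about whether the number is right.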
🦈 Shark’s Eye (Curator’s Perspective)
The terrifying part of this news is that behind AI’s ability to produce a “plausible single number” lies a huge distribution of uncertainty! The Gemini 2.5 Pro paella example is particularly shocking! For the very same photo, it may report “snack-level” carbs one moment and an amount fit for a sumo wrestler the next. This “output lottery” shows just how dangerous these tools can be in healthcare settings or self-management apps; the data speaks volumes! Claude Sonnet 4.6’s “high-precision underestimation” cannot be ignored either! This is a brilliant, concrete test that shatters the misconception that “the same answer every time means it’s correct,” striking directly at an AI blind spot!
🚀 What’s Next?
Relying on a single AI model for decisions is far too risky. Moving forward, we should see a shift towards consensus-based approaches involving multiple agents or hybrid systems with deterministic algorithms. Moreover, healthcare AIs might be required to display “confidence intervals” alongside their outputs!
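As a rough illustration of the consensus idea, the sketch below combines several estimates into a median, reports the observed range, and flags wide disagreement for human review. The function name, the 20% threshold, and the sample values are all hypothetical, and the min-to-max range is a simple stand-in rather than a true statistical confidence interval.

```python
from statistics import median


def consensus_estimate(estimates_g, max_relative_spread=0.2):
    """Combine repeated carb estimates into a median plus an observed range.

    max_relative_spread is an arbitrary illustrative threshold, not a
    clinically validated value.
    """
    if not estimates_g:
        raise ValueError("need at least one estimate")
    mid = median(estimates_g)
    low, high = min(estimates_g), max(estimates_g)
    # Relative spread: width of the observed range compared to the median.
    spread = (high - low) / mid if mid > 0 else float("inf")
    return {
        "median_g": mid,
        "range_g": (low, high),
        "needs_review": spread > max_relative_spread,
    }


# Made-up estimates standing in for several model runs on one photo.
stable = consensus_estimate([38.0, 40.0, 41.0])
# 55 and 484 are the article's paella extremes; 120 is made up.
unstable = consensus_estimate([55.0, 120.0, 484.0])

print(stable)
print(unstable)
```

The point of the design is that the app never shows a bare number: a tight cluster passes through with its range attached, while a paella-style spread gets flagged instead of being silently averaged away.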
💬 A Quick Word from Sharky
Taking AI’s calculation results at face value is like swimming blindfolded in a school of sharks! In the end, trusting your own instincts is the best way to go! 🦈🔥
📚 Terminology Explained
- Coefficient of Variation (CV): A measure of relative dispersion, calculated as the standard deviation divided by the mean. A smaller CV means results are consistently similar (high consistency).
- ICR (Insulin-to-Carb Ratio): The number of grams of carbs covered by one unit of insulin. Because the dose is calculated by dividing carbs by this ratio, a carb-counting error translates directly into a dosing error that can be life-threatening.
- Hallucination: When AI generates information that is not based on facts or does not exist. Here, it refers to the model recognizing ingredients that are not in the photo.
Source: He asked AI to count carbs 27000 times. It couldn’t give the same answer twice