We Asked AI to Count Carbs 27,000 Times... Same Photo, Different Answers!? Shocking Truths About Carb Counting That Could Save Lives

#Claude #OpenAI #Healthcare AI

※ This article contains affiliate advertising.

📰 News Overview

We sent the same meal photos 13 times to the latest AI models (OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro / 3.1 Pro Preview), for a total of 26,904 submissions, to test how accurately they estimate carb content.
Even with identical photos, identical prompts, and randomness minimized (temperature 0), every model produced varying outputs.
Gemini 2.5 Pro stood out: for a single paella photo, its estimates ranged from 55 g to 484 g, a gap that corresponds to a 42.9-unit difference in insulin dose and a potentially lethal error.
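
The 42.9-unit figure matches the 55 g to 484 g spread if we assume an insulin-to-carb ratio of 10 g per unit. That ratio is inferred from the numbers above, not stated in the source, and the function name is purely illustrative. A minimal sketch of the arithmetic:

```python
def insulin_units(carbs_g: float, icr_g_per_unit: float) -> float:
    """Bolus units needed to cover carbs_g at the given insulin-to-carb ratio."""
    if icr_g_per_unit <= 0:
        raise ValueError("ICR must be positive")
    return carbs_g / icr_g_per_unit


# Assumption: 10 g of carbs per unit of insulin (inferred, not from the source).
ASSUMED_ICR = 10.0

# Lowest and highest Gemini 2.5 Pro estimates for the paella photo.
low_g, high_g = 55.0, 484.0

dose_gap = insulin_units(high_g, ASSUMED_ICR) - insulin_units(low_g, ASSUMED_ICR)
print(f"Dose difference: {dose_gap:.1f} units")
```

With a different personal ratio the gap scales accordingly, so the same photo could imply an even larger or smaller dosing difference.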

💡 Key Points

Variation in Model Consistency: Claude Sonnet 4.6 was the most stable, with the lowest coefficient of variation (CV) at 2.4%, while Gemini 2.5 Pro came in at a high 11.0%, showing poor consistency.
The Risk of “Consistent Errors”: Claude Sonnet 4.6 was consistent, yet it underestimated a cheese sandwich (actual value 40 g) as 28 g in all 510 trials, 30% too low every time. Consistency and accuracy are not the same thing.
Occurrence of Hallucinations: Gemini 3.1 Pro reported nonexistent “deli meat” inside a cheese sandwich in 17.4% of trials, showing how visual misinterpretation feeds directly into calculation errors.
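
To make the difference between consistency and accuracy concrete, here is a rough sketch that computes CV and bias for two sets of runs. The run values are invented for illustration (only the 40 g reference and the 28 g answer come from the article), and using the sample standard deviation is an assumption, since the source does not say how its CV was computed.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation: sample standard deviation / mean, in percent."""
    return 100.0 * stdev(values) / mean(values)


def bias_percent(values, true_value):
    """Signed error of the mean estimate relative to the true value, in percent."""
    return 100.0 * (mean(values) - true_value) / true_value


TRUE_CARBS_G = 40.0  # cheese sandwich reference value from the article

# Hypothetical runs, invented for illustration only.
steady_but_low = [28.0, 28.0, 28.0, 28.0, 28.0]
noisy_but_centered = [30.0, 52.0, 38.0, 45.0, 35.0]

for name, runs in [("steady_but_low", steady_but_low),
                   ("noisy_but_centered", noisy_but_centered)]:
    print(f"{name}: CV={cv_percent(runs):.1f}%  "
          f"bias={bias_percent(runs, TRUE_CARBS_G):+.1f}%")
```

The first set has a CV of zero yet sits 30% below the true value, while the second averages out to the right answer but scatters widely. Neither metric alone says whether an estimate is safe to act on.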

🦈 Shark’s Eye (Curator’s Perspective)

The terrifying part of this news is that behind every “plausible single number” an AI produces lies a huge distribution of uncertainty! The Gemini 2.5 Pro paella example is especially shocking: from the very same photo, it may report snack-level carbs one moment and a portion fit for a sumo wrestler the next. The data makes plain how dangerous this “output lottery” would be in healthcare settings or self-management apps! Claude Sonnet 4.6’s “high-precision underestimation” cannot be ignored either. This is a brilliant, concrete examination that shatters the misconception that “the same answer every time means it’s correct” and hits an AI blind spot dead on!

🚀 What’s Next?

Relying on a single AI model for decisions is far too risky. Going forward, we may see a shift toward consensus-based approaches that combine multiple agents, or hybrid systems that pair AI with deterministic algorithms. Healthcare AIs might also be required to display “confidence intervals” alongside their outputs!
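
One way such a consensus step could look, as a loose sketch rather than anything described in the source: take the median of several estimates, report the range alongside it, and refuse to call it agreement when the range is too wide. The function name, the 25% threshold, and the sample numbers are all hypothetical, and a simple min-max range stands in here for a proper statistical confidence interval.

```python
from statistics import median


def consensus_estimate(estimates_g, max_spread_ratio=0.25):
    """Combine several carb estimates into a median plus a min-max range.

    Returns (mid, low, high, agreed). `agreed` is False when the range is
    wider than max_spread_ratio times the median (a hypothetical threshold).
    """
    if not estimates_g:
        raise ValueError("need at least one estimate")
    mid = median(estimates_g)
    low, high = min(estimates_g), max(estimates_g)
    agreed = mid > 0 and (high - low) <= max_spread_ratio * mid
    return mid, low, high, agreed


# Hypothetical estimates in grams, invented for illustration only.
tight = [38.0, 41.0, 40.0, 42.0]
wide = [60.0, 95.0, 130.0, 300.0]

for label, runs in [("tight", tight), ("wide", wide)]:
    mid, low, high, agreed = consensus_estimate(runs)
    status = "ok" if agreed else "no consensus, ask the user to verify"
    print(f"{label}: {mid:.1f} g (range {low:.0f}-{high:.0f} g) -> {status}")
```

Note that a median only protects against scatter. If every model shares the same bias, as in the 28 g sandwich case, the consensus will be confidently wrong too.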

💬 A Quick Word from Sharky

Taking AI’s calculation results at face value is like swimming blindfolded in a school of sharks! In the end, trusting your own instincts is the best way to go! 🦈🔥

📚 Terminology Explained

Coefficient of Variation (CV): A measure of relative dispersion, calculated as the standard deviation divided by the mean. A smaller CV means the results are more tightly clustered (high consistency).
ICR (Insulin-to-Carb Ratio): The number of grams of carbohydrate covered by one unit of insulin. Because the dose is calculated by dividing the carb amount by this ratio, a wrong carb estimate leads directly to a wrong, potentially life-threatening insulin dose.
Hallucination: When AI generates information that is not based on facts or does not exist. In this case, it refers to recognizing non-existent ingredients in the photo.

Source: He asked AI to count carbs 27000 times. It couldn’t give the same answer twice