[AI Minor News]

We Asked AI to Count Carbs 27,000 Times... Same Photo, Different Answers!? Shocking Truths About Carb Counting That Could Save Lives


Note: This article contains affiliate advertising.


📰 News Overview

  • The same meal photos were sent 13 times to the latest AI models (OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro / 3.1 Pro Preview), for a total of 26,904 submissions, to test how accurately they estimate carb content.
  • Even with identical photos, identical prompts, and minimal randomness (temperature 0), the outputs varied from run to run on every model.
  • Gemini 2.5 Pro stood out: for a single paella photo, its answers ranged from 55g to 484g, a gap that translates into a potentially lethal insulin dosing error of 42.9 units.
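
The 42.9-unit figure matches what you get by dividing the 429g gap between the two paella estimates by an insulin-to-carb ratio of 10g per unit. That ratio is an assumption inferred from the numbers here, not something this summary states, and `bolus_units` is a hypothetical helper name. A minimal sketch of the arithmetic:

```python
def bolus_units(carbs_g: float, icr_g_per_unit: float) -> float:
    """Meal bolus in insulin units: grams of carbs divided by the ICR."""
    if icr_g_per_unit <= 0:
        raise ValueError("ICR must be positive")
    return carbs_g / icr_g_per_unit


# Assumption: 1 unit of insulin covers 10 g of carbs (not stated in the article).
ASSUMED_ICR = 10.0

# Lowest and highest estimates reported for the same paella photo.
low_estimate_g, high_estimate_g = 55.0, 484.0

low_dose = bolus_units(low_estimate_g, ASSUMED_ICR)
high_dose = bolus_units(high_estimate_g, ASSUMED_ICR)
dose_gap = round(high_dose - low_dose, 1)

print(f"{low_dose:.1f} U vs {high_dose:.1f} U, gap {dose_gap:.1f} U")
```

With a different personal ratio, the gap in units changes proportionally, so the same 429g spread would be even larger in units for someone whose ratio is below 10g per unit.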

💡 Key Points

  • Variation in Model Consistency: Claude Sonnet 4.6 had the lowest coefficient of variation (CV) at 2.4%, making it the most stable model, while Gemini 2.5 Pro's CV was a high 11.0%, showing a lack of consistency.
  • The Risk of “Consistent Errors”: Claude Sonnet 4.6 was consistent, yet it underestimated a cheese sandwich (actual value 40g) as “28g” in all 510 trials, showing that consistency and accuracy are not the same thing.
  • Occurrence of Hallucinations: In 17.4% of trials, Gemini 3.1 Pro reported “deli meat” inside a cheese sandwich, showing how visual misinterpretations can lead directly to calculation errors.
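
To make the difference between consistency and accuracy concrete, here is a rough sketch that computes both a CV and a mean bias against the 40g reference value. The two run lists are invented for illustration and are not data from the study, and using the sample standard deviation is an assumption, since the summary does not say how the CV was calculated.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def mean_bias(values, true_value):
    """Average signed error: negative means underestimation."""
    return mean(values) - true_value


TRUE_CARBS_G = 40.0  # cheese sandwich reference value quoted in the article

# Hypothetical runs, invented for illustration only.
steady_but_low = [28.0, 28.0, 28.0, 28.0, 28.0]
scattered_but_centered = [30.0, 50.0, 35.0, 45.0, 40.0]

for name, runs in [("steady_but_low", steady_but_low),
                   ("scattered_but_centered", scattered_but_centered)]:
    cv = coefficient_of_variation(runs)
    bias = mean_bias(runs, TRUE_CARBS_G)
    print(f"{name}: CV={cv:.1f}%  bias={bias:+.1f} g")
```

The first list has a CV of zero and is still 12g short every time, while the second is right on average but unreliable on any single run. A low CV alone says nothing about whether the number is correct.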

🦈 Shark’s Eye (Curator’s Perspective)

The terrifying part of this news is that behind every “plausible single number” an AI produces lies a huge distribution of uncertainty! The Gemini 2.5 Pro paella example is particularly shocking: from the very same photo, one run returns “snack-level” carbs and another suggests a portion fit for a sumo wrestler. This “output lottery” shows just how dangerous these models could be in healthcare settings or self-management apps, and the data speaks volumes! Claude Sonnet 4.6’s “high-precision underestimation” can’t be ignored either! This is a brilliant, concrete examination that shatters the misconception that “the same answer every time means it’s correct,” striking directly at an AI blind spot!

🚀 What’s Next?

Relying on a single AI model for decisions is far too risky. Going forward, expect a shift toward consensus-based approaches that combine multiple agents, or hybrid systems that pair AI with deterministic algorithms. Healthcare AIs might also be required to display “confidence intervals” alongside their outputs!
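
One way such a consensus output might look is sketched below: pool repeated estimates from several models, report the median with an empirical 5th to 95th percentile range, and flag the result when the range is wide. Everything here is an assumption for illustration, including the model names, the numbers, the `summarize` helper, and the 20% threshold. The range is a simple percentile spread, not a formal statistical confidence interval.

```python
import math
from statistics import median


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(estimates_g, max_relative_spread=0.2):
    """Median plus a 5th-95th percentile range, flagged if the range is wide."""
    ordered = sorted(estimates_g)
    center = median(ordered)
    low = nearest_rank(ordered, 5)
    high = nearest_rank(ordered, 95)
    # Flag for human review when the range exceeds a fraction of the median.
    needs_review = (high - low) > max_relative_spread * center
    return {"median_g": center, "low_g": low, "high_g": high,
            "needs_review": needs_review}


# Hypothetical repeated estimates (grams of carbs) for one photo.
runs_by_model = {
    "model_a": [58.0, 60.0, 62.0],
    "model_b": [55.0, 61.0, 64.0],
    "model_c": [59.0, 60.0, 150.0],
}

pooled = [g for runs in runs_by_model.values() for g in runs]
summary = summarize(pooled)
print(summary)
```

The median resists a single wild answer like the 150g outlier here, while the wide range still surfaces that outlier to the user instead of hiding it behind one tidy number.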

💬 A Quick Word from Sharky

Taking AI’s calculation results at face value is like swimming blindfolded in a school of sharks! In the end, trusting your own instincts is the best way to go! 🦈🔥

📚 Terminology Explained

  • Coefficient of Variation (CV): A measure of relative dispersion, calculated as the standard deviation divided by the mean and usually given as a percentage. A smaller CV means the results are consistently similar (high consistency).

  • ICR (Insulin-to-Carb Ratio): The number of grams of carbs covered by one unit of insulin. Because the dose is calculated by dividing the carb count by this ratio, a carb-counting error carries straight through to the insulin dose and can cause life-threatening medication mistakes.

  • Hallucination: When an AI generates information that is not based on fact or does not exist. Here, it refers to recognizing ingredients that are not in the photo.

  • Source: “He asked AI to count carbs 27000 times. It couldn’t give the same answer twice”

[Disclaimer]
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.