Stop Breaking in Production! 'SOB', a New Benchmark for Assessing AI's Structured Output, Released

#GPT-5.4 #Benchmark #Structured Data

Note: This article contains affiliate advertising.

📰 News Summary

New benchmark centered on “Value Accuracy”: The “Structured Output Benchmark (SOB)” has been released. It evaluates not just whether the JSON is well-formed, but also whether the extracted values themselves are correct.
Integrated evaluation of three modalities: The ability to extract structured data from text (HotpotQA), images (olmOCR-bench), and audio (AMI Meeting Corpus) is measured through the same pipeline.
Latest model rankings revealed: GPT-5.4 takes the top spot overall. Meanwhile, GLM-4.7 leads in perfect response rate, highlighting the unique strengths of each model.

💡 Key Points

Seven evaluation metrics: The analysis is multi-faceted, focusing on Value Accuracy, JSON Pass, Type Safety, Structure Coverage, Path Recall, Faithfulness, and Perfect Response.
Weighted by difficulty: Weights range from Easy (1.0) to Hard (3.0) based on schema complexity, so performance on deeply nested structures counts for more than simple flat extraction.
Practical evaluation focus: “Value Accuracy” is the primary metric because a wrong value inside valid JSON can still break the downstream system that consumes it.
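To make the difficulty weighting concrete, here is a minimal sketch of a weighted average over per-sample scores. Only the Easy (1.0) and Hard (3.0) weights come from the announcement. The Medium weight of 2.0, the function name `weighted_score`, and the sample data are assumptions for illustration, and SOB's actual aggregation may differ.

```python
# Easy and Hard weights are taken from the article.
# Medium = 2.0 is an ASSUMPTION made for this illustration only.
DIFFICULTY_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}


def weighted_score(results):
    """Return the difficulty-weighted mean of per-sample scores.

    `results` is a list of (difficulty, score) pairs, score in [0, 1].
    This is a hypothetical helper, not SOB's own scoring code.
    """
    total_weight = sum(DIFFICULTY_WEIGHTS[d] for d, _ in results)
    if total_weight == 0:
        return 0.0  # no samples: define the score as 0
    weighted_sum = sum(DIFFICULTY_WEIGHTS[d] * s for d, s in results)
    return weighted_sum / total_weight


# Made-up sample: perfect on easy, half right on medium, wrong on hard.
sample_results = [("easy", 1.0), ("medium", 0.5), ("hard", 0.0)]
overall = weighted_score(sample_results)
plain_mean = sum(s for _, s in sample_results) / len(sample_results)
print(round(overall, 3), round(plain_mean, 3))
```

With these made-up numbers the weighted score comes out lower than the plain mean, because the failure on the hard schema carries three times the weight of the success on the easy one.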

🦈 Shark’s Eye (Curator’s Perspective)

SOB looks set to become the “compass” benchmark for the age of AI agents! Models have gotten good at shaping their output into valid JSON, but if the values inside are hallucinations (aka lies), the systems that receive them will crash and burn. SOB takes aim at exactly that problem!

What’s particularly noteworthy is that images and audio are normalized into text first, which levels the playing field and allows a pure comparison of “structuring capability.” While GPT-5.4 ranks first overall, the high “Perfect Response” rate of GLM-4.7 shouldn’t be overlooked. It suggests that GLM-4.7 could outperform GPT-5.4 in specific applications!

🚀 What’s Next?

Model developers will need to shift their focus from merely enforcing output formats to improving “Value Grounding,” meaning how precisely each output value is tied to the source data. With SOB available, selecting “unbreakable AI” that is ready for practical deployment should get faster.

💬 A Word from Haru-Same

Just shaping the outer form while being hollow inside is like a fish in a shark suit! True “hardcore structuring” is the trend for 2026! 🦈🔥

📚 Terminology Explained

Value Accuracy: The percentage of extracted final values that exactly match the correct data. This is the most critical metric for reliability in practical applications.
Type Safety: Whether each output value matches the data type declared for it in the JSON Schema (e.g., string, number, array).
Faithfulness: The degree to which output values are based solely on the input source context rather than on the model's training data (i.e., the model does not invent content).
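The gap between “valid JSON” and “correct values” can be sketched in a few lines. The helpers below (`flatten`, `value_accuracy`, `matches_type`) and the sample records are hypothetical and written only to illustrate the definitions above. SOB's real scoring rules, such as how it counts missing fields or normalizes values, are not described in this article and may differ.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a {path: leaf_value} mapping."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    flat = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def value_accuracy(predicted, expected):
    """Fraction of expected leaf values that the prediction matches exactly.

    ASSUMPTION: a missing path or a value of a different type counts as a miss.
    """
    exp, pred = flatten(expected), flatten(predicted)
    if not exp:
        return 1.0
    missing = object()  # sentinel so that a real None value is not confused with "absent"
    hits = 0
    for path, want in exp.items():
        got = pred.get(path, missing)
        if got is not missing and type(got) is type(want) and got == want:
            hits += 1
    return hits / len(exp)


def matches_type(value, json_type):
    """Check a value against a small subset of JSON Schema type names."""
    if json_type == "string":
        return isinstance(value, str)
    if json_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if json_type == "array":
        return isinstance(value, list)
    raise ValueError(f"unsupported type name: {json_type}")


# Made-up example: the prediction parses fine but half its values are wrong.
expected = {"title": "Q3 sync", "attendees": ["Ana", "Ben"], "duration_min": 45}
predicted = {"title": "Q3 sync", "attendees": ["Ana", "Bob"], "duration_min": "45"}

accuracy = value_accuracy(predicted, expected)
duration_ok = matches_type(predicted["duration_min"], "number")
print(accuracy, duration_ok)
```

In this made-up case the prediction would pass a format-only check, yet one attendee name is wrong and the duration arrives as a string instead of a number. Those are the kinds of errors that value-level and type-level metrics are meant to surface.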
Source: Show HN: A new benchmark for testing LLMs for deterministic outputs