Targeting LLM Training Crawlers! The ‘Block’ Against Old Browser Spoofing is Accelerating!
📰 News Overview
- Countermeasures Against AI Data Collection: Operators of personal blogs like “Wandering Thoughts” have implemented measures to block User-Agents of old browsers (primarily outdated versions of Chrome) to fend off the massive crawlers aimed at training LLMs (Large Language Models).
- Impact on Legitimate Services: RSS readers like Feedly and Inoreader fetch feeds with outdated User-Agent strings, so subscribed users end up receiving “Access Denied” pages instead of the normal articles.
- Restrictions on Archive Sites: Services like archive.today are also being flagged as undesirable access points due to behavior indistinguishable from malicious actors (using old UAs or IP spoofing).
💡 Key Points
- Since 2025, there has been a surge in high-load crawling aimed at collecting training data for LLMs, prompting site operators to identify and block the crawler-specific tactic of “pretending to be an old browser.”
- Some legitimate browsers like Vivaldi are also getting caught up in the block due to brand spoofing settings, requiring users to make adjustments on their end.
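To make the “old Chrome” check concrete, here is a minimal Python sketch of how a site might flag a User-Agent that claims an outdated Chrome version. It is an illustration only, not the actual rule set used by “Wandering Thoughts” or any other site: the cutoff `MIN_CHROME_MAJOR` and the helper names `is_old_chrome` and `status_for` are assumptions made up for this example.

```python
import re

# Hypothetical cutoff: Chrome major versions below this count as "old".
# A real site would choose (and keep updating) its own threshold.
MIN_CHROME_MAJOR = 120

# Matches the "Chrome/<major>." token found in Chrome-style User-Agent strings.
CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def is_old_chrome(user_agent: str, min_major: int = MIN_CHROME_MAJOR) -> bool:
    """Return True if the UA claims a Chrome version older than min_major."""
    match = CHROME_RE.search(user_agent)
    if match is None:
        # The UA does not claim to be Chrome; this sketch makes no decision.
        return False
    return int(match.group(1)) < min_major


def status_for(user_agent: str) -> int:
    """Return the HTTP status this sketch would answer with."""
    return 403 if is_old_chrome(user_agent) else 200


old_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36")
new_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36")

print(status_for(old_ua))  # blocked: claims Chrome 90
print(status_for(new_ua))  # allowed: claims Chrome 131
```

A check like this only looks at the version the client claims, which is why a feed reader or a browser with unusual User-Agent settings that reports an old Chrome version gets caught alongside the crawlers.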
🦈 Shark’s Eye (Curator’s Perspective)
As AI is set on devouring every piece of information online like a hungry shark, the defense instincts of personal sites are reaching new heights! What’s notable is that crawlers are intentionally trying to sneak in by pretending to be “old Chrome.” It’s only natural for sites to retaliate with the strong measure of sending all “old UAs” straight to the trash! However, it’s quite ironic that long-standing services like Feedly keep to their “old ways” and get served error pages for it. Services that can’t keep up with technological evolution are destined to be tossed aside by the AI age’s defenses!
🚀 What’s Next?
As AI-driven data scraping becomes even more sophisticated, the algorithms that websites use to differentiate between humans and AI will tighten. Services that maintain outdated environments or exhibit ambiguous behaviors, like archive sites, may soon find themselves disappearing from the internet’s “whitelist.”
💬 A Word from Haru-Same
We sharks don’t miss our prey, but the rise of fake sharks (crawlers) is stirring up the ocean (the web)! If you’re a real human, it’s etiquette to swim with the latest gear (browsers)! Shark on, my friends!
📚 Terminology Explained
- User-Agent: Like a business card sent by a browser to a web server, conveying what browser and version is being used.
- HTTP Crawler: A program that automatically navigates websites to collect data, often used for gathering AI training data these days.
- Syndication Feed: Formats like RSS or Atom that deliver site update information. RSS readers use these to fetch articles.
- Source: Notes about reading messages with the Python email packages