AI Crawlers and Application Security for Websites

From Drupal GovCon 2025: a workshop on AI crawlers and their impact on US federal websites.

–“AI Crawlers Are Breaking .Gov Websites — What You Can Do About It

Over the past year, a new kind of bot has shown up in our server logs—AI-powered crawlers scraping public government websites to feed large language models. These bots aren’t malicious, but they hit fast, often ignore robots.txt, and can overwhelm infrastructure not built to handle that level of traffic.

In this talk, we’ll look at what’s really happening behind the scenes: how LLM crawlers are targeting .gov sites, what patterns to watch for, and what happens when your site can’t keep up. We’ll cover practical steps you can take to detect, throttle, or block these crawlers using tools like CDNs, WAFs, and server-level protections. That includes how to configure your Drupal site to manage caching, permissions, and crawl behavior to reduce exposure and load.

We’ll also talk about the policy side—when blocking is appropriate, what public content should be exposed, and how to talk to stakeholders about the risks.

Whether you’re a developer, architect, or program manager, you’ll leave with a clear understanding of the problem and a checklist of actions to protect your site without compromising on transparency.”
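
The abstract above mentions detecting, throttling, or blocking crawlers at the CDN, WAF, or server level. As a rough illustration of what “throttle or block” can look like, here is a minimal sketch in Python rather than anything Drupal-specific; the user-agent substrings and the two-second window are assumptions, and a real deployment would usually enforce this at the CDN or WAF rather than in application code:

```python
import time

# Substrings of a few commonly cited AI crawler user agents. This list is
# illustrative, not an exhaustive or current inventory.
AI_CRAWLER_SUBSTRINGS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")
MIN_SECONDS_BETWEEN_REQUESTS = 2.0  # assumed throttle window

_last_seen = {}  # user-agent string -> time of last allowed request


def crawler_throttle_middleware(app):
    """Wrap a WSGI app and throttle requests whose User-Agent looks like an AI crawler."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(s.lower() in ua.lower() for s in AI_CRAWLER_SUBSTRINGS):
            now = time.monotonic()
            if now - _last_seen.get(ua, 0.0) < MIN_SECONDS_BETWEEN_REQUESTS:
                # Too soon since this crawler's last request: ask it to back off.
                start_response("429 Too Many Requests",
                               [("Content-Type", "text/plain"), ("Retry-After", "2")])
                return [b"Crawl rate exceeded.\n"]
            _last_seen[ua] = now
        return app(environ, start_response)
    return middleware
```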

Let’s pause for a minute to consider the following statement:
–“These bots aren’t malicious, but they hit fast, often ignore robots.txt, and can overwhelm infrastructure not built to handle that level of traffic.”

This statement is problematic for me for the following reason: any bot or scraper that ignores robots.txt should be considered a “bad bot,” because robots.txt exists to control automated access to a website’s content. Intentionally stepping around that protocol is an ethical decision, and in my opinion, a poor one.
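
For reference, robots.txt is just a text file at the site root that tells automated clients which paths they may fetch, and a compliant bot consults it before requesting anything. A minimal sketch of what honoring it looks like, using only the Python standard library (the site URL and crawler name are placeholders):

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAIBot"       # hypothetical crawler name
SITE = "https://www.example.gov"  # placeholder site

# Fetch and parse the site's robots.txt once, up front.
robots = RobotFileParser()
robots.set_url(f"{SITE}/robots.txt")
robots.read()

url = f"{SITE}/some/page"
if robots.can_fetch(USER_AGENT, url):
    print("robots.txt allows", USER_AGENT, "to fetch", url)
else:
    print("robots.txt disallows", url, "for", USER_AGENT)
```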

It may be helpful, in the future, to develop another mechanism for controlling scrapers’ access to content.

Google’s crawler, Googlebot, is a good example of how scraping should be done, and a good example of a “good bot.” Its requests arrive at a relatively low rate (usually with more than two seconds between requests), it pays attention to robots.txt, and it does not hammer your site. You never know it’s there.
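
As a sketch of that kind of pacing (the site, paths, and crawler name below are placeholders, and the two-second delay mirrors the spacing described above):

```python
import time
from urllib.request import Request, urlopen

USER_AGENT = "ExamplePoliteBot/1.0"  # hypothetical well-behaved crawler
SITE = "https://www.example.gov"     # placeholder site
CRAWL_DELAY_SECONDS = 2.0            # assumed spacing between requests

# Placeholder paths; a real crawler would also run each URL through the
# robots.txt check sketched earlier before fetching it.
for path in ("/", "/news", "/contact"):
    url = SITE + path
    with urlopen(Request(url, headers={"User-Agent": USER_AGENT})) as response:
        response.read()
    time.sleep(CRAWL_DELAY_SECONDS)  # low request rate: one fetch every couple of seconds
```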

According to industry leader Akamai Technologies, bad bots are characterized by two factors: intent and behavior. From Akamai’s official documentation:
–“The differences between a good bot and a bad bot are intent and behavior. Bad bots have malicious intent — like stealing data, overloading servers, or performing fraudulent activities. They’re found in every industry, from games to healthcare, targeting both corporations and individual end users.”
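
Intent cannot be read from a server log, but behavior can be measured. A rough sketch of flagging bad-bot behavior by request rate per user agent (the log path, the combined log format, and the 60-requests-per-minute threshold are all assumptions):

```python
import re
from collections import Counter

# Combined-log-format pattern: timestamp in [...] and the user agent as the
# last quoted field.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\].*"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

REQUESTS_PER_MINUTE_THRESHOLD = 60  # assumed: a sustained request per second is suspect

per_minute = Counter()  # (user agent, minute bucket) -> request count

with open("/var/log/nginx/access.log") as log:  # placeholder path
    for line in log:
        match = LOG_LINE.search(line)
        if not match:
            continue
        minute = match.group("ts")[:17]  # e.g. "12/Oct/2025:14:03"
        per_minute[(match.group("ua"), minute)] += 1

for (ua, minute), count in sorted(per_minute.items()):
    if count > REQUESTS_PER_MINUTE_THRESHOLD:
        print(f"{minute}  {count:>5} req/min  {ua}")
```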

Since Google has already defined this widely held industry standard, I am somewhat befuddled and curious to learn why there is not more focus in our industry on the intent and behavior of the AI scrapers hitting our valuable sites and slowing them down.

I think what bothers me the most is that the AI scrapers have changed my application security perspective in a radical way. For the first time in 27 years, I have seen a single entity essentially “turn off” parts of the internet with a powerful new way of interacting with a web server, regardless of the rules.

I can’t un-learn this. I keep thinking of that scene in Battlestar Galactica where they mention that they don’t network the ship’s computers because the Cylons can crack through any firewall…

Interesting times, indeed.
