
Problem statement

AI harvesting agents - also known as crawlers, bots, or spiders - are targeting memory institutions (galleries, libraries, archives, and museums), crawling their digital collections and retrieving their content for large language model training. This is not a new problem in itself: search engine crawlers have been doing much the same thing for many years to build their search indexes. However, the impact was mitigated by the cost of harvesting and the relatively small number of agents involved. The success of large language models (i.e. AI systems such as ChatGPT) has spawned many competitors, all of whom are extremely eager to obtain the kind of content which memory institutions often seek to make freely available: collections of academic papers, for example, well described by human-created metadata.

The result is a rise in traffic to institutions as these harvesting agents seek to retrieve the entire freely available contents of a site. The resulting traffic can impede the service so severely that it no longer functions properly, becomes very slow, or goes offline completely. In many ways, this behaviour resembles a Distributed Denial of Service attack (cf. the description of a DDOS attack from Cloudflare). While few episodes show all of these behaviours, the following are commonly found:

  • the volume of requests is often very high (up to millions of requests a day), with many made simultaneously
  • requests often come from multiple IP addresses simultaneously. In some cases, over 200 different IP addresses were used by the same harvester to make simultaneous requests
  • harvesters sometimes do not follow robots.txt restrictions
  • the User-Agent string does not always declare that the user-agent is a bot
  • the User-Agent string is often changed for each request, so as to make blocking based on user agent string difficult - it is sometimes hard or impossible to tell harvester traffic from legitimate traffic
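The distributed pattern described above - many addresses, one harvester - can often be surfaced by aggregating request logs by subnet rather than by individual IP. Below is a hypothetical sketch (not a tool any institution named here uses) that groups source IPs by /24 subnet with Python's standard `ipaddress` module; the IP addresses shown are documentation-reserved examples.

```python
from collections import Counter
import ipaddress

def requests_per_subnet(client_ips, prefix=24):
    """Count requests per subnet to surface distributed harvesters.

    A single harvester rotating through one provider's address space
    often shows up as many distinct IPs concentrated in a few subnets.
    """
    counts = Counter()
    for ip in client_ips:
        # strict=False lets us derive the containing network from a host IP
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(net)] += 1
    return counts

# Example: three requests from the same /24, one from elsewhere
ips = ["203.0.113.5", "203.0.113.77", "203.0.113.200", "198.51.100.9"]
print(requests_per_subnet(ips).most_common(1))
# → [('203.0.113.0/24', 3)]
```

In practice the same aggregation can be done directly on web server logs (e.g. with `awk` or a log analyser); the point is that per-IP counts understate the load a single distributed harvester generates.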

However, there are some observed differences from DDOS attacks:

  • the harvesters will often reduce the volume of requests, or pause their activities, if a site goes offline or slows down
  • the harvesters will rarely take a site down for any great length of time, even when - if their intentions were to do so - this would be straightforward.
  • each individual request is generally one that might be generated by a normal user, and so appears legitimate: no attempts are being made to compromise the server or its resources, for example with SQL injection.
Aggressive AI Harvests are not DDOS attacks

It is important to distinguish aggressive AI harvesting from DDOS attacks. Calling a harvest a DDOS, and requesting the usual networking/infrastructure-level responses to a DDOS attack, may not generate the desired response. For example, the fact that a service remains available under a high load might suggest, to an infrastructure or network administrator, that this is a DDOS problem that has been solved or managed when it has not. Alternatively, an initial response might be to treat the harvester as malicious, and block the harvester and all associated agents. This might not be appropriate; respectful, considerate crawling and harvesting is not undesirable, per se. Many institutions encourage others to freely download and use their resources as they see fit, in line with institutional goals to support the free dissemination of knowledge. Treating the ByteDance harvester as malicious, for example, may not be desirable, even though it can be quite aggressive and uses multiple IP addresses for its agents across multiple subnets.

An added complication is that it is difficult to distinguish "good" actors from "bad" actors at this early date, partly because what constitutes good vs. bad harvesting varies by institution, and partly because there are not as yet any established norms dictating what constitutes good behaviour that a responsible bot can follow (for example, rules in robots.txt files are largely irrelevant to, and ignored by, AI bots).
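By way of illustration, a robots.txt aimed at known AI crawlers might look like the following. GPTBot, CCBot, and Bytespider are real user-agent tokens published by OpenAI, Common Crawl, and ByteDance respectively; note that compliance with these rules is entirely voluntary on the crawler's part, which is precisely the limitation described above.

```
# Disallow known AI training crawlers (honoured only if the bot chooses to)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# All other agents may crawl everything
User-agent: *
Allow: /
```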

Collateral harm caused by aggressive AI harvesting
  • Manual costs.  A significant number of human hours must be invested across the organisation, ranging from work done by administrators responsible for deciding on a response and the risk profile of harvesting, and communicating their decisions to stakeholders, to network engineers, systems administrators, and software developers who are tasked with monitoring and implementing measures to lessen the impact on their services.  These costs can be measured in financial terms, and can also be considered opportunity costs: every hour spent addressing AI harvesting is an hour not spent on something more directly related to the institution's mission.
  • Exposure of protected materials.  Some content may not be meant to be harvested, due to copyright restrictions or privacy concerns.
Measures taken to address aggressive AI harvests

Most institutions are relying on some combination of the following measures to manage AI harvester traffic.

  • "Whack-a-mole". AI bots that are slowing down a service or site are identified when service begins to degrade, and their source IP ranges are blocked. This measure is only temporarily effective, as many harvesters simply spawn new bots on new source subnets. This approach also requires substantial manual intervention, to identify the bots and subnets, implement the block, and verify the results.
  • Geographical blacklisting.  Entire regions that have been identified as the source of abnormally high bot traffic are blocked entirely, if their IP subnets can be identified.  This carries the risk of blocking legitimate traffic, is generally only temporarily effective, and does not stop all harvesting.
  • Throttling.  Harvesters are not blocked, but limits are put in place to reduce the number of requests that can be made in a given time frame. Throttling is often implemented at the level of subnets, in order to target bot networks specifically.  This measure also has the risk of slowing down or temporarily blocking legitimate users.
  • Increase hardware/infrastructure resources.  This approach welcomes AI harvesting, and scales up server CPU, memory, and bandwidth to reduce strain on the systems from increased traffic. This approach is mostly useful where the institution's policy encourages unfettered access to its collections, and there is the money, infrastructure, and expertise to quickly scale up computing resources to meet demand.  There is the risk that increasing resources only increases demand, leading to perpetual resource starvation.
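The throttling measure above can be sketched in code. The following is a minimal, hypothetical illustration (a fixed-window counter keyed by /24 subnet, so a harvester rotating addresses within one subnet shares a single budget); real deployments would normally implement this in a reverse proxy or CDN rate limiter rather than application code, and the class and parameter names here are invented for the example.

```python
import ipaddress
import time
from collections import defaultdict

class SubnetRateLimiter:
    """Fixed-window rate limiter keyed by subnet rather than single IP,
    so bots rotating addresses within one subnet share one budget."""

    def __init__(self, max_requests, window_seconds, prefix=24):
        self.max_requests = max_requests
        self.window = window_seconds
        self.prefix = prefix
        self.counts = defaultdict(int)      # subnet -> requests this window
        self.window_start = time.monotonic()

    def allow(self, client_ip):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # new window: reset all counters
            self.counts.clear()
            self.window_start = now
        subnet = str(ipaddress.ip_network(f"{client_ip}/{self.prefix}",
                                          strict=False))
        self.counts[subnet] += 1
        return self.counts[subnet] <= self.max_requests

# Example: allow at most 2 requests per subnet per 60-second window
limiter = SubnetRateLimiter(max_requests=2, window_seconds=60)
print(limiter.allow("203.0.113.5"))    # → True
print(limiter.allow("203.0.113.77"))   # same /24 → True (second request)
print(limiter.allow("203.0.113.200"))  # same /24 → False (budget exhausted)
```

A fixed window is the simplest variant; sliding-window or token-bucket schemes smooth out bursts at window boundaries, and keying on larger prefixes (e.g. /16) catches harvesters spread across multiple adjacent subnets at greater risk to legitimate users.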
Service providers that filter AI harvester traffic
Other Resources

Code4Lib Slack channel on bots: https://code4lib.slack.com/archives/C074PDZQX4G

These links were gathered from discussions on Zoom and may find a better home on the page as it develops; for now, this is a collection of useful links on the subject.

https://creativecommons.org/2024/08/23/six-insights-on-preference-signals-for-ai-training/

https://www.haproxy.com/blog/nearly-90-of-our-ai-crawler-traffic-is-from-tiktok-parent-bytedance-lessons-learned

https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/MANUAL-CONFIGURATION.md

https://shelf.io/blog/metadata-unlocks-ais-superpowers/

