Problem statement

AI harvesting agents - also known as crawlers, bots, or spiders - are targeting memory institutions (galleries, libraries, archives and museums) to crawl their digital collections and retrieve their content for large language model training. This is not a new problem in itself: search engine crawlers have been doing much the same thing for many years to build their search indexes. However, that activity was mitigated by the cost of harvesting and the relatively small number of agents involved. The success of large language models (e.g. AI systems such as ChatGPT) has spawned many competitors, all of them extremely eager to obtain the kind of content which memory institutions often seek to make freely available: collections of academic papers, for example, well described by human-created metadata.

The result is a rise in traffic to institutions as these harvesting agents seek to retrieve the entire freely available contents of a site. The resulting traffic can so impede the service that it no longer functions properly, becomes very slow, or goes offline completely. In many ways, this behaviour resembles a Distributed Denial of Service attack (cf. the description of a DDOS attack from Cloudflare). While few episodes show all of these behaviours, the following are commonly found:

However, there are some observed differences from DDOS attacks:

Aggressive AI Harvests are not DDOS attacks

It is important to distinguish Aggressive AI Harvesting from DDOS attacks. Calling a harvest a DDOS, and requesting the usual networking/infrastructure-level responses to a DDOS attack, may not generate the desired response. For example, the fact that a service remains available under high load might suggest to an infrastructure or network administrator that this is a DDOS problem that has been solved or managed, when it has not. Alternatively, an initial response might be to treat the harvester as malicious and block it along with all associated agents. This might not be appropriate; respectful, considerate crawling and harvesting is not undesirable per se. Many institutions encourage others to freely download and use their resources as they see fit, in line with institutional goals to support the free dissemination of knowledge. Treating the ByteDance harvester as malicious, for example, may therefore not be desirable, even though it can be quite aggressive and uses multiple IP addresses for its agents across multiple subnets.

An added complication is that it is difficult to distinguish "good" actors from "bad" actors at this early date, partly because what constitutes good vs. bad harvesting varies by institution, and partly because there are not as yet any established norms dictating what constitutes good behaviour that a responsible bot can follow (for example, rules in robots.txt files are largely irrelevant to, and ignored by, AI bots).
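To make the robots.txt point concrete, the sketch below (Python, using the standard library's urllib.robotparser; the robots.txt rules, user-agent names and URL are illustrative assumptions, not any real institution's configuration) shows the voluntary check a well-behaved crawler performs before fetching a page. A harvester that ignores robots.txt simply never makes this check, which is why such rules offer little protection on their own.

    # A minimal sketch of the voluntary robots.txt check a well-behaved crawler
    # performs. The rules, agent names and URL below are illustrative assumptions.
    from urllib import robotparser

    example_robots_lines = [
        "User-agent: GPTBot",   # disallow one named AI crawler entirely
        "Disallow: /",
        "",
        "User-agent: *",        # allow everything for all other agents
        "Disallow:",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(example_robots_lines)

    url = "https://example.org/collections/item/123"  # hypothetical record URL

    # A crawler that honours robots.txt asks before fetching:
    print(parser.can_fetch("GPTBot", url))       # False: this agent is disallowed
    print(parser.can_fetch("FriendlyBot", url))  # True: falls through to the '*' rule

The entire mechanism depends on the crawler choosing to perform the can_fetch check; nothing on the server side enforces it.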

Corollary harm that can be caused by aggressive AI harvesting
Measures taken to address aggressive AI harvests

Most institutions are relying on some combination of the following measures to manage AI harvester traffic.
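As one hedged illustration of the identification step that typically underlies these measures, the sketch below (Python; the log path, request threshold and combined log format are assumptions to adapt to the local setup) tallies requests per client IP and per user agent from a web server access log, so that unusually heavy harvesters can be spotted even when, as noted above, their traffic is spread across many addresses and subnets.

    # Sketch: spotting unusually heavy clients in a web server access log.
    # The log path, threshold and "combined" log format are assumptions for
    # illustration; adjust them to the local server configuration.
    import re
    from collections import Counter

    LOG_PATH = "access.log"        # hypothetical path to the access log
    REQUEST_THRESHOLD = 10_000     # arbitrary cut-off for "aggressive" clients

    # combined log format: ip - - [time] "request" status bytes "referer" "user agent"
    LINE_RE = re.compile(
        r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
    )

    requests_by_ip = Counter()
    requests_by_agent = Counter()

    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LINE_RE.match(line)
            if not match:
                continue
            requests_by_ip[match.group("ip")] += 1
            requests_by_agent[match.group("agent")] += 1

    print("Heaviest user agents:")
    for agent, count in requests_by_agent.most_common(10):
        print(f"  {count:>8}  {agent}")

    print("Client IPs over the threshold:")
    for ip, count in requests_by_ip.most_common():
        if count < REQUEST_THRESHOLD:
            break
        print(f"  {count:>8}  {ip}")

Counting by user agent as well as by IP matters here, since a single harvester spread across many subnets may never exceed a per-IP threshold on its own.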

Useful Links

Service providers and products that filter AI harvester traffic
Community projects
Other Resources

Good description of the problem and attempts at remediation from a technical perspective: https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html

Code4Lib Slack channel on bots: https://code4lib.slack.com/archives/C074PDZQX4G

These were harvested from discussions on Zoom and may find a better home in the document once the page develops, but for now this is a dumping ground for useful links on the subject.

https://creativecommons.org/2024/08/23/six-insights-on-preference-signals-for-ai-training/

https://www.haproxy.com/blog/nearly-90-of-our-ai-crawler-traffic-is-from-tiktok-parent-bytedance-lessons-learned

https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/MANUAL-CONFIGURATION.md

https://shelf.io/blog/metadata-unlocks-ais-superpowers/

https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

A general guide to preventing web scraping, with a good discussion of do's and don'ts: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md