Problem statement

AI harvesting agents - also known as crawlers, bots, or spiders - are targeting memory institutions (galleries, libraries, archives and museums) to crawl their digital collections and retrieve their content for large language model training. This is not a new problem in itself: search engine crawlers have been doing much the same thing for many years to build their search indexes. Previously, however, the impact was mitigated by the cost of harvesting and the relatively small number of agents involved. The success of large language models (i.e. AI systems such as ChatGPT) has spawned many competitors, all of whom are extremely eager to obtain the kind of content that memory institutions often seek to make freely available: collections of academic papers, for example, well described by human-created metadata.

The result is a rise in traffic to institutions as these harvesting agents seek to retrieve the entire freely available contents of a site. The resulting traffic can so impede the service that it no longer functions properly, becomes very slow, or goes offline completely. In many ways, this behaviour resembles a Distributed Denial of Service attack (cf. Cloudflare's description of a DDOS attack). While few episodes show all of these behaviours, the following are commonly found:

However, there are some observed differences from DDOS attacks:

Aggressive AI Harvests are not DDOS attacks

It is important to distinguish Aggressive AI Harvesting from DDOS attacks. Calling a harvest a DDOS, and requesting the usual networking/infrastructure level responses to a DDOS attack, may not generate the desired response. For example, the fact that a service remains available under a high load might suggest, to an infrastructure or network administrator, that this is a DDOS problem which has been solved or managed when it has not. Alternatively, an initial response might be to treat the harvester as malicious, which might not be appropriate; respectful, considerate crawling and harvesting is not undesirable per se. Many institutions encourage others to freely download and use their resources as they see fit, in line with institutional goals to support the free dissemination of knowledge. For example, treating the ByteDance harvester as malicious may not be desirable, even though it can be quite aggressive and uses multiple IP addresses for its agents across multiple subnets.
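
One way to see this pattern in practice is to summarise web server access logs by user agent: an aggressive harvester typically shows a single agent string making very many requests from many distinct subnets, whereas ordinary crawling is far less concentrated. The sketch below is a minimal, hypothetical illustration only (it is not taken from any of the linked resources); it assumes an Apache/Nginx combined-format access log read from standard input, and counts requests and distinct IPv4 /24 subnets per user agent.

```python
#!/usr/bin/env python3
"""Minimal sketch: summarise crawler traffic from a combined-format access log.

Assumed log line shape (combined log format):
    1.2.3.4 - - [10/Oct/2024:13:55:36 +0000] "GET /item/1 HTTP/1.1" 200 512 "-" "SomeAgent/1.0"
"""
import re
import sys
from collections import defaultdict

# Loose pattern for the combined log format; adjust to your own log layout.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarise(lines):
    requests = defaultdict(int)   # user agent -> request count
    subnets = defaultdict(set)    # user agent -> distinct /24 subnets seen
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        agent = match.group("agent")
        ip = match.group("ip")
        requests[agent] += 1
        # Treat the first three octets as a /24 subnet (IPv4 only, for brevity).
        subnets[agent].add(".".join(ip.split(".")[:3]))
    return requests, subnets

if __name__ == "__main__":
    requests, subnets = summarise(sys.stdin)
    # Print the 20 busiest user agents with their subnet spread.
    for agent, count in sorted(requests.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{count:8d} requests from {len(subnets[agent]):4d} /24 subnets  {agent}")
```

A summary like this can inform a proportionate response, such as rate limiting a specific user agent, rather than treating the traffic as a network-level attack.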

Useful Links

These links were gathered from discussions on Zoom and may find a better home in the document as the page develops; for now, this is a collection of useful links on the subject.


https://creativecommons.org/2024/08/23/six-insights-on-preference-signals-for-ai-training/

https://www.haproxy.com/blog/nearly-90-of-our-ai-crawler-traffic-is-from-tiktok-parent-bytedance-lessons-learned

https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/MANUAL-CONFIGURATION.md

https://shelf.io/blog/metadata-unlocks-ais-superpowers/