Problem statement
AI harvesting agents - also known as crawlers, bots, or spiders - are targeting memory institutions (galleries, libraries, archives and museums) to crawl their digital collections and retrieve their content for large language model training. This is not a new problem in itself: search engine crawlers have been doing much the same thing for many years to build their search indexes. However, that activity was mitigated by the cost of harvesting and the relatively small number of agents involved. The success of large language models (AI systems such as ChatGPT) has spawned many competitors, all of them extremely eager to obtain the kind of content that memory institutions often seek to make freely available: collections of academic papers, for example, well described by human-created metadata.
The result is a rise in traffic to institutions as these harvesting agents attempt to retrieve the entire freely available contents of a site. The resulting load can so impede the service that it no longer functions properly, becomes very slow, or goes offline completely. In many ways, this behaviour resembles a Distributed Denial of Service (DDoS) attack (cf. Cloudflare's description of a DDoS attack). While few episodes show all of the following behaviours, each is commonly observed:
- the number of simultaneous requests is often very high
- requests often come from multiple IP addresses simultaneously. In some cases, over 200 different IP addresses were used by the same harvester to make simultaneous requests
- harvesters sometimes do not follow robots.txt restrictions
- the User-Agent string does not always declare that the agent is a bot
- the User-Agent string is often changed for each request, so as to make blocking based on the user agent difficult; it can be hard or impossible to tell harvester traffic from legitimate traffic (both patterns are visible in access logs; see the sketch after this list)
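Several of these patterns can be confirmed directly from web server access logs. The following is a minimal sketch, assuming the common Apache/nginx "combined" log format; the file name access.log and the flagging thresholds are illustrative assumptions, not recommendations.

```python
import re
from collections import defaultdict

# Assumes the Apache/nginx "combined" log format; adjust the pattern
# if the local server logs differently.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

requests_per_ip = defaultdict(int)
agents_per_ip = defaultdict(set)

with open("access.log") as log:  # hypothetical path
    for line in log:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ip = match.group("ip")
        requests_per_ip[ip] += 1
        agents_per_ip[ip].add(match.group("user_agent"))

# Flag addresses that are high-volume or rotate their User-Agent string:
# a normal browser session presents one UA, so many distinct UAs from a
# single address is a strong hint of harvester traffic. Thresholds are
# illustrative and should be tuned to the local service.
for ip, count in sorted(requests_per_ip.items(), key=lambda kv: -kv[1]):
    if count > 1000 or len(agents_per_ip[ip]) > 5:
        print(f"{ip}: {count} requests, "
              f"{len(agents_per_ip[ip])} distinct user agents")
```

Note that this per-IP view will understate a harvest spread across hundreds of addresses; grouping by subnet or by request pattern may be needed in those cases.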
However, there are at least three observed differences from DDoS attacks:
- the harvesters will often reduce the volume of requests, or pause their activities, if a site goes offline (see the rate-limiting sketch after this list)
- the harvesters will rarely take a site down for any great length of time, even though - if that were their intention - it would be straightforward to do so
- each individual request is generally one that might be generated by a normal user, and so appears legitimate: no attempt is made to compromise the server or its resources, for example with SQL injection.
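The first difference suggests a practical mitigation: if harvesters slow down when a service degrades, they may also respect an explicit HTTP 429 (Too Many Requests) response with a Retry-After header, shedding load without hard-blocking anyone. This is an assumption worth testing per harvester, not an established guarantee. Below is a minimal sketch using Python's standard-library WSGI server; the per-IP rate, window, and port are illustrative, and in production this logic would normally live at the reverse proxy or CDN rather than in the application.

```python
import time
from collections import defaultdict
from wsgiref.simple_server import make_server

# Illustrative per-IP limit: at most RATE requests per WINDOW seconds.
RATE, WINDOW = 30, 60
hits = defaultdict(list)  # ip -> timestamps of recent requests

def app(environ, start_response):
    ip = environ.get("REMOTE_ADDR", "unknown")
    now = time.time()
    # Keep only the timestamps that fall inside the sliding window.
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    if len(hits[ip]) >= RATE:
        # 429 plus Retry-After asks well-behaved clients to back off,
        # which matches the observed tendency of harvesters to pause
        # when a service pushes back.
        start_response("429 Too Many Requests", [
            ("Retry-After", str(WINDOW)),
            ("Content-Type", "text/plain"),
        ])
        return [b"Too many requests; please slow down.\n"]
    hits[ip].append(now)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()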
Aggressive AI harvests are not DDoS attacks
It is important to distinguish aggressive AI harvesting from DDoS attacks. Calling a harvest a DDoS, and requesting the usual networking/infrastructure-level responses to a DDoS attack, may not produce the desired result. For example, the fact that a service remains available under high load might suggest to an infrastructure or network administrator that this is a DDoS problem that has been solved or managed, when it has not. Alternatively, an initial response might be to treat the harvester as malicious, which may not be appropriate: respectful, considerate crawling and harvesting is not undesirable per se. Many institutions encourage others to freely download and use their resources as they see fit, in line with institutional goals to support the free dissemination of knowledge. Treating the ByteDance harvester as malicious, for example, may not be desirable, even though it can be quite aggressive and uses multiple IP addresses for its agents across multiple subnets.
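Where a harvester does declare itself, policy can be applied per crawler rather than treating all bot traffic as hostile. The sketch below triages requests by User-Agent string; GPTBot, ClaudeBot, CCBot and Bytespider are examples of crawler names their operators have published, but the list here is illustrative, not exhaustive, and the function name is hypothetical.

```python
# Illustrative sample of self-identifying AI crawler tokens; maintain
# and extend this list locally rather than treating it as complete.
DECLARED_AI_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

def classify(user_agent: str) -> str:
    """Triage a request by its User-Agent string.

    Returns "declared-ai" for self-identifying AI crawlers, which are
    candidates for rate limiting or robots.txt rules rather than outright
    blocking, and "unknown" otherwise - since undeclared harvesters are
    often indistinguishable from ordinary browser traffic by User-Agent
    alone.
    """
    if any(token in user_agent for token in DECLARED_AI_CRAWLERS):
        return "declared-ai"
    return "unknown"

print(classify("Mozilla/5.0 (compatible; Bytespider)"))  # declared-ai
print(classify("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # unknown
```

A "declared-ai" classification can then drive a graduated response - stricter rate limits, lower crawl priority, or redirection to a bulk-download endpoint - in keeping with institutional goals of open dissemination.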
Useful Links
These were harvested from discussions on Zoom and may find a better home in the document as the page develops; for now, this is a dumping ground for useful links on the subject.
https://creativecommons.org/2024/08/23/six-insights-on-preference-signals-for-ai-training/
https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/MANUAL-CONFIGURATION.md