...

  • the volume of requests is often very high (up to millions of requests a day)
  • requests often come from multiple IP addresses simultaneously. In some cases, over 200 different IP addresses were used by the same harvester to make simultaneous requests
  • harvesters sometimes do not follow robots.txt restrictions
  • the User-Agent string does not always declare that the user-agent is a bot
  • the User-Agent string is often changed for each request, making blocking based on the User-Agent string difficult; it is sometimes hard or impossible to tell harvester traffic from legitimate traffic (a log-analysis sketch follows this list)
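
One way to spot the User-Agent rotation described above is to look for subnets that send many requests under many different User-Agent strings. The Python sketch below is a minimal illustration of that kind of log analysis, not a definitive detector: the combined log format, the IPv4 /24 grouping, and the threshold values are all assumptions that would need adjusting for local infrastructure.

    # Sketch: flag /24 subnets whose requests rotate User-Agent strings.
    # Assumes an Apache/nginx combined-format access log (adjust the regex
    # for other log formats) and IPv4 addresses only.
    import re
    from collections import defaultdict

    LOG_LINE = re.compile(
        r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
    )

    def flag_rotating_subnets(log_path, min_requests=100, min_distinct_uas=20):
        """Return /24 subnets with many requests and suspiciously many UAs."""
        request_counts = defaultdict(int)   # subnet -> request count
        agents = defaultdict(set)           # subnet -> distinct User-Agents
        with open(log_path) as fh:
            for line in fh:
                m = LOG_LINE.match(line)
                if not m:
                    continue
                subnet = ".".join(m.group("ip").split(".")[:3]) + ".0/24"
                request_counts[subnet] += 1
                agents[subnet].add(m.group("ua"))
        return {
            s: (request_counts[s], len(agents[s]))
            for s in request_counts
            if request_counts[s] >= min_requests
            and len(agents[s]) >= min_distinct_uas
        }

The thresholds are arbitrary starting points. Legitimate NAT gateways (a campus network, for example) can also present many User-Agents from one subnet, so flagged subnets should be reviewed before any block is applied.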

However, there are some observed differences from DDoS attacks:

...

  • Manual costs.  A significant number of human hours must be invested across the organization, ranging from administrators, who must decide on a response and an acceptable risk profile for harvesting and communicate those decisions to stakeholders, to the network engineers, systems administrators, and software developers tasked with monitoring and implementing measures to lessen the impact on their services.  These costs can be measured in financial terms, and also as lost opportunity:  every hour spent addressing AI harvesting is an hour not spent on something more directly related to the institution's mission.
  • Exposure of protected materials.  Some content may not be meant to be harvested, due to copyright restrictions or privacy concerns.
  • Redirection/"hijacking" of managed content.  AI summaries insert themselves between a user seeking information and the sources of that information, resulting in fewer visits to the source websites where the information is created and maintained.  Although this is primarily a concern for sites that rely on traffic for revenue, it also makes the impact of an institution's online presence difficult, if not impossible, to measure.
  • Trickle-down effects to backend services.  Bot requests may trigger backend API calls to third-party resources, for example, which can exhaust API rate quotas or balloon API usage costs (a caching sketch follows this list).
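
One way to blunt the quota-exhaustion effect just described is to cache third-party API responses, so that repeated bot-triggered requests for the same resource spend quota only once. The decorator below is a minimal sketch: the name ttl_cache, the five-minute TTL, and the assumption that the wrapped call takes only hashable positional arguments are all illustrative.

    # Sketch: time-limited caching of third-party API calls, so repeated
    # bot-triggered requests do not each consume rate quota. The TTL and
    # the helper name are illustrative assumptions.
    import functools
    import time

    def ttl_cache(ttl_seconds=300):
        def decorator(fn):
            cache = {}                       # args -> (result, timestamp)
            @functools.wraps(fn)
            def wrapper(*args):
                now = time.time()
                hit = cache.get(args)
                if hit is not None and now - hit[1] < ttl_seconds:
                    return hit[0]            # fresh cached result: no API call
                result = fn(*args)
                cache[args] = (result, now)
                return result
            return wrapper
        return decorator

A backend lookup would then be wrapped with @ttl_cache(ttl_seconds=300) above its definition.
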
Measures taken to address aggressive AI harvesting

...

  • "Whack-a-mole". AI bots that are slowing down a service or site are identified when service begins to degrade, and their source IP ranges are blocked. This measure is only temporarily effective, as many harvesters simply spawn new bots on new source subnets. This approach also requires substantial manual intervention, to identify the bots and subnets, implement the block, and verify the results.
  • Geographical blacklisting.  Entire regions that have been identified as the source of abnormally high bot traffic are blocked outright, if their IP subnets can be identified.  This carries the risk of blocking legitimate traffic, is generally only temporarily effective, and does not stop all harvesting (a GeoIP sketch follows this list).
  • Throttling.  Harvesters are not blocked, but limits are put in place to reduce the number of requests that can be made in a given time frame. Throttling is often implemented at the level of subnets, in order to target bot networks specifically.  This measure also carries the risk of slowing down or temporarily blocking legitimate users (a token-bucket sketch follows this list).
  • Web Application Firewall (WAF) products.  Institutions contract with vendors to provide request filtering in front of their websites.
  • Increase hardware/infrastructure resources.  This approach welcomes AI harvesting, and scales up server CPU, memory, and bandwidth to reduce strain on the systems from increased traffic. This approach is mostly useful where the institution's policy encourages unfettered access to its collections, and there is the money, infrastructure, and expertise to quickly scale up computing resources to meet demand.  There is the risk that increasing resources only increases demand, leading to perpetual resource starvation.
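
For the "whack-a-mole" approach, the block itself is often just a generated deny list. The sketch below turns flagged subnets (for example, the output of the hypothetical flag_rotating_subnets helper sketched earlier) into an nginx deny file. The output path is an assumption, and nginx must be reloaded for changes to take effect.

    # Sketch: write flagged subnets to an nginx deny list. The path is an
    # assumption; reload nginx after the file changes.
    def write_nginx_denylist(flagged_subnets,
                             path="/etc/nginx/conf.d/bot-denylist.conf"):
        with open(path, "w") as fh:
            for subnet in sorted(flagged_subnets):
                fh.write(f"deny {subnet};\n")   # e.g. "deny 203.0.113.0/24;"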
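
Geographical blacklisting is commonly implemented with a GeoIP lookup. The sketch below uses the geoip2 library against a MaxMind GeoLite2 country database; the database path and the set of blocked country codes are placeholders, and, as noted above, legitimate users in those regions are blocked along with the bots.

    # Sketch: country-level blocking with the geoip2 library. The database
    # path and the blocked-country codes are placeholder assumptions.
    import geoip2.database
    import geoip2.errors

    BLOCKED_COUNTRIES = {"XX", "YY"}    # placeholder ISO country codes

    reader = geoip2.database.Reader("/var/lib/GeoLite2-Country.mmdb")

    def is_blocked(ip: str) -> bool:
        try:
            country = reader.country(ip).country.iso_code
        except geoip2.errors.AddressNotFoundError:
            return False                # unknown addresses pass through
        return country in BLOCKED_COUNTRIES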
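
Subnet-level throttling can be done with a per-subnet token bucket, as in the sketch below. The rate, burst size, and /24 grouping are illustrative assumptions; a production limiter would also need to evict idle buckets and handle IPv6.

    # Sketch: per-/24 token-bucket throttle. Rate and burst values are
    # illustrative assumptions, not recommendations.
    import time
    from collections import defaultdict

    class SubnetThrottle:
        def __init__(self, rate_per_sec=5.0, burst=20):
            self.rate = rate_per_sec
            self.burst = burst
            # subnet -> [available tokens, time of last refill]
            self.buckets = defaultdict(lambda: [burst, time.monotonic()])

        def allow(self, ip: str) -> bool:
            subnet = ".".join(ip.split(".")[:3])    # group requests by /24
            tokens, last = self.buckets[subnet]
            now = time.monotonic()
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            if tokens < 1:
                self.buckets[subnet] = [tokens, now]
                return False            # over the limit
            self.buckets[subnet] = [tokens - 1, now]
            return True

A front-end or middleware layer would call allow() for each request and respond with HTTP 429 (Too Many Requests) when it returns False.
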
Service providers and products that filter AI harvester traffic
Community projects
Other Resources

Good description of the problem and attempts at remediation from a technical perspective: https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html

Code4Lib Slack channel on bots: https://code4lib.slack.com/archives/C074PDZQX4G

...

https://shelf.io/blog/metadata-unlocks-ais-superpowers/

https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

A general guide to preventing web scraping, with a good discussion of do's and don'ts: https://github.com/JonasCz/How-To-Prevent-Scraping/blob/master/README.md