Problem statement

AI bots are targeting memory institutions (galleries, libraries, archives and museums), crawling their digital collections and retrieving the content for large language model (LLM) training.  In many cases, the number of bots making requests, the number of requests a single bot makes in a short period of time, and the quantity of information returned in response together overwhelm the network, CPU, and memory capacity of the platforms that host the digital resources.  The result is, in effect, a denial-of-service attack on those platforms:  user experience degrades as query and response times become unacceptably slow, or the systems crash altogether.

Some considerations that complicate the problem:

  • Bots do not come from one enterprise, or even a handful of them, but from all over, because the tools to harvest content and build LLMs are freely available.  This makes pinpointing a source difficult, and in some cases impossible.
  • Some bots behave worse than others.  For example, TikTok/ByteDance has been named as an especially aggressive and inconsiderate harvester.
  • Respectful, considerate crawling and harvesting is not undesirable, per se.  Many institutions encourage others to freely download and use their resources as they see fit, in line with institutional goals to support the free dissemination of knowledge.
  • It can be difficult to separate "real" user requests from automated bot requests.
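
As a rough illustration of the "number of requests in a short period of time" problem described above, the following is a minimal sketch of a sliding-window counter that flags a client once it exceeds a request threshold. The class name, the threshold values, and the choice to key clients by IP address are all assumptions made for illustration; none of them come from the discussions summarized here. Because bots arrive from many sources, a per-client counter like this would only catch the most aggressive individual clients.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Hypothetical helper: flag clients exceeding max_requests per window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Client key (assumed here to be an IP address) -> recent timestamps.
        self._hits = defaultdict(deque)

    def record(self, client: str, timestamp: float) -> bool:
        """Record one request; return True if the client is over the limit."""
        hits = self._hits[client]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Illustrative thresholds only: more than 3 requests in 10 seconds is flagged.
counter = SlidingWindowCounter(max_requests=3, window_seconds=10.0)
flags = [counter.record("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
print(flags)
```

In practice the timestamps would come from a server access log or a request hook, and a flagged client might be throttled rather than blocked outright, since considerate harvesting is not unwelcome.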

These links were harvested from discussions on Zoom and may find a better home elsewhere in the document once the page develops, but for now this is a dumping ground for useful links on the subject.


https://creativecommons.org/2024/08/23/six-insights-on-preference-signals-for-ai-training/

https://www.haproxy.com/blog/nearly-90-of-our-ai-crawler-traffic-is-from-tiktok-parent-bytedance-lessons-learned

https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/MANUAL-CONFIGURATION.md

https://shelf.io/blog/metadata-unlocks-ais-superpowers/
