See Trading reviews on Pull Requests for how to get immediate attention to that PR!

Notes

  • No updates on DSpace-CRIS merger discussions.  Discussions are still ongoing, but there is nothing significant to share at this time.
  • No updates on 10.0 planning at this time either.  However, there is plenty of work on the board needing review/testing (more on that below).
  • Discussion with Google Scholar team around helping DSpace sites to manage aggressive bots.
    • Discussion Ticket: https://github.com/DSpace/dspace-angular/issues/4565
    • Tim had a discussion with Google Scholar (GS) team this week.
    • They reached out to talk about how their bots are increasingly being blocked by accident by repositories (not just DSpace, but DSpace is still the most used system they encounter).  They are also finding that more and more repositories are crashing or responding slowly (likely because they are overwhelmed with bot traffic).
    • They understand why: everyone has to protect themselves against the aggressive harvesting bots (especially AI bots).  Even GS needs to protect itself against these bots.
    • The worst of the aggressive bots do not obey any rules.  They ignore robots.txt, harvest as fast as possible (until the site crashes), don't identify themselves, and may even change IPs frequently.
    • GS was asking if there's anything we (as DSpace Developers) can do to help the DSpace sites out there that are having issues dealing with these bots on their own.
    • Could we brainstorm ways to either:
      • Provide documentation on common strategies to use to alleviate the effect of these bad bots.  What are small things that sites can do that have an impact?
      • Perhaps provide tools built into DSpace which can help to alleviate some of the effects of these bots?  E.g. improve our built-in rate limiter, or provide a better "out of the box" configuration for it.
    • As we all know (and GS admitted) there's no perfect solution.  So, anything we provide to DSpace users won't be perfect for all sites (especially larger ones).  But, if we can simply give smaller sites better advice, it might increase the chance their site doesn't crash when the bad bots arrive, and decrease the chance they accidentally block good bots (like GS).
    • Brainstorms from our meeting:
      • Could we ask Google Scholar to limit their bot to just one per site?  Some sites see several at once.
      • Could we address this by continuing to address performance in DSpace?  If DSpace has improved performance it helps our users and also decreases the likelihood that bot attacks will take down a site.
        • We all agree performance is something we have to continue to improve.  Tim asks if we can start to create more tickets around areas of the application that need performance improvements.  These improvements are likely to be incremental, so it will likely take many developers making small improvements to have a big impact.
        • Will this actually address the problem of bad bots?  The bad bots seem to be harvesting as fast as possible...if DSpace gets faster, the bots will just harvest faster.
        • Maybe performance work needs to be combined with a rate-limiter type of solution.
      • Improving rate-limiter
        • We have a built-in rate limiter, but it's unclear if it's beneficial.  Its configuration may need looking at.  Or we could look at whether other rate limiters are more beneficial.
        • Could we consider adding a Captcha to it?  It could let us lower the limit significantly, and if a human hits the limit, they can get past it via a Captcha.
      • Improvements to robots.txt from 4Science
        • 4Science has made some improvements to robots.txt which seem to benefit Google Scholar bots.  They can share this back to DSpace.
        • This won't fix behavior of bad bots, but it still would make good bots more efficient.  So, it'd still be great to add to DSpace.
      • Sharing what others are doing that has shown a good benefit:
    • Other Topics

Action items