Google Scholar wants to be a "good bot" and continue to harvest DSpace (and other repositories), but they are finding their bots accidentally blocked more frequently alongside the "bad bots".
Other topics
If you have a topic, add it here.
Board Review:
10.0 Project Board - Review PRs collaboratively, or assign new PRs to volunteers to code review and/or test.
Backlog Board - Are there any tickets here stuck in the "Triage" column? We'd like to keep this column as small as possible.
To quickly find PRs assigned to you for review, visit https://github.com/pulls/review-requested (This is also available in the GitHub header under "Pull Requests → Review Requests")
Deadline is TBD for 9.2, 8.3 and 7.6.5. Bug fix releases do not have fixed/scheduled deadlines. Instead, the developer team will determine when to create a release based on the significance of the issues to be solved. (e.g. If major issues are fixed, a bug fix release will occur more rapidly. If only minor issues are found, a bug fix release may be delayed until sufficient fixes have accumulated to warrant a release.)
Bug/security fixes only. These minor releases will not include any new features.
New "themeable components" (for dspace-angular) are allowed in bug fix releases, provided that they don't significantly modify component behavior or similar.
Accessibility fixes are also allowed in bug fix releases, provided they don't significantly modify component behavior or similar.
Bug fix PRs should be created against the "main" branch where possible. The "main" branch has the strictest code style rules. (i.e. PRs created against dspace-7_x are becoming more difficult to port forward.)
Per our support policy, bug fixes are only guaranteed to be ported back to 9.x. That said, where possible, we'll try to backport bug fixes (especially significant ones) to 8.x and 7.6.x.
Try "Pull Request Trading" for a quicker review
Do you have a PR stuck in "under review" that you really want to see move forward? Or maybe it's someone else's PR, but you want to draw more attention to it?
Tim had a discussion with Google Scholar (GS) team this week.
They reached out to discuss how their bots are increasingly being blocked by accident by repositories (not just DSpace, though DSpace is still the most common system they encounter). They are also finding that more and more repositories are crashing or responding slowly, likely because they are overwhelmed with bot traffic.
They understand why... everyone is having to combat / protect themselves against the aggressive harvesting bots (especially AI bots). Even GS is needing to protect themselves against these bots.
The worst of the aggressive bots do not obey any rules. They ignore robots.txt, they harvest as fast as possible (until the site crashes), they don't identify themselves, and they may even change IPs frequently.
GS was asking if there's anything we (as DSpace Developers) can do to help the DSpace sites out there that are having issues dealing with these bots on their own.
Could we brainstorm ways to either:
Provide documentation on common strategies to alleviate the effects of these bad bots. What are small things that sites can do that have an impact?
Perhaps provide tools built into DSpace which can help alleviate some of the effects of these bots? E.g. improve our built-in rate limiter, or provide a better "out of the box" configuration for it.
As we all know (and GS admitted) there's no perfect solution. So, anything we provide to DSpace users won't be perfect for all sites (especially larger ones). But, if we can simply give smaller sites better advice, it might increase the chance their site doesn't crash when the bad bots arrive, and decrease the chance they accidentally block good bots (like GS).
Brainstorms from our meeting:
Could we ask Google Scholar to limit their bot to just one per site? Some sites see several at once.
Could we address this by continuing to improve performance in DSpace? Better performance helps our users and also decreases the likelihood that bot traffic will take a site down.
We all agree performance is something we have to keep improving. Tim asks if we can start to create more tickets around areas of the application that need performance improvements. These improvements are likely to be incremental, so it will likely take many developers making small improvements to produce a big impact.
Will this actually address the problem of bad bots? The bad bots seem to be harvesting as fast as possible...if DSpace gets faster, the bots will just harvest faster.
Maybe it needs to be combined with a rate-limiter-type solution.
Improving the rate limiter
We have a built-in rate limiter, but it's unclear how beneficial it is. Its configuration may need a closer look (see the sketch below), or we could evaluate whether other rate limiters are more effective.
Could we consider adding a Captcha to it? It could let us lower the limit significantly, and if a human hits the limit, they can get past it via a Captcha.
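As a reference point for that discussion: in recent dspace-angular releases, the Node/Express server includes a rate limiter configured in config.yml. The property names and values below are illustrative only and may differ by version; they are not a recommended setting, just a sketch of the knobs we'd be tuning.

```yaml
# Illustrative sketch only: rate limiter settings for the dspace-angular
# Node/Express server (config.yml). Names and defaults may differ by version.
rateLimiter:
  windowMs: 60000   # length of the counting window, in milliseconds (1 minute)
  max: 500          # maximum requests allowed per client IP within each window
```

Lowering max is where the Captcha escape hatch suggested above would matter, so legitimate users who hit the limit aren't simply locked out.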
Improvements to robots.txt from 4Science
4Science has made some improvements to robots.txt which seem to benefit Google Scholar bots. They can share this back to DSpace.
This won't fix behavior of bad bots, but it still would make good bots more efficient. So, it'd still be great to add to DSpace.
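For context, the kinds of directives involved look roughly like the sketch below. This is not the 4Science version; the hostname is a placeholder and the exact paths shipped in the stock DSpace robots.txt vary by version. The general idea is to steer crawlers toward the sitemaps and away from expensive, effectively unbounded dynamic pages such as search and facet URLs.

```
# Illustrative sketch only -- not the 4Science changes; the hostname and
# paths are placeholders and differ from the stock file in any given release.

# Point crawlers at the sitemaps rather than letting them walk every page.
Sitemap: https://repo.example.edu/sitemap_index.xml

User-agent: *
# Keep crawlers out of expensive dynamic pages (search results, statistics).
Disallow: /search
Disallow: /statistics
```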
Sharing what others are doing that has shown a good benefit:
Paulo's team is using "several caching levels. Apache Cache+Varnish and DSpace" to provide a minimum level of service.
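For sites that want to try something similar, a very rough sketch of the Varnish layer might look like the following (VCL 4.0). This is not Paulo's actual configuration: the backend address and TTL are placeholders, and any real setup would need per-site tuning and careful handling of logged-in traffic. The idea is simply to let the cache absorb bursts of anonymous GET traffic so crawlers hit Varnish rather than DSpace itself.

```vcl
vcl 4.0;

# Placeholder backend: the dspace-angular SSR server (or Apache in front of it)
backend dspace_ui {
    .host = "127.0.0.1";
    .port = "4000";
}

sub vcl_recv {
    # Only cache anonymous, read-only traffic; pass everything else through.
    if (req.method != "GET" && req.method != "HEAD") {
        return (pass);
    }
    if (req.http.Authorization || req.http.Cookie) {
        return (pass);
    }
    return (hash);
}

sub vcl_backend_response {
    # Cache successful responses briefly so crawler bursts are served from
    # Varnish instead of hitting DSpace for every request.
    if (beresp.status == 200) {
        set beresp.ttl = 5m;
    }
}
```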
https://github.com/DSpace/dspace-angular/pull/4207 - This needs review from a user experience / user interface perspective. Agreed this is a bug, but not sure about adding a checkbox next to the search box.
https://github.com/DSpace/DSpace/pull/10742 - It's important to support XOAUTH, as SMTP password authentication is being retired. This PR seems like a good start, but it contains code specific to Google/GMail and likely won't work for other authentication providers. Tim will add feedback to the ticket, and Giuseppe volunteered 4Science to look at it in more detail.