...

Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics or similar) will notice that a significant share (in many cases a majority) of visitors arrive via Google or other search engines. Hence, to help maximize the impact of content and thus encourage further deposits, it is important to ensure that your DSpace instance is indexed effectively.

DSpace comes with tools that ensure major search engines (Google, Bing, Yahoo, Google Scholar) are able to easily and effectively index all your content. However, many of these tools require some basic setup.  Here's how to ensure your site is indexed.

For optimum indexing, you should:

  1. Check SEO Validator status to detect any obvious issues.
  2. Keep your DSpace up to date. We are constantly adding new indexing improvements in new releases.
  3. Ensure your DSpace is visible to search engines.
  4. Ensure your proxy is passing X-Forwarded headers to the User Interface.
  5. Ensure the user interface is using server-side rendering (enabled by default).
  6. Ensure the sitemaps feature is enabled (enabled by default).
  7. Ensure your robots.txt allows access to item "splash" pages and full text.
  8. Ensure item metadata appears in HTML headers correctly.
  9. Avoid redirecting file downloads to Item landing pages.
  10. Turn OFF any generation of PDF cover pages.
  11. As an aside, it's worth noting that OAI-PMH is generally not useful to search engines.  OAI-PMH has its own uses, but do not expect search engines to use it.

Check SEO Validator status

DSpace now has a basic Search Engine Optimization (SEO) validator which can give you feedback on how well your site aligns with some of the SEO policies described on this page.

At this time, this validation tool can only check three things:

  1. Your site is using server-side rendering (SSR).
  2. Your site has sitemaps enabled and they appear to be working.
  3. Your site has a robots.txt which links to your sitemaps.

This validation tool can be found in the Admin User Interface on the "Health" page.  Look for the section named "SEO".  If everything looks good, you'll see a green checkmark.

[Image: the "SEO" section of the "Health" page showing a green checkmark]

If issues are detected, you'll see a red warning with details on what needs to be addressed.

[Image: the "SEO" section of the "Health" page showing a red warning]

Use the guidance on this wiki page to address any detected issues.

Note

Even if you see a green checkmark on this page, you should still review all the Search Engine Optimization guidelines on this page.  As noted above, this validator cannot detect all possible SEO issues, so manual verification is still required.
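
As a manual complement to the validator's third check, the following minimal Python sketch lists the sitemap URLs that a robots.txt file declares. It relies on the standard library's `urllib.robotparser`. The sample robots.txt content and the `repo.example.org` host are made-up placeholders, and this is not how the DSpace validator itself is implemented.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no "Sitemap:" line is present
    return parser.site_maps() or []


# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
Sitemap: https://repo.example.org/sitemap_index.xml
User-agent: *
Disallow: /search
"""

if __name__ == "__main__":
    print(declared_sitemaps(SAMPLE_ROBOTS))
```

An empty result suggests that the robots.txt does not link to any sitemap, which is worth investigating even when the "Health" page looks fine.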

Keep your DSpace up to date

We are constantly adding new indexing improvements to DSpace.  In order to ensure your site gets all of these improvements, you should strive to keep it up-to-date. For example:

  • As of DSpace 9.0, 8.2, and 7.6.4, a basic SEO validator is provided on the "Health" page.
  • As of DSpace 7.0, Sitemaps are enabled by default (see below).
  • As of DSpace 5.0, the DSpace robots.txt file includes references to Sitemaps by default (see https://github.com/DSpace/DSpace/issues/5302), and also blocks known bad bots (see https://github.com/DSpace/DSpace/issues/5701).
  • As of DSpace 4.0, DSpace has provided several enhancements, which were requested by the Google Scholar team. These included providing users (and web indexers) a way to browse content by the date it was added to DSpace (see https://github.com/DSpace/DSpace/issues/4851), ensuring the "dc.date.issued" field is set more accurately (see https://github.com/DSpace/DSpace/issues/4850), and enhancing the logic behind the "citation_pdf_url" HTML <meta> tag (see https://github.com/DSpace/DSpace/issues/4852).
  • As of DSpace 1.7, DSpace has improved how its Item-level metadata is made available to Google Scholar. For the 1.7.0 release, the DSpace Developers worked directly with the Google Scholar developers to ensure DSpace is generating the "citation_*" HTML "<meta>" tags (i.e. Highwire Press tags) that Google Scholar recommends in their Indexing Guidelines.
  • As of DSpace 1.5, DSpace has support for sitemaps (both simple HTML pages of links, as well as the sitemaps.org protocol). It also includes item metadata in the HTML HEAD element of item display pages, ensuring that the metadata can be effectively indexed no matter what changes you might have made to your DSpace's layout or style.
  • As of DSpace 1.4, DSpace has support for the "if-modified-since" HTTP header. This means that if an item (or bitstream therein) has not changed since the last time a search engine's crawler indexed it, that item/bitstream does not have to be re-retrieved, sparing your server.
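
To illustrate the "if-modified-since" behavior mentioned in the last bullet, here is a minimal Python sketch of the decision a server makes for a conditional request. The function name `conditional_status` is hypothetical, and the code only illustrates the general HTTP mechanism; it is not taken from DSpace.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(if_modified_since: Optional[str], last_modified: datetime) -> int:
    """Return 304 if the resource is unchanged since the crawler's copy, else 200.

    last_modified is expected to be a timezone-aware datetime.
    """
    if if_modified_since is None:
        return 200  # no conditional header: send the full response
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: fall back to a full response
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so ignore microseconds
    return 304 if last_modified.replace(microsecond=0) <= since else 200
```

A 304 response carries no body, which is what spares the server from re-sending an unchanged item or bitstream.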

...

Ensure your proxy is passing X-Forwarded headers to the User Interface

...

Code Block
User-agent: * 
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search
Disallow: /admin/*
Disallow: /processes

This is not OK, as the two lines at the bottom will be completely ignored.

Code Block
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search

Disallow: /admin/*
Disallow: /processes

To identify if a specific user agent has access to a particular URL, you can use this handy robots.txt tester.

For more information on the robots.txt format, please see the Google Robots.txt documentation.
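
The effect of a stray blank line can be reproduced with Python's standard `urllib.robotparser`, which happens to treat a blank line as the end of a User-agent group. This is a minimal sketch: the `ExampleBot` user agent is a made-up name, and real crawlers may parse robots.txt differently from Python's parser (which, for instance, does not interpret the `*` wildcard in paths, so only `/search` and `/processes` are checked here).

```python
from urllib.robotparser import RobotFileParser

GOOD = """\
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search
Disallow: /admin/*
Disallow: /processes
"""

# Same rules, but with a blank line in the middle of the group
BAD = """\
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search

Disallow: /admin/*
Disallow: /processes
"""


def is_allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Report whether the given agent may fetch the path under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for label, text in (("GOOD", GOOD), ("BAD", BAD)):
        print(label, "/search:", is_allowed(text, "/search"),
              "/processes:", is_allowed(text, "/processes"))
```

With the blank line present, the rules after it no longer belong to any User-agent group, so `/processes` is reported as allowed even though the file appears to disallow it.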

Ensure Item Metadata appears in the HTML HEAD

It's possible to greatly customize the look and feel of your DSpace, which can make it harder for search engines, and for other tools and services such as Zotero, Connotea and SIMILE Piggy Bank, to correctly pick out item metadata fields. To address this, DSpace includes item metadata in the <head> element of each item's HTML display page.

Code Block
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" />
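
To verify that these tags actually reach crawlers, you can inspect the server-rendered HTML of an item page. The following minimal Python sketch extracts `<meta>` name/content pairs from the `<head>` of an HTML string using the standard library. The `head_meta_tags` helper and the sample markup are assumptions for illustration; in practice you would feed it the HTML you downloaded from your own site.

```python
from html.parser import HTMLParser


class HeadMetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.tags.append((attr["name"], attr["content"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def head_meta_tags(html: str) -> list[tuple[str, str]]:
    collector = HeadMetaCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Sample markup for illustration, reusing the two tags shown above
SAMPLE_HTML = """\
<html><head>
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" />
</head><body><meta name="ignored" content="x" /></body></html>
"""

if __name__ == "__main__":
    for name, content in head_meta_tags(SAMPLE_HTML):
        print(name, "=", content)
```

If the list comes back empty for a real item page fetched without JavaScript, that is a hint that server-side rendering may not be working.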

...

Google Scholar Metadata in HTML HEAD

...

These meta tags are the "Highwire Press tags" which Google Scholar recommends.  If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, you may do so by modifying https://github.com/DSpace/dspace-angular/blob/main/src/app/core/metadata/head-tag.service.ts (see for example the "setCitationAuthorTags()" method in that service class).
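
The idea of mapping repository metadata fields to Highwire Press tags can be illustrated with a short Python sketch. The `FIELD_TO_TAG` table below is a hypothetical example chosen for illustration. It is not DSpace's actual mapping, which lives in the TypeScript service linked above.

```python
from html import escape

# Hypothetical field-to-tag mapping (an assumption, not DSpace's real configuration)
FIELD_TO_TAG = {
    "dc.title": "citation_title",
    "dc.contributor.author": "citation_author",
    "dc.date.issued": "citation_publication_date",
}


def citation_meta_tags(metadata: dict[str, list[str]]) -> list[str]:
    """Render one <meta> tag per metadata value, in mapping order."""
    tags = []
    for field, tag_name in FIELD_TO_TAG.items():
        for value in metadata.get(field, []):
            # Escape quotes and angle brackets so the attribute stays valid HTML
            content = escape(value, quote=True)
            tags.append(f'<meta name="{tag_name}" content="{content}" />')
    return tags


if __name__ == "__main__":
    sample = {
        "dc.title": ['A "Test" Item'],
        "dc.contributor.author": ["Tansley, Robert", "Doe, Jane"],
    }
    print("\n".join(citation_meta_tags(sample)))
```

Repeating the tag once per value, as done here for authors, keeps each author in a separate `citation_author` tag.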

...