All Versions
DSpace Documentation
...
Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics or similar) will notice that a significant portion (and in many cases a majority) of visitors arrive via a search engine such as Google, Yahoo, or another search engine. Hence, to help maximize the impact of your content and thus encourage further deposits, it is important to ensure that your DSpace instance is indexed effectively.
DSpace comes with tools that ensure major search engines (Google, Bing, Yahoo, Google Scholar) are able to easily and effectively index all your content. However, many of these tools require some basic setup. Here's how to ensure your site is indexed.
| Info |
|---|
| DSpace now has a basic Search Engine Optimization (SEO) validator which can provide feedback on how well your site aligns with the SEO guidelines below. |
For the optimum indexing, you should:
Check SEO Validator status to detect any obvious issues
Ensure your proxy is passing X-Forwarded headers to the User Interface
...
DSpace now has a basic Search Engine Optimization (SEO) validator which can provide feedback on how well your site aligns with some of these Search Engine Optimization guidelines.
At this time, this validation tool can only check three things:
This validation tool can be found in the Admin User Interface on the "Health" page. Look for the section named "SEO". If everything looks good, you'll see a green checkmark for this section.
If issues are detected, you'll see a red warning with details on what needs to be addressed; use the documentation on this wiki page to resolve them.
| Note |
|---|
| Even if you see a green checkmark, you should still review all the Search Engine Optimization guidelines on this page. As noted above, this validator cannot detect all possible SEO issues, so manual verification is still required. |
We are constantly adding new indexing improvements to DSpace. In order to ensure your site gets all of these improvements, you should strive to keep it up-to-date. For example:
...
Additional minor improvements / bug fixes have been made to more recent releases of DSpace.
First ensure your DSpace instance is visible, e.g. with: https://www.google.com/webmasters/tools/sitestatus
If your site is not indexed at all, all search engines have a way to add your URL, e.g.:
Some HTML tags important for SEO, such as the "citation_pdf_url" tag, require the full URL of your site. The DSpace user interface will automatically attempt to "discover" that URL using HTTP Headers.
Because most DSpace sites use some sort of proxy (e.g. Apache web server or Nginx or similar), this requires that the proxy be configured to pass along proper X-Forwarded-* headers, especially X-Forwarded-Host and X-Forwarded-Proto. For example in Apache HTTPD, you can do something like this:
| Code Block |
|---|
# This lets DSpace know it is running behind HTTPS and what hostname is currently used
# (requires installing/enabling mod_headers)
RequestHeader set X-Forwarded-Proto https
RequestHeader set X-Forwarded-Host my.dspace.edu |
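If you are using Nginx as your proxy instead, a roughly equivalent configuration sketch looks like this (the proxy_pass address assumes the DSpace User Interface is running locally on port 4000; adjust it and the hostname to your own setup):
| Code Block |
|---|
# Inside the server block that proxies requests to the DSpace User Interface
location / {
    # Assumed address/port of the DSpace UI (Node SSR) process; adjust as needed
    proxy_pass http://localhost:4000;
    # Tell DSpace whether the original request used HTTP or HTTPS
    proxy_set_header X-Forwarded-Proto $scheme;
    # Tell DSpace which public hostname was requested
    proxy_set_header X-Forwarded-Host $host;
} |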
In DSpace, Angular's Server Side Rendering (SSR) feature is enabled by default (only when running in production mode). However, it's important to ensure you do not disable it in production mode, as most search engine bots cannot index your site if SSR is disabled. Per the frontend Installation instructions, you MUST also be running your user interface in production mode (via either npm run serve:ssr or npm start).
Because the DSpace user interface is based on Angular.io (which is a Javascript framework), you MUST have server-side rendering enabled (which is the default) for search engines to fully index your site. Server-side rendering allows your site to still function even when Javascript is turned off in a user's browser. Many web crawlers and bots do not support Javascript (e.g. Google Scholar), so they will only interact with this server-side rendered content.
If you are unsure if server-side rendering (SSR) is enabled, you can check to see if your site is accessible when Javascript is turned off. For example, in Chrome, you should be able to do the following:
...
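As a quick command-line alternative to the browser check above, you can also fetch an item page with curl (which, like most crawlers, does not execute Javascript) and confirm that item metadata appears in the raw HTML. This is only a sketch; the hostname and handle below are placeholders for a real item URL on your site:
| Code Block |
|---|
# Fetch an item page without executing Javascript, as a search engine bot would,
# and check that server-side rendered <meta> tags are present in the response.
# Replace my.dspace.edu and the handle with a real item URL from your repository.
curl -sL https://my.dspace.edu/handle/123456789/1 | grep -i '<meta name="citation_'
# If no <meta> tags are printed, server-side rendering is likely disabled or not working. |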
/bitstreams
/browse/* (UNLESS USING SITEMAPS)
/collections
/communities
/community-list (UNLESS USING SITEMAPS)
/entities/*
/handle
/items
DSpace 7 comes with an example robots.txt file (which is copied below). As of 7.5, this file can be found at "src/robots.txt.ejs" in the DSpace 7 UI. This is an "embedded javascript template" (ejs) file, which simply allows us to insert variable values into the "robots.txt" at runtime. It can be edited as a normal text file.
The highly recommended settings are uncommented. Additional, optional settings are displayed in comments – based on your local configuration you may wish to enable them by uncommenting the corresponding "Disallow:" line.
| Code Block |
|---|
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# NOTE: The <%= origin %> variables below will be replaced by the fully qualified URL of your site at runtime.
Sitemap: <%= origin %>/sitemap_index.xml
Sitemap: <%= origin %>/sitemap_index.html
##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes; submission; workspace; workflow & profile page
Disallow: /search
Disallow: /admin/*
Disallow: /processes
Disallow: /submit
Disallow: /workspaceitems
Disallow: /profile
Disallow: /workflowitems
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse/*
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
# NOTE: The default robots.txt also includes a large number of recommended settings to avoid misbehaving bots.
# For brevity, they have been removed from this example, but can be found in src/robots.txt.ejs |
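For example, assuming your site runs at https://my.dspace.edu (a placeholder hostname), the robots.txt served at runtime would have the <%= origin %> variables filled in, beginning roughly like this:
| Code Block |
|---|
Sitemap: https://my.dspace.edu/sitemap_index.xml
Sitemap: https://my.dspace.edu/sitemap_index.html
User-agent: *
Disallow: /search
# ...remaining Disallow rules as shown above... |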
WARNING: for your additional Disallow statements to be recognized under the User-agent: * group, they cannot be separated by blank lines from the declared User-agent: * block. A blank line indicates the start of a new user-agent block. Without a leading User-agent declaration on the first line, blocks are ignored. Comment lines are allowed and will not break the user-agent block.
This is OK:
| Code Block |
|---|
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search
Disallow: /admin/*
Disallow: /processes |
This is not OK, as the blank line splits the group and the two Disallow lines at the bottom will be completely ignored.
| Code Block |
|---|
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search

Disallow: /admin/*
Disallow: /processes |
To identify if a specific user agent has access to a particular URL, you can use a robots.txt testing tool.
For more information on the robots.txt format, please see the Google Robots.txt documentation.
It's possible to greatly customize the look and feel of your DSpace site, which can make it harder for search engines and other tools and services (such as Zotero, Connotea and SIMILE Piggy Bank) to correctly pick out item metadata fields. To address this, DSpace includes item metadata in the <head> element of each item's HTML display page.
| Code Block |
|---|
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" /> |
...
These meta tags are the "Highwire Press tags" which Google Scholar recommends. If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, you may do so by modifying https://github.com/DSpace/dspace-angular/blob/main/src/app/core/metadata/metadatahead-tag.service.ts (see, for example, the "setCitationAuthorTags()" method in that service class).
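For reference, the Highwire Press tags on an item page take roughly the following form (an illustrative sketch only: citation_pdf_url is discussed above, the other tag names are common Highwire Press tags, and all values and URLs below are placeholders that depend on your items' metadata):
| Code Block |
|---|
<meta name="citation_title" content="Writing DSpace Documentation" />
<meta name="citation_author" content="Tansley, Robert" />
<meta name="citation_publication_date" content="2023" />
<meta name="citation_pdf_url" content="https://my.dspace.edu/bitstreams/<bitstream-uuid>/download" /> |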
...