All Versions
DSpace Documentation
...
Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics or similar) will notice that a significant portion (and in many cases a majority) of visitors arrive via a search engine such as Google, Yahoo, or another search engine. Hence, to help maximize the impact of your content and thus encourage further deposits, it is important to ensure that your DSpace instance is indexed effectively.
DSpace comes with tools that ensure major search engines (Google, Bing, Yahoo, Google Scholar) are able to easily and effectively index all your content. However, many of these tools require some basic setup. Here's how to ensure your site is indexed.
| Info |
|---|
| DSpace now has a basic Search Engine Optimization (SEO) validator which can provide feedback on how well your site aligns with the SEO guidelines below. |
For the optimum indexing, you should:
Check SEO Validator status to detect any obvious issues
Ensure your proxy is passing X-Forwarded headers to the User Interface
...
DSpace now has a basic Search Engine Optimization (SEO) validator which can provide feedback on how well your site aligns with some of these Search Engine Optimization guidelines.
At this time, this validation tool can only check three things:
This validation tool can be found in the Admin User Interface on the "Health" page. Look for the section named "SEO". If everything looks good, you'll see a green checkmark for this section.
If issues are detected, you'll see a red warning with details on what needs to be addressed; use the documentation on this wiki page to resolve them.
| Note |
|---|
| Even if you see a green checkmark, you should still review all the Search Engine Optimization guidelines on this page. As noted above, this validator cannot detect all possible SEO issues, so manual verification is still required. |
We are constantly adding new indexing improvements to DSpace. In order to ensure your site gets all of these improvements, you should strive to keep it up-to-date. For example:
...
Additional minor improvements / bug fixes have been made to more recent releases of DSpace.
First ensure your DSpace instance is visible, e.g. with: https://www.google.com/webmasters/tools/sitestatus
If your site is not indexed at all, all search engines have a way to add your URL, e.g.:
Some HTML tags important for SEO, such as the "citation_pdf_url" tag, require the full URL of your site. The DSpace user interface will automatically attempt to "discover" that URL using HTTP Headers.
Because most DSpace sites use some sort of proxy (e.g. Apache web server or Nginx or similar), this requires that the proxy be configured to pass along proper X-Forwarded-* headers, especially X-Forwarded-Host and X-Forwarded-Proto. For example in Apache HTTPD, you can do something like this:
| Code Block |
|---|
# This lets DSpace know it is running behind HTTPS and what hostname is currently used
# (requires installing/enabling mod_headers)
RequestHeader set X-Forwarded-Proto https
RequestHeader set X-Forwarded-Host my.dspace.edu |
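If you are using Nginx as your proxy instead, a roughly equivalent configuration sketch looks like this (the proxy_pass address assumes the DSpace User Interface is running locally on port 4000; adjust it and the hostname to your own setup):
| Code Block |
|---|
# Inside the server block that proxies requests to the DSpace User Interface
location / {
    # Assumed address/port of the DSpace UI (Node SSR) process; adjust as needed
    proxy_pass http://localhost:4000;
    # Tell DSpace whether the original request used HTTP or HTTPS
    proxy_set_header X-Forwarded-Proto $scheme;
    # Tell DSpace which public hostname was requested
    proxy_set_header X-Forwarded-Host $host;
} |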
In DSpace, Angular's Server Side Rendering (SSR) feature is enabled by default (only when running in production mode). However, it's important to ensure you do not disable it in production mode, as most search engine bots cannot index your site if SSR is disabled. Per the frontend Installation instructions, you MUST also be running your user interface in production mode (via either npm run serve:ssr or npm start).
Because the DSpace user interface is based on Angular.io (which is a Javascript framework), you MUST have server-side rendering enabled (which is the default) for search engines to fully index your site. Server-side rendering allows your site to still function even when Javascript is turned off in a user's browser. Many web crawlers and bots do not support Javascript (e.g. Google Scholar), so they will only interact with this server-side rendered content.
If you are unsure if server-side rendering (SSR) is enabled, you can check to see if your site is accessible when Javascript is turned off. For example, in Chrome, you should be able to do the following:
...
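As a quick command-line alternative to the browser check above, you can also fetch an item page with curl (which, like most crawlers, does not execute Javascript) and confirm that item metadata appears in the raw HTML. This is only a sketch; the hostname and handle below are placeholders for a real item URL on your site:
| Code Block |
|---|
# Fetch an item page without executing Javascript, as a search engine bot would,
# and check that server-side rendered <meta> tags are present in the response.
# Replace my.dspace.edu and the handle with a real item URL from your repository.
curl -sL https://my.dspace.edu/handle/123456789/1 | grep -i '<meta name="citation_'
# If no <meta> tags are printed, server-side rendering is likely disabled or not working. |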
/bitstreams
/browse/* (UNLESS USING SITEMAPS)
/collections
/communities
/community-list (UNLESS USING SITEMAPS)
/entities/*
/handle
/items
DSpace 7 comes with an example robots.txt file (which is copied below). As of 7.5, this file can be found at "src/robots.txt.ejs" in the DSpace 7 UI. This is an "embedded javascript template" (ejs) file, which simply allows us to insert variable values into the "robots.txt" at runtime. It can be edited as a normal text file.
The highly recommended settings are uncommented. Additional, optional settings are displayed in comments – based on your local configuration you may wish to enable them by uncommenting the corresponding "Disallow:" line.
| Code Block |
|---|
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# NOTE: The <%= origin %> variables below will be replaced by the fully qualified URL of your site at runtime.
Sitemap: <%= origin %>/sitemap_index.xml
Sitemap: <%= origin %>/sitemap_index.html
##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes; submission; workspace; workflow & profile page
Disallow: /search
Disallow: /admin/*
Disallow: /processes
Disallow: /submit
Disallow: /workspaceitems
Disallow: /profile
Disallow: /workflowitems
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse/*
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
# NOTE: The default robots.txt also includes a large number of recommended settings to avoid misbehaving bots.
# For brevity, they have been removed from this example, but can be found in src/robots.txt.ejs |
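For example, assuming your site runs at https://my.dspace.edu (a placeholder hostname), the robots.txt served at runtime would have the <%= origin %> variables filled in, beginning roughly like this:
| Code Block |
|---|
Sitemap: https://my.dspace.edu/sitemap_index.xml
Sitemap: https://my.dspace.edu/sitemap_index.html
User-agent: *
Disallow: /search
# ...remaining Disallow rules as shown above... |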
WARNING: for your additional Disallow statements to be recognized under the User-agent: * group, they cannot be separated by blank lines from the declared User-agent: * block. A blank line indicates the start of a new user-agent block. Without a leading User-agent declaration on the first line, blocks are ignored. Comment lines are allowed and will not break the user-agent block.
This is OK:
| Code Block |
|---|
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search
Disallow: /admin/*
Disallow: /processes |
This is not OK, as the blank line splits the group and the two Disallow lines at the bottom will be completely ignored.
| Code Block |
|---|
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search

Disallow: /admin/*
Disallow: /processes |
To identify if a specific user agent has access to a particular URL, you can use a robots.txt testing tool.
For more information on the robots.txt format, please see the Google Robots.txt documentation.
It's possible to greatly customize the look and feel of your DSpace site, which can make it harder for search engines and other tools and services (such as Zotero, Connotea and SIMILE Piggy Bank) to correctly pick out item metadata fields. To address this, DSpace includes item metadata in the <head> element of each item's HTML display page.
| Code Block |
|---|
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" /> |
...
These meta tags are the "Highwire Press tags" which Google Scholar recommends. If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, you may do so by modifying https://github.com/DSpace/dspace-angular/blob/main/src/app/core/metadata/metadatahead-tag.service.ts (see, for example, the "setCitationAuthorTags()" method in that service class).
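For reference, the Highwire Press tags on an item page take roughly the following form (an illustrative sketch only: citation_pdf_url is discussed above, the other tag names are common Highwire Press tags, and all values and URLs below are placeholders that depend on your items' metadata):
| Code Block |
|---|
<meta name="citation_title" content="Writing DSpace Documentation" />
<meta name="citation_author" content="Tansley, Robert" />
<meta name="citation_publication_date" content="2023" />
<meta name="citation_pdf_url" content="https://my.dspace.edu/bitstreams/<bitstream-uuid>/download" /> |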
...