...
Ensure your proxy is passing X-Forwarded headers to the User Interface
...
...
...
Some HTML tags important for SEO, such as the "citation_pdf_url" tag, require the full URL of your site. The DSpace user interface automatically attempts to "discover" that URL using HTTP headers.
Because most DSpace sites sit behind some sort of proxy (e.g. Apache HTTP Server, Nginx, or similar), the proxy must be configured to pass along the proper X-Forwarded-* headers, especially X-Forwarded-Host and X-Forwarded-Proto. For example, in Apache HTTPD you can do something like this:
Code Block |
---|
# This lets DSpace know it is running behind HTTPS and what hostname is currently used
# (requires installing/enabling mod_headers)
RequestHeader set X-Forwarded-Proto https
RequestHeader set X-Forwarded-Host my.dspace.edu |
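If you use Nginx as your proxy instead, the equivalent headers can be set with proxy_set_header. This is an illustrative sketch only (the upstream address and port are assumptions; adjust them to match your local setup):
Code Block |
---|
# Inside the HTTPS server { } block:
location / {
    # Lets DSpace know it is running behind HTTPS and what hostname is currently used
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-Host $host;
    proxy_pass http://localhost:4000;
} |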
In DSpace 7, server-side rendering is enabled by default when running in production mode; it's important to ensure you do not disable it in production.
Because the DSpace user interface is based on Angular (a JavaScript framework), you MUST have server-side rendering enabled (which is the default) for search engines to fully index your site. Server-side rendering allows your site to still function even when JavaScript is turned off in a user's browser. Some web crawlers do not support JavaScript (e.g. Google Scholar), so they will only interact with this server-side rendered content.
DSpace uses Angular Universal for server-side rendering. It is enabled by default in production mode via this configuration in environment.common.ts:
Code Block |
---|
// Angular Universal Settings
universal: {
preboot: true,
...
}, |
Per the frontend Installation instructions, you must also be running your production frontend/UI via either yarn run serve:ssr or yarn start.
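You can quickly verify that server-side rendering is working by requesting a page with a client that does not execute JavaScript, such as curl. (The hostname below is a placeholder; substitute your own site's URL.) If SSR is enabled, the response contains fully rendered HTML; if it is disabled, the <body> will be nearly empty apart from script tags:
Code Block |
---|
# Fetch the homepage without executing any JavaScript.
# With SSR enabled, you should see rendered page content (headings, links, item titles).
curl -s https://my.dspace.edu/ | less |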
As of DSpace 7, sitemaps are enabled by default and automatically update on a daily basis. This is the recommended setup to ensure proper indexing, so there's nothing you need to do unless you wish to change their schedule or disable them.
In the dspace.cfg, the sitemap generation schedule is controlled by this setting:
Code Block |
---|
# By default, sitemaps regenerate daily at 1:15am server time
sitemap.cron = 0 15 1 * * ? |
You can modify this schedule using the cron syntax defined at https://www.quartz-scheduler.org/api/2.3.0/org/quartz/CronTrigger.html. Any modifications can be placed in your local.cfg.
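For example, to regenerate sitemaps weekly instead of daily (an illustrative schedule; choose whatever frequency suits your site), you could add this to your local.cfg. Note that Quartz cron expressions have six required fields (seconds, minutes, hours, day-of-month, month, day-of-week), unlike standard Unix cron:
Code Block |
---|
# Regenerate sitemaps at 2:00am every Sunday
# (Quartz fields: sec min hour day-of-month month day-of-week)
sitemap.cron = 0 0 2 ? * SUN |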
If you want to disable this automated scheduler, you can either comment it out or set it to a single "-" (dash) in your local.cfg:
Code Block |
---|
# This disables the automatic updates
sitemap.cron = - |
Again, we highly recommend keeping sitemaps enabled. However, you may choose to disable this scheduler if you prefer to define the schedule in your local system cron settings.
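If you take that route, a system crontab entry like the following achieves the same daily 1:15am regeneration ([dspace] is the usual installation-path placeholder; run this as the DSpace Unix user):
Code Block |
---|
# m  h  dom mon dow  command
15   1  *   *   *    [dspace]/bin/dspace generate-sitemaps |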
Once you've enabled your sitemaps, they will be accessible at the following URLs:
- XML Sitemap: [dspace.ui.url]/sitemap_index.xml
- HTML Sitemap: [dspace.ui.url]/sitemap_index.html
So, for example, if "dspace.ui.url = https://mysite.org" in your "dspace.cfg" configuration file, then the HTML sitemap would be at "https://mysite.org/sitemap_index.html".
By default, the sitemap URLs will also appear in your UI's robots.txt (in order to announce them to search engines):
Code Block |
---|
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
Sitemap: [dspace.ui.url]/sitemap_index.xml
Sitemap: [dspace.ui.url]/sitemap_index.html |
If you want to generate your sitemaps manually, you can use a commandline tool to do so.
WARNING: Keep in mind, you do NOT need to run this manually in most situations, as sitemaps are auto-updated on a regular schedule (see documentation above).
Code Block |
---|
# Commandline option (run from the backend)
[dspace]/bin/dspace generate-sitemaps |
This command accepts several options:
Option | Meaning |
---|---|
-h --help | Explain the arguments and options. |
-s --no_sitemaps | Do not generate a sitemap in sitemaps.org format. |
-b --no_htmlmap | Do not generate a sitemap in htmlmap format. |
-a --ping_all | Ping all search engines configured in dspace.cfg after generation. |
-p URL --ping URL | Ping the given search engine URL after generation. |
You can configure the list of "all search engines" by setting the value of sitemap.engineurls in dspace.cfg.
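For example, to regenerate both sitemap formats and then notify every search engine configured in sitemap.engineurls (assuming the usual [dspace] installation path):
Code Block |
---|
[dspace]/bin/dspace generate-sitemaps -a |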
As of 7.5, DSpace's robots.txt file can be found in the UI's codebase at "src/robots.txt.ejs". This is an "embedded javascript template" (ejs) file, which simply allows us to insert variable values into the "robots.txt" at runtime. It can be edited as a normal text file.
The trick here is to minimize load on your server without actually blocking anything vital for indexing. Search engines need to be able to index item, collection and community pages, and all bitstreams within items – full-text access is critically important for effective indexing, e.g. for citation analysis as well as the usual keyword searching.
If you have restricted content on your site, search engines will not be able to access it; they access all pages as an anonymous user.
Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/dspace/robots.txt. If your DSpace instance is served from e.g. http://repo.foo.edu/dspace/, you'll need to add /dspace to all the paths in the examples below (e.g. /dspace/browse-subject).
Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.
- /bitstreams
- /browse/* (UNLESS USING SITEMAPS)
- /collections
- /communities
- /community-list (UNLESS USING SITEMAPS)
- /entities/*
- /handle
- /items
DSpace 7 comes with an example robots.txt file, which is copied below.
The highly recommended settings are uncommented. Additional, optional settings are displayed in comments – based on your local configuration you may wish to enable them by uncommenting the corresponding "Disallow:" line.
Code Block |
---|
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# NOTE: The <%= origin %> variables below will be replaced by the fully qualified URL of your site at runtime.
Sitemap: <%= origin %>/sitemap_index.xml
Sitemap: <%= origin %>/sitemap_index.html
##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes; submission; workspace; workflow & profile page
Disallow: /search
Disallow: /admin/*
Disallow: /processes
Disallow: /submit
Disallow: /workspaceitems
Disallow: /profile
Disallow: /workflowitems
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse/*
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
# NOTE: The default robots.txt also includes a large number of recommended settings to avoid misbehaving bots.
# For brevity, they have been removed from this example, but can be found in src/robots.txt.ejs |
WARNING: For your additional Disallow statements to be recognized under the User-agent: * group, they cannot be separated by blank lines from the User-agent: * block. A blank line indicates the start of a new user-agent record, and records without a leading User-agent declaration on their first line are ignored. Comment lines are allowed and will not break the user-agent block.
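As an illustration of this pitfall, in the following (deliberately broken) example the Disallow: /statistics rule is ignored by crawlers, because the blank line above it ended the User-agent: * record:
Code Block |
---|
User-agent: *
Disallow: /search

# The blank line above started a new record, so this rule is IGNORED:
Disallow: /statistics |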
...
If you have these URL redirects in place, it is highly recommended to remove them immediately. If you created these redirects to facilitate capturing download statistics in Google Analytics, you should consider upgrading to DSpace 5.0 or above, which can automatically record bitstream downloads in Google Analytics (see DS-2088 / https://github.com/DSpace/DSpace/issues/5454) without the need for any URL redirects.
...