...

  1. Keep your DSpace up to date. We are constantly adding new indexing improvements in new releases.
  2. Ensure your DSpace is visible to search engines.
  3. Enable the sitemaps feature – this does not require, for example, registering with Google Webmaster Tools.
  4. Ensure your robots.txt allows access to item "splash" pages and full text.
  5. Ensure item metadata appears in HTML headers correctly.
  6. Avoid redirecting file downloads to Item landing pages.
  7. Turn OFF any generation of PDF cover pages.
  8. As an aside, it's worth noting that OAI-PMH is generally not useful to search engines. OAI-PMH has its own uses, but do not expect search engines to use it.

 


Keep your DSpace up to date

...

Enable the sitemaps feature

DSpace provides a sitemap feature that we highly recommend you enable to ensure proper indexing.  Sitemaps allow DSpace to expose its content in a way that makes it easily accessible to search engine crawlers.  Sitemaps also help ensure that crawlers do NOT have to visit every page in your DSpace (which means the crawlers can get in and get out quickly, without taxing your site).  Without sitemaps, search engine indexing activity may impose significant loads on your repository.

HTML sitemaps provide a list of all items, collections and communities in HTML format, whilst Google sitemaps provide the same information in gzipped XML format.
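
For reference, the XML flavour follows the sitemaps.org protocol: a sitemap index points at one or more (possibly gzipped) sitemap files. A sketch of what such an index looks like (URLs, file names and dates are purely illustrative):

Code Block
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://mysite.org/sitemap0.xml.gz</loc>
    <lastmod>2023-01-01T01:15:00+00:00</lastmod>
  </sitemap>
</sitemapindex>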

In older DSpace versions (prior to 7.x), enabling sitemaps required running [dspace]/bin/dspace generate-sitemaps once a day yourself, typically via a cron job (or scheduled task in Windows), e.g. (cron):

...

As of DSpace 7, sitemaps are enabled by default and automatically update on a daily basis. This is the recommended setup to ensure proper indexing. So, there's nothing you need to do unless you wish to either change their schedule or disable them.

In dspace.cfg, the sitemap generation schedule is controlled by this setting:

Code Block
# By default, sitemaps regenerate daily at 1:15am server time
sitemap.cron = 0 15 1 * * ?

You can modify this schedule by using the Cron syntax defined at https://www.quartz-scheduler.org/api/2.3.0/org/quartz/CronTrigger.html .  Any modifications can be placed in your local.cfg.
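
For example, a minimal local.cfg override (the time chosen below is purely illustrative) that regenerates sitemaps at 3:30am server time:

Code Block
# Illustrative local.cfg override: regenerate sitemaps daily at 3:30am server time
# (Quartz cron fields: second minute hour day-of-month month day-of-week)
sitemap.cron = 0 30 3 * * ?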

If you want to disable this automated scheduler, you can either comment it out, or set it to a single "-" (dash) in your local.cfg:

Code Block
# This disables the automatic updates
sitemap.cron = -

Again, we highly recommend keeping sitemaps enabled.  However, you may choose to disable this scheduler if you prefer to schedule sitemap generation yourself via your local system cron settings.
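
If you take that route, a system crontab entry along these lines (the schedule is illustrative, and [dspace] stands for your backend installation path) regenerates the sitemaps once a day:

Code Block
# Illustrative crontab entry: regenerate DSpace sitemaps daily at 1:15am
15 1 * * * [dspace]/bin/dspace generate-sitemaps > /dev/null 2>&1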

...

Once you've enabled your sitemaps, they will be accessible at the following URLs:

  • HTML Sitemaps: ${dspace.ui.url}/sitemap_index.html
  • XML Sitemaps: ${dspace.ui.url}/sitemap_index.xml
  • XML Sitemaps / Sitemaps.org syntax (older DSpace versions): [dspace.url]/sitemap
  • HTML Sitemaps (older DSpace versions): [dspace.url]/htmlmap

So, for example, if "dspace.ui.url = https://mysite.org" is set in your configuration, then the HTML sitemap index would be at: "https://mysite.org/sitemap_index.html"
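
You can quickly confirm that the sitemaps are being served, for example (hostname illustrative):

Code Block
# Fetch the XML sitemap index headers to confirm it is reachable
curl -I https://mysite.org/sitemap_index.xml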


Make your sitemap discoverable to search engines

Even if you've enabled your sitemaps, search engines may not be able to find them unless you provide them with a link.  There are two main ways to notify a search engine of your sitemaps:

  1. Provide a hidden link to the sitemaps in your DSpace's homepage. If you've customized your site's look and feel (as most have), ensure that there is a link to /htmlmap in your DSpace's front or home page. By default, both the JSPUI and XMLUI provide this link in the footer:

    Code Block
    <a href="/htmlmap"></a>
  2. Announce your sitemap in your robots.txt.  Most major search engines will also automatically discover your sitemap if you announce it in your robots.txt file. By default, both the JSPUI and XMLUI provide these references in their robots.txt file. For example:

    Code Block
    # The FULL URL to the DSpace sitemaps
    # XML sitemap is listed first as it is preferred by most search engines
    # Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
    Sitemap: [dspace.url]/sitemap
    Sitemap: [dspace.url]/htmlmap
    1. These "Sitemap:" lines can be placed anywhere in your robots.txt file. You can also specify multiple "Sitemap:" lines, so that search engines can locate both formats. For more information, see: http://www.sitemaps.org/protocol.html#informing
    2. Be sure to include the FULL URL in the "Sitemap:" line. Relative paths are not supported.

sitemap_index.html"

By default, the Sitemap URLs will also appear in your UI's robots.txt (in order to announce them to search engines):

Code Block
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
Sitemap: /sitemap_index.xml
Sitemap: /sitemap_index.html
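
To confirm what your UI is actually serving (hostname illustrative):

Code Block
# Check that the Sitemap: lines are present in the robots.txt served by the UI
curl https://mysite.org/robots.txt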

The generate-sitemaps command

If you want to generate your sitemaps manually, you can use a command-line tool to do so.

WARNING: Keep in mind, you do NOT need to run these manually in most situations, as sitemaps are auto-updated on a regular schedule (see documentation above).

Code Block
# Commandline option (run from the backend)
[dspace]/bin/dspace generate-sitemaps

This command accepts several options:

  • -h, --help: Explain the arguments and options.
  • -s, --no_sitemaps: Do not generate a sitemap in sitemaps.org format.
  • -b, --no_htmlmap: Do not generate a sitemap in htmlmap format.
  • -a, --ping_all: Notify all configured search engines that new sitemaps are available.
  • -p URL, --ping URL: Notify the given URL that new sitemaps are available. The URL of the new sitemap will be appended to the value of URL.
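
For example, to regenerate the sitemaps and notify every configured search engine in one step (an illustrative invocation, run from the backend):

Code Block
# Regenerate sitemaps and ping the engines listed in sitemap.engineurls
[dspace]/bin/dspace generate-sitemaps --ping_all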

You can configure the list of "all search engines" by setting the value of sitemap.engineurls in dspace.cfg.

Search engines will now look at your XML and HTML sitemaps, which are pre-generated (and thus served with minimal impact on your hardware) and link directly to items, collections and communities in your DSpace instance. Crawlers will not have to work their way through any browse screens, which are intended more for human consumption and are more expensive for the server.

Create a good robots.txt

The trick here is to minimize load on your server, but without actually blocking anything vital for indexing. Search engines need to be able to index item, collection and community pages, and all bitstreams within items – full-text access is critically important for effective indexing, e.g. for citation analysis as well as the usual keyword searching.

...

Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.

  • /bitstream
  • /browse  (UNLESS USING SITEMAPS)
  • /*/browse (UNLESS USING SITEMAPS)
  • /browse-date (UNLESS USING SITEMAPS)
  • /*/browse-date (UNLESS USING SITEMAPS)
  • /community-list (UNLESS USING SITEMAPS)
  • /handle
  • /html
  • /htmlmap
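
As an illustration only (not a drop-in file): with sitemaps enabled, a robots.txt along these lines keeps items, bitstreams and handles crawlable while steering crawlers away from the expensive dynamic browse pages (the hostname and exact paths are illustrative; adjust them to your UI):

Code Block
User-agent: *
# Safe to disallow ONLY because sitemaps are enabled (see the list above)
Disallow: /browse
Disallow: /community-list
# Announce the sitemaps using FULL URLs
Sitemap: https://mysite.org/sitemap_index.xml
Sitemap: https://mysite.org/sitemap_index.html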

...