Info

Please be aware that individual search engines also have their own guidelines and recommendations for inclusion. While the guidelines below apply to most DSpace sites, you may also wish to review these guidelines for specific search engines:

Ensuring your DSpace is indexed

Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics or similar) will notice that a significant proportion (and in many cases a majority) of visitors arrive via a search engine such as Google or Yahoo. Hence, to help maximize the impact of content and thus encourage further deposits, it is important to ensure that your DSpace instance is indexed effectively.

DSpace comes with tools that ensure major search engines (Google, Bing, Yahoo, Google Scholar) are able to easily and effectively index all your content. However, many of these tools require some basic setup. Here's how to ensure your site is indexed.

For optimum indexing, you should:

  1. Keep your DSpace up to date. We are constantly adding new indexing improvements in new releases.
  2. Ensure your DSpace is visible to search engines.
  3. Enable the sitemaps feature – this does not require e.g. registering with Google Webmaster tools.
  4. Ensure your robots.txt allows access to item "splash" pages and full text.
  5. Ensure item metadata appears in HTML headers correctly.
  6. Avoid redirecting file downloads to Item landing pages.
  7. Turn OFF any generation of PDF cover pages.
  8. As an aside, it's worth noting that OAI-PMH is generally not useful to search engines. OAI-PMH has its own uses, but do not expect search engines to use it.

 

Keep your DSpace up to date

...

So, for example, if your "dspace.url = http://mysite.org/xmlui" in your "dspace.cfg" configuration file, then the HTML Sitemaps would be at: "http://mysite.org/xmlui/htmlmap"
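As a small illustration of the URL pattern above, the following Python sketch derives both sitemap locations from a dspace.url value. The function name sitemap_urls is hypothetical, and the /sitemap path for the XML format is taken from the robots.txt example on this page; treat it as a sketch rather than DSpace's own logic.

```python
def sitemap_urls(dspace_url: str) -> dict:
    """Return the XML and HTML sitemap URLs for a given 'dspace.url' value.

    Hypothetical helper for illustration only; it is not part of DSpace.
    """
    # Strip a trailing slash so the result never contains "//sitemap".
    base = dspace_url.rstrip("/")
    return {
        "xml": base + "/sitemap",   # sitemaps.org format
        "html": base + "/htmlmap",  # HTML format
    }


if __name__ == "__main__":
    for kind, url in sitemap_urls("http://mysite.org/xmlui").items():
        print(kind, url)
```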

The generate-sitemaps command

This command accepts several options:

Option               Meaning
-h, --help           Explain the arguments and options.
-s, --no_sitemaps    Do not generate a sitemap in sitemaps.org format.
-b, --no_htmlmap     Do not generate a sitemap in htmlmap format.
-a, --ping_all       Notify all configured search engines that new sitemaps are available.
-p URL, --ping URL   Notify the given URL that new sitemaps are available. The URL of the new sitemap will be appended to the value of URL.

You can configure the list of "all search engines" by setting the value of sitemap.engineurls in dspace.cfg.
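To make the "appended to the value of URL" behaviour more concrete, here is a minimal Python sketch of how a ping URL could be assembled. The engine URL below is a made-up placeholder, the function name build_ping_url is hypothetical, and percent-encoding the sitemap URL is an assumption about how the value should be passed, not a statement about DSpace's implementation.

```python
from urllib.parse import quote


def build_ping_url(engine_url: str, sitemap_url: str) -> str:
    """Append the sitemap URL to a search engine's ping URL.

    Assumption: the sitemap URL is percent-encoded so that it survives
    as a single query-string value.
    """
    return engine_url + quote(sitemap_url, safe="")


if __name__ == "__main__":
    # Hypothetical ping endpoint; use the value your search engine documents.
    engine = "http://search.example.org/ping?sitemap="
    print(build_ping_url(engine, "http://mysite.org/xmlui/sitemap"))
```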

Make your sitemap discoverable to search engines

Even if you've enabled your sitemaps, search engines may not be able to find them unless you provide them with a link.  There are two main ways to notify a search engine of your sitemaps:

  1. Provide a hidden link to the sitemaps in your DSpace's homepage. If you've customized your site's look and feel (as most have), ensure that there is a link to /htmlmap in your DSpace's front or home page. By default, both the JSPUI and XMLUI provide this link in the footer:

    Code Block
    <a href="/htmlmap"></a>


  2. Announce your sitemap in your robots.txt.  Most major search engines will also automatically discover your sitemap if you announce it in your robots.txt file. By default, both the JSPUI and XMLUI provide these references in their robots.txt file. For example:

    Code Block
    # The FULL URL to the DSpace sitemaps
    # XML sitemap is listed first as it is preferred by most search engines
    # Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
    Sitemap: [dspace.url]/sitemap
    Sitemap: [dspace.url]/htmlmap
    1. These "Sitemap:" lines can be placed anywhere in your robots.txt file. You can also specify multiple "Sitemap:" lines, so that search engines can locate both formats. For more information, see: http://www.sitemaps.org/protocol.html#informing
    2. Be sure to include the FULL URL in the "Sitemap:" line. Relative paths are not supported.
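One way to double-check the two notes above is to parse your robots.txt and confirm that every announced sitemap is a full URL. The sketch below uses Python's standard urllib.robotparser module (its site_maps() method requires Python 3.8 or later). The sample robots.txt content and the helper names are assumptions made for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Sample robots.txt, assuming dspace.url = http://mysite.org/xmlui
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://mysite.org/xmlui/sitemap
Sitemap: http://mysite.org/xmlui/htmlmap
"""


def announced_sitemaps(robots_txt: str) -> list:
    """Return every URL listed on a 'Sitemap:' line, in file order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None if none are listed


def is_full_url(url: str) -> bool:
    """True only for absolute http(s) URLs; relative paths are rejected."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for url in announced_sitemaps(ROBOTS_TXT):
        print(url, "OK" if is_full_url(url) else "NOT a full URL")
```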

Search engines will now look at your XML and HTML sitemaps. These are pre-generated XML or HTML files (and thus served with minimal impact on your hardware) that link directly to items, collections and communities in your DSpace instance. Crawlers will not have to work their way through any browse screens, which are intended more for human consumption and are more expensive for the server.

Create a good robots.txt

The trick here is to minimize load on your server, but without actually blocking anything vital for indexing. Search engines need to be able to index item, collection and community pages, and all bitstreams within items – full-text access is critically important for effective indexing, e.g. for citation analysis as well as the usual keyword searching.

If you have restricted content on your site, search engines will not be able to access it; they access all pages as an anonymous user.

Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/dspace/robots.txt. If your DSpace instance is served from e.g. http://repo.foo.edu/dspace/, you'll need to add /dspace to all the paths in the examples below (e.g. /dspace/browse-subject).

NEVER BLOCK THESE PATHS

Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.

  • /bitstream
  • /browse  (UNLESS USING SITEMAPS)
  • /*/browse (UNLESS USING SITEMAPS)
  • /browse-date (UNLESS USING SITEMAPS)
  • /*/browse-date (UNLESS USING SITEMAPS)
  • /community-list (UNLESS USING SITEMAPS)
  • /handle
  • /html
  • /htmlmap
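As a rough self-check, the paths above can be tested against a robots.txt with Python's standard urllib.robotparser module. This is only a sketch: the sample robots.txt files are invented for illustration (including the /search rule), the helper name blocked_paths is hypothetical, and the standard-library parser does plain prefix matching, so the wildcard entries such as /*/browse are left out of the check.

```python
from urllib.robotparser import RobotFileParser

# Paths from the list above that must always remain crawlable.
# Wildcard entries are omitted because this parser has no wildcard support.
MUST_STAY_CRAWLABLE = ["/bitstream", "/handle", "/html", "/htmlmap"]


def blocked_paths(robots_txt: str, paths, user_agent: str = "*") -> list:
    """Return the subset of paths that the given robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical examples: the first wrongly blocks full-text downloads.
BAD_ROBOTS = "User-agent: *\nDisallow: /bitstream\nDisallow: /search\n"
GOOD_ROBOTS = "User-agent: *\nDisallow: /search\n"

if __name__ == "__main__":
    print("bad: ", blocked_paths(BAD_ROBOTS, MUST_STAY_CRAWLABLE))
    print("good:", blocked_paths(GOOD_ROBOTS, MUST_STAY_CRAWLABLE))
```

An empty list means none of the checked paths is blocked for that user agent.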

Example good robots.txt

...

Code Block
<meta content="Tansley, Robert" name="citation_author" />
<meta content="Donohue, Tim" name="citation_author" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />

...

These meta tags are the "Highwire Press tags" which Google Scholar recommends.  If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, they are configurable in [dspace]/config/crosswalks/google-metadata.properties

Much more information is available in the Configuration section on Google Scholar Metadata Mappings.
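To confirm that an item page actually carries these tags, you could extract every citation_* meta tag from its HTML. The following Python sketch uses only the standard html.parser module; the class and function names are hypothetical, and the sample HTML is a hand-written fragment rather than real DSpace output.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect (name, content) pairs from <meta name="citation_..."> tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith("citation_"):
            self.tags.append((name, attr.get("content")))


def citation_tags(html: str) -> list:
    parser = CitationMetaParser()
    parser.feed(html)
    parser.close()
    return parser.tags


# Hand-written sample; a real check would use the HTML of an item page.
SAMPLE_HTML = (
    '<head>'
    '<meta content="Donohue, Tim" name="citation_author" />'
    '<meta content="Ensuring your DSpace is indexed" name="citation_title" />'
    '<meta content="width=device-width" name="viewport" />'
    '</head>'
)

if __name__ == "__main__":
    for name, content in citation_tags(SAMPLE_HTML):
        print(name, "=", content)
```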

Avoid redirecting file downloads to Item landing pages

Make sure that you never redirect "direct file downloads" (i.e. users who directly jump to downloading a file, often from a search engine) to the associated Item's splash/landing page.  In the past, some DSpace sites have added these custom URL redirects in order to facilitate capturing statistics via Google Analytics or similar.

While these URL redirects may seem harmless, they may be flagged as cloaking or spam by Google, Google Scholar and other major search engines. This may hurt your site's search engine ranking or even cause your entire site to be flagged for removal from the search engine.

If you have these URL redirects in place, it is highly recommended to remove them immediately. If you created these redirects to facilitate capturing download statistics in Google Analytics, you should consider upgrading to DSpace 5.0 or above, which is able to automatically record bitstream downloads in Google Analytics (see DS-2088) without the need for any URL redirects.
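If you are unsure whether such a redirect is still in place, one rough test is to request a file download URL without following redirects and inspect the response. The Python sketch below only covers the classification step. It assumes, purely for illustration, that item landing pages contain /handle/ in their path; the function name and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirects_to_item_page(status: int, location) -> bool:
    """Classify the response to a direct file download request.

    status   -- HTTP status code returned for the bitstream URL
    location -- value of the Location header, or None if absent
    Assumption: item landing pages have "/handle/" in their URL path.
    """
    if status not in REDIRECT_STATUSES or not location:
        return False
    return "/handle/" in urlparse(location).path


if __name__ == "__main__":
    # A plain 200 response is what crawlers should see for a file download.
    print(redirects_to_item_page(200, None))
    # A redirect to the landing page is the pattern to remove.
    print(redirects_to_item_page(302, "http://repo.foo.edu/handle/123456789/1"))
```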

Turn OFF any generation of PDF cover pages

While DSpace offers a PDF Citation Cover Page option, this option may affect your content's visibility in search engines like Google Scholar.  Google Scholar (and possibly other search engines) specifically extracts metadata by analyzing the contents of the first page of a PDF.  Dynamically inserting a custom cover page can break the metadata extraction techniques of Google Scholar and may result in all or much of your site being dropped from the Google Scholar search engine.

For more information, please see the "Indexing Repositories: Pitfalls and Best Practices" talk from Anurag Acharya (co-creator of Google Scholar) presented at the Open Repositories 2015 conference.

In general, OAI-PMH is not useful to Search Engines

...

  • No reliable way to determine OAI-PMH base URL for a DSpace site.
  • No standard or predictable way to get to item display page or full text from an OAI-PMH record, making effective indexing and presenting meaningful results difficult.
  • In most cases provides only access to simple Dublin Core, a subset of available metadata.
  • NOTE: Back in 2008, Google officially announced they were retiring support for OAI-PMH based Sitemaps. So, OAI-PMH will no longer help you get better indexing through Google. Instead, you should be using the DSpace 'generate-sitemaps' feature described above.