...

Once you've enabled your sitemaps, they will be accessible at the following URLs:

  • XML Sitemaps / Sitemaps.org syntax: [dspace.url]/sitemap
  • HTML Sitemaps: [dspace.url]/htmlmap

So, for example, if you have "dspace.url = http://mysite.org/xmlui" in your "dspace.cfg" configuration file, then the HTML Sitemap would be at: "http://mysite.org/xmlui/htmlmap"
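
For reference, the sitemap served at [dspace.url]/sitemap uses the standard sitemaps.org XML syntax. As a minimal sketch (the handle URL below is illustrative, not taken from a real instance), a single entry might look like:

Code Block
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per item, collection and community -->
  <url>
    <loc>http://mysite.org/xmlui/handle/123456789/1</loc>
  </url>
</urlset>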

...

  1. Provide a hidden link to the sitemaps on your DSpace homepage. If you've customized your site's look and feel (as most have), ensure that there is a link to /htmlmap on your DSpace front or home page. By default, both the JSPUI and XMLUI provide this link in the footer (a sketch linking both sitemap formats appears after this list):

    Code Block
    <a href="/htmlmap"></a>
  2. Announce your sitemap in your robots.txt.  Most major search engines will also automatically discover your sitemap if you announce it in your robots.txt file.  For example:

    Code Block
    # The FULL URL to the DSpace sitemaps
    # XML sitemap is listed first as it is preferred by most search engines
    # Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
    Sitemap: [dspace.url]/sitemap
    Sitemap: [dspace.url]/htmlmap
    1. These "Sitemap:" lines can be placed anywhere in your robots.txt file. You can also specify multiple "Sitemap:" lines, so that search engines can locate both formats. For more information, see: http://www.sitemaps.org/protocol.html#informing
    2. Be sure to include the FULL URL in the "Sitemap:" line. Relative paths are not supported.
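
As noted in step 1, the stock footers only link /htmlmap. If you would also like crawlers to discover the XML sitemap from your homepage, a second empty anchor can be added in the same way. This is a sketch, not part of the default templates:

Code Block
<!-- Hidden (empty) links: discoverable by crawlers, invisible to readers -->
<a href="/sitemap"></a>
<a href="/htmlmap"></a>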

Search engines will now look at your XML and HTML sitemaps, which are pre-generated (and thus served with minimal impact on your hardware) XML or HTML files linking directly to items, collections and communities in your DSpace instance. Crawlers will not have to work their way through any browse screens, which are intended more for human consumption and are more expensive for the server.

...

Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/dspace/robots.txt. If your DSpace instance is served from e.g. http://repo.foo.edu/dspace/, you'll need to add /dspace to all the paths in the examples below (e.g. /dspace/browse-subject).
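
For example, if your DSpace is served from http://repo.foo.edu/dspace/, a minimal robots.txt might look like the following sketch (only the /dspace context path changes; the full example later on this page still applies):

Code Block
# Full URLs to the sitemaps, including the /dspace context path
Sitemap: http://repo.foo.edu/dspace/sitemap
Sitemap: http://repo.foo.edu/dspace/htmlmap

User-agent: *
# The "/dspace" context path is prepended to every rule
Disallow: /dspace/discover
Disallow: /dspace/search-filter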

Warning

DSpace 1.5 and 1.5.1 ship with a bad robots.txt file. Delete the file, or at minimum remove the line that says Disallow: /browse. If you do not, your site will not be correctly indexed.

NEVER BLOCK THESE PATHS

Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.

  • /bitstream
  • /browse  (UNLESS USING SITEMAPS)
  • /*/browse (UNLESS USING SITEMAPS)
  • /browse-date (UNLESS USING SITEMAPS)
  • /*/browse-date (UNLESS USING SITEMAPS)
  • /community-list (UNLESS USING SITEMAPS)
  • /handle
  • /html
  • /htmlmap
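
In other words, a robots.txt like the following sketch would be harmful, because it hides your actual content (bitstreams and item pages) from crawlers:

Code Block
# DO NOT do this: blocking these paths prevents your content from being indexed
User-agent: *
Disallow: /bitstream
Disallow: /handle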

...

Below is an example of a good robots.txt.  The highly recommended settings are uncommented.  Additional, optional settings are displayed in comments; based on your local configuration, you may wish to enable them by uncommenting the corresponding "Disallow:" line.

Code Block
# The FULL URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
Sitemap: [dspace.url]/sitemap
Sitemap: [dspace.url]/htmlmap

##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
# For JSPUI, replace "/search-filter" above with "/simple-search"
#
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content:
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register

WARNING: For your additional disallow statements to be recognized under the User-agent: * group, they cannot be separated by blank lines from the declared User-agent: * block. A blank line indicates the start of a new user-agent block. Without a leading user-agent declaration on the first line, blocks are ignored. Comment lines are allowed and will not break the user-agent block.

...

Code Block
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover 
Disallow: /search-filter
Disallow: /statistics
Disallow: /contact

This is not OK, as the two lines at the bottom will be completely ignored:

Code Block
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover 
Disallow: /search-filter
 
Disallow: /statistics
Disallow: /contact
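
By contrast, comment lines do not end a group, so the following variant (a sketch illustrating the rule above) keeps all four Disallow lines in effect:

Code Block
User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
# Comment lines like this one do not break the user-agent block
Disallow: /statistics
Disallow: /contact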

To identify if a specific user agent has access to a particular URL, you can use this handy robots.txt tester.

For more information on the robots.txt format, please see the Google Robots.txt documentation.

Ensure Item Metadata appears in the HTML HEAD

...

Code Block
<meta content="Tansley, Robert; Donohue, Timothy" name="citation_authors" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />


These meta tags are the "Highwire Press tags" which Google Scholar recommends.  If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, they are configurable in [dspace]/config/crosswalks/google-metadata.properties
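
The mapping format in that file pairs each Google Scholar tag with one or more DSpace metadata fields (multiple fields can be listed as fallbacks). Below is a sketch of what such mappings might look like; the exact keys and default values may differ between DSpace versions, so treat these entries as illustrative:

Code Block
google.citation_title = dc.title
google.citation_authors = dc.author | dc.contributor.author | dc.creator
google.citation_date = dc.date.issued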

...