Versions Compared

Comment: Fix Sitemaps docs for 7.x

...

Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.

  • /bitstreams

  • /browse/*  (UNLESS USING SITEMAPS)

  • /collections

  • /communities

  • /community-list (UNLESS USING SITEMAPS)

  • /entities/*

  • /handle

  • /items

Example good robots.txt

Below is an example good robots.txt.  The highly recommended settings are uncommented.  Additional, optional settings are displayed in comments – based on your local configuration you may wish to enable them by uncommenting the corresponding "Disallow:" line.

Code Block
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
Sitemap: /sitemap_index.xml
Sitemap: /sitemap_index.html

##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes; submission; workspace; workflow & profile page
Disallow: /search
Disallow: /admin/*
Disallow: /processes
Disallow: /submit
Disallow: /workspaceitems
Disallow: /profile
Disallow: /workflowitems

# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse/*
#
# If you have configured DSpace (Solr-based) Statistics to be publicly 
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register

...

This is OK:

Code Block
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search
Disallow: /admin/*
Disallow: /processes

This is not OK, as the two lines at the bottom will be completely ignored.

Code Block
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search

Disallow: /admin/*
Disallow: /processes

To identify if a specific user agent has access to a particular URL, you can use this handy robots.txt tester.
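If you prefer to check a robots.txt locally, the following is a minimal sketch using Python's standard-library urllib.robotparser module. The user agent name "ExampleBot" and the host "repository.example.org" are placeholders, not DSpace defaults. This particular parser treats an empty line as the end of a group record, so it reproduces the "not OK" case above; other crawlers may be more lenient. It also matches rule paths as plain prefixes, so wildcard rules such as /admin/* are left out of the sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site URL, used only for illustration.
BASE = "https://repository.example.org"

# A well-formed group record: no blank lines between its rules.
GOOD_RECORD = """\
User-agent: *
# Disable access to Discovery search and processes
Disallow: /search
Disallow: /processes
"""

# The same rules, but a blank line splits the record in two.
BROKEN_RECORD = """\
User-agent: *
Disallow: /search

Disallow: /processes
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, text in (("good", GOOD_RECORD), ("broken", BROKEN_RECORD)):
        for path in ("/search", "/processes", "/items/123"):
            allowed = can_crawl(text, "ExampleBot", BASE + path)
            print(f"{label:6} {path:12} allowed={allowed}")
```

With the broken record, this parser still blocks /search but no longer blocks /processes, because the rule after the blank line is not attached to any User-agent.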

...

It's possible to greatly customize the look and feel of your DSpace, which makes it harder for search engines, and other tools and services such as Zotero, Connotea and SIMILE Piggy Bank, to correctly pick out item metadata fields. To address this, DSpace includes item metadata in the <head> element of each item's HTML display page.
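To illustrate how a tool might read that metadata regardless of the page's visual design, here is a minimal sketch using Python's standard-library html.parser module. The sample page and its meta tag names (DC.title, DC.creator) are invented for illustration; the tags emitted by your DSpace may differ.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect (name, content) pairs from <meta> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            # Keep only tags that carry both a name and a content attribute.
            if attr.get("name") and attr.get("content") is not None:
                self.fields.append((attr["name"], attr["content"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Hypothetical item page; the tag names and values are assumptions.
SAMPLE_PAGE = """<html><head>
<title>Example item</title>
<meta name="DC.title" content="An Example Item">
<meta name="DC.creator" content="Doe, Jane">
</head><body><meta name="ignored" content="x"></body></html>"""


def extract_head_metadata(html):
    """Return the list of (name, content) pairs from the page's <head>."""
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    for name, content in extract_head_metadata(SAMPLE_PAGE):
        print(f"{name}: {content}")
```

Because the parser looks only at the <head> element, changes to the visible layout of the item page do not affect what it extracts.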

...