...
Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.
/bitstreams
/browse/* (UNLESS USING SITEMAPS)
/collections
/communities
/community-list (UNLESS USING SITEMAPS)
/entities/*
/handle
/items
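As a sketch of the difference this makes (the path names follow the list above; adjust them to your own URL layout), compare a harmful Disallow with a safe one:

```
# HARMFUL sketch - this would hide every item page from every crawler:
# User-agent: *
# Disallow: /handle

# Safe sketch - block only search and admin paths, leaving /handle,
# /items, /bitstreams and the other paths above crawlable:
User-agent: *
Disallow: /search
Disallow: /admin/*
```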
Example good robots.txt
Below is an example of a good robots.txt. The highly recommended settings are uncommented. Additional optional settings are displayed in comments – based on your local configuration you may wish to enable them by uncommenting the corresponding "Disallow:" line.
Code Block
# The FULL URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[dspace.url]" with the value of your 'dspace.url' setting in your dspace.cfg file.
Sitemap: [dspace.url]/sitemap_index.xml
Sitemap: [dspace.url]/sitemap_index.html

##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes;
# submission; workspace; workflow & profile page
Disallow: /search
Disallow: /admin/*
Disallow: /processes
Disallow: /submit
Disallow: /workspaceitems
Disallow: /profile
Disallow: /workflowitems
#
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse/*
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
...
This is OK:
Code Block
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search
Disallow: /admin/*
Disallow: /processes
This is not OK, as the two lines at the bottom will be completely ignored.
Code Block
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search

Disallow: /admin/*
Disallow: /processes
To check whether a specific user agent has access to a particular URL, you can use one of the many online robots.txt testing tools.
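You can also test a group record locally with Python's standard-library `urllib.robotparser`, which applies the same blank-line rule. This sketch uses a placeholder hostname, and avoids wildcard patterns such as `/admin/*` because `robotparser` treats rule paths as plain prefixes:

```python
import urllib.robotparser

def parser_for(rules: str) -> urllib.robotparser.RobotFileParser:
    """Build a parser from an in-memory robots.txt string."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp

BASE = "https://dspace.example.org"  # placeholder hostname

# Contiguous group record: every Disallow line applies.
ok = parser_for("""\
User-agent: *
Disallow: /search
Disallow: /processes
""")

# Blank line inside the group record: the lines below it are dropped.
broken = parser_for("""\
User-agent: *
Disallow: /search

Disallow: /processes
""")

print(ok.can_fetch("*", BASE + "/handle/123456789/2"))  # True: item pages stay crawlable
print(ok.can_fetch("*", BASE + "/processes"))           # False: blocked as intended
print(broken.can_fetch("*", BASE + "/processes"))       # True: the rule was silently ignored
```

The third call shows the failure mode described above: once a blank line ends the group record, the remaining Disallow lines are never applied.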
...
It's possible to greatly customize the look and feel of your DSpace, which makes it harder for search engines, and for other tools and services such as Zotero, Connotea and SIMILE Piggy Bank, to correctly pick out item metadata fields. To address this, DSpace (both XMLUI and JSPUI) includes item metadata in the <head> element of each item's HTML display page.
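As a sketch of what this looks like, the tags follow the Highwire Press `citation_*` naming convention that Google Scholar and reference managers read; the item values below are purely illustrative, and the exact tag set depends on your metadata and configuration:

```html
<head>
  <!-- Hypothetical item: titles, names, dates and URLs are placeholders -->
  <meta name="citation_title" content="Example Item Title" />
  <meta name="citation_author" content="Doe, Jane" />
  <meta name="citation_publication_date" content="2012" />
  <meta name="citation_pdf_url"
        content="https://dspace.example.org/bitstream/handle/123456789/1/example.pdf" />
</head>
```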
...