Islandora can provide a robust and full-featured institutional repository solution. The following documentation inventories the range of modules and configurations that enable core institutional repository functionality in Islandora.

Support

...

Content Models for ETDs and Archived Publications

Islandora Scholar Solution Pack

...

Islandora Scholar Solution Pack

Google Scholar requires a browse interface that allows its search robots to discover the URLs of articles in your repository. The Islandora Scholar module automatically generates the following views of your content, as recommended by Google Scholar.

  • For small collections (less than one thousand papers), e.g., papers written by a single author or a small group, all articles are listed on a single HTML page, such as www.example.edu/~professor/publications.html, including links to the full text PDF.
  • For medium sized collections (thousands of papers), articles are grouped by date of publication or the date of record entry. Other forms of browse interfaces, such as browse by author or by keyword, often generate more URLs than your website can deliver to the search robots in a reasonable amount of time. This view appears at yourdomain.edu/gs_years.
  • For very large collections (over one hundred thousand papers), an additional browse interface lists only the articles added in the last two weeks. This smaller set of webpages can be recrawled more frequently than your entire browse interface, which will facilitate timely coverage of your recent papers by the search robots. This view appears at yourdomain.edu/gs_updated.

Google Scholar recommends that the URL of every article be reachable from the homepage by following at most ten simple HTML links. Some institutions report success with placing hidden links to the above landing pages on prominent pages in their repository.

The Scholar solution pack also adds meta tags to citation and thesis object pages so that Googlebot can crawl them. Zotero also recognizes these tags.

Customizing your repository's look and feel can make it harder for search engines (and other tools and services such as Zotero, Connotea and SIMILE Piggy Bank) to correctly identify item metadata fields. To address this, the Islandora Google Scholar Module automatically places item metadata in the <head> element of each item's HTML display page.

Dublin Core Metadata

<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" />

If you have heavily customized your metadata fields away from Dublin Core, you can modify the crosswalk that generates these elements by modifying [islandora]/config/crosswalks/xhtml-head-item.properties.

Google Scholar Metadata

Google Scholar requires publications to have a title, at least one author, and a publication date for inclusion. The Islandora Google Scholar Module includes these required fields in each item's HTML display page.

<meta content="Tansley, Robert" name="citation_author" />
<meta content="Donohue, Tim" name="citation_author" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />
<meta content="2018" name="citation_publication_date" />

Tip:

  • If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, they are configurable in [islandora]/config/crosswalks/google-metadata.properties.
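
As a quick check on a rendered item page, the following Python sketch reports which of the three required citation_* tags are missing. It is only an illustration: the class and function names are hypothetical, it is not part of any Islandora module, and it assumes you have already saved or fetched the page's HTML as a string.

```python
from html.parser import HTMLParser

# The three fields the text above lists as required by Google Scholar.
REQUIRED_TAGS = {"citation_title", "citation_author", "citation_publication_date"}


class MetaNameCollector(HTMLParser):
    """Collect the name of every <meta> tag that carries non-empty content."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") and attributes.get("content"):
            self.names.add(attributes["name"])


def missing_scholar_tags(html):
    """Return the required citation_* tag names that the page lacks, sorted."""
    collector = MetaNameCollector()
    collector.feed(html)
    collector.close()
    return sorted(REQUIRED_TAGS - collector.names)


# Hypothetical page fragment with the publication date left out.
sample_page = """
<head>
<meta content="Tansley, Robert" name="citation_author" />
<meta content="Ensuring your DSpace is indexed" name="citation_title" />
</head>
"""
print(missing_scholar_tags(sample_page))  # ['citation_publication_date']
```

An empty list means all three required fields are present. The check says nothing about whether the values themselves are correct.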

Islandora JSON-LD Module

The Islandora JSON-LD Module creates a standards-compliant JSON-LD record for an object based on a set of predefined MODS XPaths. Metadata structured as JSON-LD is increasingly important for dataset discovery through services like Google’s Dataset Search.
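
To make the idea of XPath-driven JSON-LD concrete, here is a minimal Python sketch that maps a few MODS elements to a schema.org Dataset record. The XPath-to-property mapping and the function name are assumptions chosen for illustration; they are not the module's actual configuration or output.

```python
import json
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical mapping from a schema.org property to a MODS XPath.
FIELD_XPATHS = {
    "name": "mods:titleInfo/mods:title",
    "description": "mods:abstract",
}


def mods_to_jsonld(mods_xml):
    """Build a small schema.org Dataset record from a MODS XML string."""
    root = ET.fromstring(mods_xml)
    record = {"@context": "https://schema.org", "@type": "Dataset"}
    for prop, xpath in FIELD_XPATHS.items():
        element = root.find(xpath, MODS_NS)
        if element is not None and element.text:
            record[prop] = element.text.strip()
    # Every namePart becomes one creator entry (an assumption of this sketch).
    creators = [
        part.text.strip()
        for part in root.findall("mods:name/mods:namePart", MODS_NS)
        if part.text
    ]
    if creators:
        record["creator"] = [{"@type": "Person", "name": name} for name in creators]
    return record


# Hypothetical MODS record used only to exercise the function.
sample_mods = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example survey data</title></titleInfo>
  <name><namePart>Tansley, Robert</namePart></name>
</mods>
"""
print(json.dumps(mods_to_jsonld(sample_mods), indent=2))
```

Properties whose XPath matches nothing (here, the abstract) are simply omitted from the record rather than emitted as empty values.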

Islandora XML Sitemap module

The XML Sitemap module allows Islandora to expose its content in a way that search engines can easily crawl. Sitemaps allow crawlers to index your site without having to visit every page in your repository, meaning they can do their work more quickly and efficiently. Without sitemaps, search engine indexing activity may significantly tax your repository. The Islandora XML Sitemap module works in conjunction with the Drupal XML Sitemap Custom module to automatically include Islandora objects in the Drupal sitemap. Detailed documentation on installing and configuring this module is available at https://wiki.duraspace.org/display/ISLANDORA/Islandora+XML+Sitemap.

Make your sitemap discoverable to search engines

Even if you've enabled the Islandora XML Sitemap module, search engines may not locate your sitemaps unless you provide a direct link. There are two main options for directing search engines to your sitemaps.

1. Provide a hidden link to the sitemaps on your repository's homepage. If you've customized your site's theme, ensure that there is a link to /htmlmap on your front or home page.

<a href="/htmlmap"></a>

2. Announce your sitemap in your robots.txt. For example:

# The FULL URL for your sitemaps (HTML and XML)
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[islandora.url]" with the value of your 'islandora.url' setting in your islandora.cfg file.
Sitemap: [islandora.url]/sitemap.xml
Sitemap: [islandora.url]/htmlmap

After setting up either option, confirm that the sitemap was successfully submitted to Google via the Search Console (Crawl > Sitemap). It may take a few days for Google to start crawling, and possibly many more for it to finish, depending on the priority Google assigns your content for indexing. If necessary, you can also submit a sitemap manually through the Search Console: visit the Sitemap Report and enter the relative URL for the sitemap. Google should process the sitemap immediately, but site indexing may take some time. Follow the same steps to resubmit a sitemap if your site structure changes significantly.

Tips:

  • Sitemap: lines can be placed anywhere in your robots.txt file. You can specify multiple "Sitemap:" lines, so that search engines can locate XML and HTML formats. For more information, see: http://www.sitemaps.org/protocol.html#informing
  • Always include the FULL URL in the Sitemap: line. Relative paths are not supported.
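
The full-URL rule in the tips above is easy to verify mechanically. The Python sketch below lists every Sitemap: line in a robots.txt body and flags any value that is not an absolute http(s) URL. The function name and the repo.example.edu host are hypothetical.

```python
from urllib.parse import urlparse


def check_sitemap_lines(robots_txt):
    """Return (url, is_absolute) for every Sitemap: line in a robots.txt body."""
    results = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() != "sitemap":
            continue
        url = value.strip()
        parsed = urlparse(url)
        is_absolute = parsed.scheme in ("http", "https") and bool(parsed.netloc)
        results.append((url, is_absolute))
    return results


# Hypothetical robots.txt with one valid and one relative Sitemap: line.
sample_robots = "\n".join([
    "User-agent: *",
    "Sitemap: https://repo.example.edu/sitemap.xml",
    "Sitemap: /htmlmap",
])
for url, is_absolute in check_sitemap_lines(sample_robots):
    print(url, "OK" if is_absolute else "NOT a full URL")
```

A line still containing the unreplaced "[islandora.url]" placeholder would also be reported as not a full URL, which makes this a convenient check after editing the file.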

Other Good Practices for SEO in Islandora

Keep Islandora up to date

New Islandora releases may include improvements to indexing tools.

Confirm that your site is visible to search engines

  • Check that your site is visible by performing a "site:" search on each relevant search engine (e.g., search Google for "site:wikipedia.org")
  • If your site does not show up, each search engine has its own process for manually submitting a URL for inclusion. The processes for common search engines include:

Create a strong robots.txt

Your site should include a robots.txt file, which tells search engine crawlers which pages or files they can request from your site. A robust robots.txt file must strike a balance between protecting your server from excessive crawler requests and giving crawlers access to the content they need to index your site comprehensively. In the case of a repository, crawlers should be able to index item, collection and community pages, and all bitstreams within items. Crawlers access your site as an anonymous user, so they cannot reach restricted content.

Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/islandora/robots.txt. If your Islandora instance is served from e.g. http://repo.foo.edu/islandora/, you'll need to add /islandora to all the paths in the examples below (e.g. /islandora/browse-subject).

Ensure your robots.txt allows access to critical indexing paths

You may wish to block crawlers from accessing some URLs in your repository that do not contain relevant information, such as log-in or registration pages or contact and feedback forms. However, blocking certain URLs can impede crawlers from properly indexing your site. Never put the following paths on Disallow: lines, or your repository might not be indexed properly:

  • /bitstream
  • /browse (UNLESS USING SITEMAPS)
  • /*/browse (UNLESS USING SITEMAPS)
  • /browse-date (UNLESS USING SITEMAPS)
  • /*/browse-date (UNLESS USING SITEMAPS)
  • /community-list (UNLESS USING SITEMAPS)
  • /handle
  • /html
  • /htmlmap

Tips:

  • Disallow: lines are case sensitive.
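
One way to catch an accidental Disallow: on a critical path is to run the file through a robots.txt parser. The Python sketch below uses the standard library's urllib.robotparser for this. It is a rough check only: that parser does plain prefix matching and does not interpret wildcard patterns such as /*/browse, so only the literal paths from the list above are tested. The function name, host and user-agent string are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Literal (non-wildcard) paths from the list above that must stay crawlable.
CRITICAL_PATHS = ["/bitstream", "/handle", "/html", "/htmlmap"]


def blocked_critical_paths(robots_txt, base_url="http://repo.example.edu",
                           user_agent="Googlebot"):
    """Return the critical paths that the given robots.txt body disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path for path in CRITICAL_PATHS
        if not parser.can_fetch(user_agent, base_url + path)
    ]


# Hypothetical robots.txt that wrongly blocks /handle.
sample_robots = "\n".join([
    "User-agent: *",
    "Disallow: /admin",
    "Disallow: /handle",
])
print(blocked_critical_paths(sample_robots))  # ['/handle']
```

Because matching is by prefix, a rule such as "Disallow: /html" would block both /html and /htmlmap, so a short Disallow: value can block more than intended.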

Ensure your robots.txt allows access to item "splash" pages and full text

Full-text access is critically important for effective indexing, enabling keyword searching and citation analysis among other functions.

Example good robots.txt

The following example robots.txt includes highly recommended settings (uncommented) and additional optional settings (in comments). Your local configuration will determine whether you choose to enable optional settings. To do so, uncomment the corresponding Disallow: line.

#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
User-agent: *
Crawl-delay: 10
# The FULL URL to the Islandora sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# Make sure to replace "[islandora.url]" with the value of your 'islandora.url' setting in your islandora.cfg file.
Sitemap: [islandora.url]/sitemap.xml
Sitemap: [islandora.url]/htmlmap
# CSS, JS, Images
Allow: /misc/*.css$
Allow: /misc/*.css?
Allow: /misc/*.js$
Allow: /misc/*.js?
Allow: /misc/*.gif
Allow: /misc/*.jpg
Allow: /misc/*.jpeg
Allow: /misc/*.png
Allow: /modules/*.css$
Allow: /modules/*.css?
Allow: /modules/*.js$
Allow: /modules/*.js?
Allow: /modules/*.gif
Allow: /modules/*.jpg
Allow: /modules/*.jpeg
Allow: /modules/*.png
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /themes/*.css$
Allow: /themes/*.css?
Allow: /themes/*.js$
Allow: /themes/*.js?
Allow: /themes/*.gif
Allow: /themes/*.jpg
Allow: /themes/*.jpeg
Allow: /themes/*.png
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
# trailing slash sections courtesy of advice from https://www.volacci.com/blog/fix-problems-drupal-default-robotstxt-file
##########################
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
# Paths (clean URLs) – no trailing forward slash
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs) – no trailing forward slash
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login

WARNING: For additional Disallow statements to be recognized under the User-agent: * group, they must not be separated from the User-agent: * declaration by blank lines. A blank line indicates the start of a new user-agent block, and a block without a leading User-agent declaration on its first line is ignored. Comment lines are allowed and do not break the user-agent block.

This is OK:

User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
Disallow: /statistics
Disallow: /contact

This is not OK, as the two lines at the bottom will be completely ignored.

User-agent: *
# Disable access to Discovery search and filters
Disallow: /discover
Disallow: /search-filter
  
Disallow: /statistics
Disallow: /contact
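
The blank-line rule described above can be checked automatically. This Python sketch scans a robots.txt body and reports any Allow: or Disallow: line that no longer belongs to a User-agent group. It follows the rule as stated in this document, treating whitespace-only lines as blank; individual crawlers may be more lenient. The function name is hypothetical.

```python
def find_orphaned_rules(robots_txt):
    """Return (line_number, text) for Allow/Disallow lines outside any group."""
    orphaned = []
    in_group = False
    for number, raw_line in enumerate(robots_txt.splitlines(), start=1):
        if not raw_line.strip():
            in_group = False  # a blank line closes the current group
            continue
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # comment-only line: does not break the group
        field = line.partition(":")[0].strip().lower()
        if field == "user-agent":
            in_group = True
        elif field in ("allow", "disallow") and not in_group:
            orphaned.append((number, line))
    return orphaned


# The "not OK" example from above, with the blank line on line 5.
not_ok = "\n".join([
    "User-agent: *",
    "# Disable access to Discovery search and filters",
    "Disallow: /discover",
    "Disallow: /search-filter",
    "",
    "Disallow: /statistics",
    "Disallow: /contact",
])
print(find_orphaned_rules(not_ok))
# [(6, 'Disallow: /statistics'), (7, 'Disallow: /contact')]
```

Running the same function on the "OK" example returns an empty list, because the comment line does not interrupt the group.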

Tips:

  • To determine whether Google has access to a particular URL on your site, you can use the robots.txt tester in the Google Webmaster Tools suite.
  • For more information on the robots.txt format, please see the Google Robots.txt documentation.

Avoid redirecting file downloads to Item landing pages

Some repositories use custom URL redirects in order to facilitate capturing web analytics (e.g., Google Analytics). While these URL redirects may seem harmless, they may be flagged as cloaking or spam by Google, Google Scholar and other major search engines. This may hurt your site's search engine ranking or even cause your entire site to be flagged for removal from the search engine. Make sure that you never redirect "direct file downloads" (i.e. users who directly jump to downloading a file, often from a search engine) to the associated Item's splash/landing page.

Acknowledgements

Development of the Islandora JSON-LD module, Islandora Matomo Docker image, Islandora Matomo module and real-time map plug-in, Islandora Scholar Profiles module, and Islandora Metadata Export module, as well as enhancements to the Islandora Webform module and the Islandora Scholar solution pack, was supported by a grant from the Andrew W. Mellon Foundation. To learn more about this work, visit the LASIR project page.

...