Please be aware that individual search engines also have their own guidelines and recommendations for inclusion. While the guidelines below apply to most DSpace sites, you may also wish to review these guidelines for specific search engines:
Anyone who has analyzed traffic to their DSpace site (e.g. using Google Analytics or similar) will notice that a significant portion (and in many cases a majority) of visitors arrive via a search engine such as Google or Yahoo. Hence, to help maximize the impact of content and thus encourage further deposits, it is important to ensure that your DSpace instance is indexed effectively.
DSpace comes with tools that ensure major search engines (Google, Bing, Yahoo, Google Scholar) are able to easily and effectively index all your content. However, many of these tools require some basic setup. Here's how to ensure your site is indexed.
For optimal indexing, you should:
Ensure your proxy is passing X-Forwarded headers to the User Interface
We are constantly adding new indexing improvements to DSpace. In order to ensure your site gets all of these improvements, you should strive to keep it up-to-date. For example:
Additional minor improvements / bug fixes have been made to more recent releases of DSpace.
First ensure your DSpace instance is visible, e.g. with: https://www.google.com/webmasters/tools/sitestatus
If your site is not indexed at all, all search engines have a way to add your URL, e.g.:
Some HTML tags important for SEO, such as the "citation_pdf_url" tag, require the full URL of your site. The DSpace user interface will automatically attempt to "discover" that URL using HTTP headers.
Because most DSpace sites use some sort of proxy (e.g. Apache web server or Nginx or similar), this requires that the proxy be configured to pass along proper X-Forwarded-* headers, especially X-Forwarded-Host and X-Forwarded-Proto. For example, in Apache HTTPD, you can do something like this:
# This lets DSpace know it is running behind HTTPS and what hostname is currently used
# (requires installing/enabling mod_headers)
RequestHeader set X-Forwarded-Proto https
RequestHeader set X-Forwarded-Host my.dspace.edu
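If you use Nginx rather than Apache, the equivalent configuration is a couple of proxy_set_header directives. The sketch below is illustrative: the hostname and the proxied port (here the common dspace-angular default of 4000) are assumptions you should adjust to your own setup.

```nginx
location / {
    proxy_pass http://localhost:4000;
    # Tell DSpace it is running behind HTTPS and which hostname is in use
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-Host my.dspace.edu;
}
```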
In DSpace 7, server-side rendering is enabled by default (when running in production mode). However, it's important to ensure you do not disable it in production mode. Per the frontend Installation instructions, you MUST also be running your user interface in production mode (via either yarn run serve:ssr or yarn start).
Because the DSpace user interface is based on Angular.io (which is a JavaScript framework), you MUST have server-side rendering enabled (which is the default) for search engines to fully index your site. Server-side rendering allows your site to still function even when JavaScript is turned off in a user's browser. Some web crawlers do not support JavaScript (e.g. Google Scholar), so they will only interact with this server-side rendered content.
If you are unsure if server-side rendering (SSR) is enabled, you can check to see if your site is accessible when JavaScript is turned off. For example, in Chrome, you should be able to do the following:
DSpace uses Angular Universal for server-side rendering, and it's enabled by default in Production mode via our production environment initialization in src/environments/environment.production.ts:
// Angular Universal Settings
universal: {
  preboot: true,
  ...
},
For more information, see "Universal (Server-side Rendering) settings" in User Interface Configuration.
As of DSpace 7, sitemaps are enabled by default and automatically update on a daily basis. This is the recommended setup to ensure proper indexing. So, there's nothing you need to do unless you wish to either change their schedule, or disable them.
In the dspace.cfg, the sitemap generation schedule is controlled by this setting:
# By default, sitemaps regenerate daily at 1:15am server time
sitemap.cron = 0 15 1 * * ?
You can modify this schedule by using the Cron syntax defined at https://www.quartz-scheduler.org/api/2.3.0/org/quartz/CronTrigger.html. Any modifications can be placed in your local.cfg.
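For example, to regenerate sitemaps twice daily instead of once, you could add something like the following to your local.cfg (the schedule shown is illustrative; Quartz cron fields are seconds, minutes, hours, day-of-month, month, day-of-week):

```
# Regenerate sitemaps at 1:15am and 1:15pm server time
sitemap.cron = 0 15 1,13 * * ?
```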
If you want to disable this automated scheduler, you can either comment it out, or set it to a single "-" (dash) in your local.cfg:
# This disables the automatic updates
sitemap.cron = -
Again, we highly recommend keeping them enabled. However, you may choose to disable this scheduler if you wish to define these in your local system cron settings.
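If you do disable the built-in scheduler, a system crontab entry along these lines could take its place (a sketch: the [dspace] placeholder is your backend installation directory, and the schedule is illustrative):

```
# m  h  dom mon dow  command
15   1  *   *   *    [dspace]/bin/dspace generate-sitemaps
```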
Once you've enabled your sitemaps, they will be accessible at the following URLs:
So, for example, if "dspace.ui.url = https://mysite.org" in your "dspace.cfg" configuration file, then the HTML sitemap would be at "https://mysite.org/sitemap_index.html".
By default, the sitemap URLs also will appear in your UI's robots.txt (in order to announce them to search engines):
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
Sitemap: [dspace.ui.url]/sitemap_index.xml
Sitemap: [dspace.ui.url]/sitemap_index.html
If you want to generate your sitemaps manually, you can use a command-line tool to do so.
WARNING: Keep in mind, you do NOT need to run these manually in most situations, as sitemaps are auto-updated on a regular schedule (see documentation above).
# Command-line option (run from the backend)
[dspace]/bin/dspace generate-sitemaps
This command accepts several options:
Option | Meaning
---|---
-h, --help | Explain the arguments and options.
-s, --no_sitemaps | Do not generate a sitemap in sitemaps.org format.
-b, --no_htmlmap | Do not generate a sitemap in HTML map format.
You can configure the list of "all search engines" by setting the value of sitemap.engineurls in dspace.cfg.
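For example, an entry in this list might look like the following (the exact default value may differ between DSpace versions; treat this as illustrative):

```
# Comma-separated list of search engine URLs to "ping" after sitemap regeneration.
# The generated sitemap URL is appended to each entry.
sitemap.engineurls = http://www.google.com/webmasters/sitemaps/ping?sitemap=
```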
As of 7.5, DSpace's robots.txt file can be found in the UI's codebase at "src/robots.txt.ejs". This is an "embedded javascript template" (ejs) file, which simply allows us to insert variable values into the "robots.txt" at runtime. It can be edited as a normal text file.
The trick here is to minimize load on your server, but without actually blocking anything vital for indexing. Search engines need to be able to index item, collection and community pages, and all bitstreams within items – full-text access is critically important for effective indexing, e.g. for citation analysis as well as the usual keyword searching.
If you have restricted content on your site, search engines will not be able to access it; they access all pages as an anonymous user.
Ensure that your robots.txt file is at the top level of your site: i.e. at http://repo.foo.edu/robots.txt, and NOT e.g. http://repo.foo.edu/dspace/robots.txt. If your DSpace instance is served from e.g. http://repo.foo.edu/dspace/, you'll need to add /dspace to all the paths in the examples below (e.g. /dspace/browse-subject).
Some URLs can be disallowed without negative impact, but be ABSOLUTELY SURE the following URLs can be reached by crawlers, i.e. DO NOT put these on Disallow: lines, or your DSpace instance might not be indexed properly.
/bitstreams
/browse/* (UNLESS USING SITEMAPS)
/collections
/communities
/community-list (UNLESS USING SITEMAPS)
/entities/*
/handle
/items
DSpace 7 comes with an example robots.txt file, which is copied below. As noted above, as of 7.5 this file can be found at "src/robots.txt.ejs" in the DSpace 7 UI and can be edited as a normal text file.
The highly recommended settings are uncommented. Additional optional settings are displayed in comments – based on your local configuration you may wish to enable them by uncommenting the corresponding "Disallow:" line.
# The URL to the DSpace sitemaps
# XML sitemap is listed first as it is preferred by most search engines
# NOTE: The <%= origin %> variables below will be replaced by the fully qualified URL of your site at runtime.
Sitemap: <%= origin %>/sitemap_index.xml
Sitemap: <%= origin %>/sitemap_index.html

##########################
# Default Access Group
# (NOTE: blank lines are not allowable in a group record)
##########################
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes; submission; workspace; workflow & profile page
Disallow: /search
Disallow: /admin/*
Disallow: /processes
Disallow: /submit
Disallow: /workspaceitems
Disallow: /profile
Disallow: /workflowitems
# Optionally uncomment the following line ONLY if sitemaps are working
# and you have verified that your site is being indexed correctly.
# Disallow: /browse/*
#
# If you have configured DSpace (Solr-based) Statistics to be publicly
# accessible, then you may not want this content to be indexed
# Disallow: /statistics
#
# You also may wish to disallow access to the following paths, in order
# to stop web spiders from accessing user-based content
# Disallow: /contact
# Disallow: /feedback
# Disallow: /forgot
# Disallow: /login
# Disallow: /register
# NOTE: The default robots.txt also includes a large number of recommended settings to avoid misbehaving bots.
# For brevity, they have been removed from this example, but can be found in src/robots.txt.ejs
WARNING: for your additional disallow statements to be recognized under the User-agent: * group, they cannot be separated by white lines from the declared User-agent: * block. A white line indicates the start of a new user-agent block. Without a leading user-agent declaration on the first line, blocks are ignored. Comment lines are allowed and will not break the user-agent block.
This is OK:
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search
Disallow: /admin/*
Disallow: /processes
This is not OK, as the two lines at the bottom will be completely ignored.
User-agent: *
# Disable access to Discovery search and filters; admin pages; processes
Disallow: /search

Disallow: /admin/*
Disallow: /processes
To identify if a specific user agent has access to a particular URL, you can use this handy robots.txt tester.
For more information on the robots.txt format, please see the Google Robots.txt documentation.
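You can also check access rules programmatically. The sketch below uses Python's standard-library robots.txt parser against a small sample modeled on the rules shown above; the hostname is a placeholder. Note that urllib.robotparser matches paths literally and does not implement wildcard patterns such as /admin/*, so the sample uses /admin/ instead.

```python
from urllib.robotparser import RobotFileParser

# A small sample of rules modeled on the DSpace defaults shown above.
# (urllib.robotparser does not support the "*" wildcard inside paths,
# so "/admin/" stands in for "/admin/*" here.)
sample_robots = """\
User-agent: *
Disallow: /search
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# Item pages must remain crawlable; search and admin pages should not be.
print(parser.can_fetch("*", "https://my.dspace.edu/items/123"))    # True
print(parser.can_fetch("*", "https://my.dspace.edu/search"))       # False
print(parser.can_fetch("*", "https://my.dspace.edu/admin/panel"))  # False
```

This is a quick way to sanity-check a draft robots.txt before deploying it, though a spec-complete tester (such as Google's) is still needed to validate wildcard rules.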
It's possible to greatly customize the look and feel of your DSpace, which makes it harder for search engines, and other tools and services such as Zotero, Connotea and SIMILE Piggy Bank, to correctly pick out item metadata fields. To address this, DSpace includes item metadata in the <head> element of each item's HTML display page.
<meta name="DC.type" content="Article" />
<meta name="DCTERMS.contributor" content="Tansley, Robert" />
If you have heavily customized your metadata fields away from Dublin Core, you can modify the service which generates these elements by modifying https://github.com/DSpace/dspace-angular/blob/main/src/app/core/metadata/metadata.service.ts.
In addition to Dublin Core <meta> tags in the HTML HEAD, DSpace also includes Google Scholar specific metadata fields in each item's HTML display page.
<meta property="citation_author" content="Tansley, Robert; Donohue, Timothy" />
<meta property="citation_title" content="Ensuring your DSpace is indexed" />
These meta tags are the "Highwire Press tags" which Google Scholar recommends. If you have heavily customized your metadata fields, or wish to change the default "mappings" to these Highwire Press tags, you may do so by modifying https://github.com/DSpace/dspace-angular/blob/main/src/app/core/metadata/metadata.service.ts (see for example the "setCitationAuthorTags()" method in that service class).
Much more information is available in the Configuration section on Google Scholar Metadata Mappings.
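The mapping idea can be sketched in a few lines of Python. This is an illustration of the concept only, not DSpace's actual TypeScript implementation; the field names, mapping table, and helper function are all hypothetical.

```python
# Sketch: map Dublin Core item metadata to Google Scholar's Highwire Press tags.
# The mapping table and helper below are illustrative, not DSpace's real code.
DC_TO_HIGHWIRE = {
    "dc.contributor.author": "citation_author",
    "dc.title": "citation_title",
    "dc.date.issued": "citation_publication_date",
}

def highwire_meta_tags(item_metadata: dict) -> list:
    """Return one <meta> tag string per mapped metadata value."""
    tags = []
    for dc_field, hw_name in DC_TO_HIGHWIRE.items():
        for value in item_metadata.get(dc_field, []):
            tags.append(f'<meta property="{hw_name}" content="{value}" />')
    return tags

item = {
    "dc.title": ["Ensuring your DSpace is indexed"],
    "dc.contributor.author": ["Tansley, Robert", "Donohue, Timothy"],
}
for tag in highwire_meta_tags(item):
    print(tag)
```

Customizing the real mappings is done in metadata.service.ts as described above; this sketch only shows the shape of the transformation.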
Make sure that you never redirect "direct file downloads" (i.e. users who directly jump to downloading a file, often from a search engine) to the associated Item's splash/landing page. In the past, some DSpace sites have added these custom URL redirects in order to facilitate capturing statistics via Google Analytics or similar.
While these URL redirects may seem harmless, they may be flagged as cloaking or spam by Google, Google Scholar and other major search engines. This may hurt your site's search engine ranking or even cause your entire site to be flagged for removal from the search engine.
If you have these URL redirects in place, it is highly recommended to remove them immediately. If you created these redirects to facilitate capturing download statistics in Google Analytics, you should consider upgrading to DSpace 5.0 or above, which is able to automatically record bitstream downloads in Google Analytics (see https://github.com/DSpace/DSpace/issues/5454) without the need for any URL redirects.
While DSpace offers a PDF Citation Cover Page option, this option may affect your content's visibility in search engines like Google Scholar. Google Scholar (and possibly other search engines) specifically extracts metadata by analyzing the contents of the first page of a PDF. Dynamically inserting a custom cover page can break the metadata extraction techniques of Google Scholar and may result in all or much of your site being dropped from the Google Scholar search engine.
For more information, please see the "Indexing Repositories: Pitfalls and Best Practices" talk from Anurag Acharya (co-creator of Google Scholar) presented at the Open Repositories 2015 conference.
Feel free to support OAI-PMH, but be aware that in general it is not useful for search engines:
T