Sorting Search Results

The current Lucene implementation in DSpace does not specify any sort criteria for searches. I suppose "relevancy" is then used by default (right?).
This page discusses how other sort criteria could be parameterized.

Sorting, why?

Google demonstrates the advantage of listing the most useful pages first. "Usefulness" differs from one type of repository to another. In our case, users prefer the most recent articles. In other contexts, relevancy, title, first author, affiliation, or even compound keys (e.g. year of publication + title) may be preferred.

Implementing one basic sort, how?

A prototype patch has been submitted and may be examined on SourceForge to understand how this proposal could be implemented: Sorting simple/advanced Search results

Please do not hesitate to contact me for help implementing this patch: Christophe.Dupriez

This development is part of a bigger one I must do to manage links between records from DSpace (and non-DSpace) repositories, a little like links in a Wiki. Preliminary specifications have been written.

Different kinds of Sort

Fortunately, multiple sort criteria can be specified at search time, so compound keys will not be necessary. For example, it will be possible to sort by the following combination of sort keys at search time: first author, reverse publication year, title. The publication year is sorted in reverse order to ensure that more recent publications are listed first.
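To make the ordering above concrete, here is a minimal sketch in plain Java (not the Lucene API and not DSpace code) of the combination first author, reverse publication year, title. The class SortableItem and its field names are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical record type: the field names are illustrative, not DSpace classes.
class SortableItem {
    final String firstAuthor;
    final int publicationYear;
    final String title;

    SortableItem(String firstAuthor, int publicationYear, String title) {
        this.firstAuthor = firstAuthor;
        this.publicationYear = publicationYear;
        this.title = title;
    }
}

class MultiKeySortSketch {
    // First author ascending, then publication year descending, then title ascending.
    static final Comparator<SortableItem> BY_AUTHOR_YEAR_DESC_TITLE =
        Comparator.comparing((SortableItem i) -> i.firstAuthor)
            .thenComparing(
                Comparator.comparingInt((SortableItem i) -> i.publicationYear).reversed())
            .thenComparing((SortableItem i) -> i.title);

    // Returns a sorted copy and leaves the input list untouched.
    static List<SortableItem> sorted(List<SortableItem> items) {
        List<SortableItem> copy = new ArrayList<>(items);
        copy.sort(BY_AUTHOR_YEAR_DESC_TITLE);
        return copy;
    }

    public static void main(String[] args) {
        List<SortableItem> items = Arrays.asList(
            new SortableItem("Smith", 2001, "B"),
            new SortableItem("Adams", 1999, "Z"),
            new SortableItem("Smith", 2007, "A"));
        for (SortableItem item : sorted(items)) {
            System.out.println(item.firstAuthor + " " + item.publicationYear + " " + item.title);
        }
    }
}
```

Because each key is a separate comparator, no compound key string has to be built or stored in the index.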

User Interface

The sort specification could be a string of sort keys (e.g. title, author, issuedate), each prefixed by a minus sign if that key must be sorted in reverse order. The sort parameter in the HTML form (JSP) could offer the following options:

by Publication date

by Author

by Title

by Relevance
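As an illustration of the sort string described above, the following sketch parses a value such as "author,-issuedate,title" into an ordered list of keys with a reverse flag. The classes SortKey and SortParameterParser are hypothetical names introduced here; they are not part of DSpace or of the submitted patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one parsed sort key.
class SortKey {
    final String name;
    final boolean reverse;

    SortKey(String name, boolean reverse) {
        this.name = name;
        this.reverse = reverse;
    }
}

class SortParameterParser {
    // Splits on commas; a leading minus sign marks a key as reverse order.
    static List<SortKey> parse(String parameter) {
        List<SortKey> keys = new ArrayList<>();
        if (parameter == null) {
            return keys;
        }
        for (String token : parameter.split(",")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty entries such as "a,,b"
            }
            boolean reverse = trimmed.startsWith("-");
            String name = reverse ? trimmed.substring(1).trim() : trimmed;
            if (!name.isEmpty()) {
                keys.add(new SortKey(name, reverse));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        for (SortKey key : parse("author,-issuedate,title")) {
            System.out.println(key.name + (key.reverse ? " (reverse)" : ""));
        }
    }
}
```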

Two special sort names would be predefined (the configuration example at the end of this page uses SCORE and DOC):

Specifying a Sort Key (basic proposal)

DSpace often offers predefined processing for authors, subjects and titles. This basic proposal would follow the same approach by predefining the essential sort index types:

Proposal modeled on search index definitions (search.index.1 = author:dc.contributor.*):

sort.index.1 = dateissued: date(

sort.index.2 = firstauthor: name(

sort.index.3 = sorttitle: title(dc.title)

This basic implementation can be a step before going to something more complex.
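A definition of the form "name: type(field)" could be read as sketched below. Only the third definition above (sorttitle: title(dc.title)) is complete on this page, so it is the one used here. The class name SortIndexDefinition and the exact syntax accepted by the regular expression are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parsed form of a value such as "sorttitle: title(dc.title)".
class SortIndexDefinition {
    // name ":" type "(" field ")", with optional whitespace around each part
    private static final Pattern SYNTAX =
        Pattern.compile("\\s*(\\w+)\\s*:\\s*(\\w+)\\(\\s*([^)]*?)\\s*\\)\\s*");

    final String name;   // sort index name, e.g. "sorttitle"
    final String type;   // predefined processing, e.g. "title"
    final String field;  // metadata field, e.g. "dc.title"

    private SortIndexDefinition(String name, String type, String field) {
        this.name = name;
        this.type = type;
        this.field = field;
    }

    static SortIndexDefinition parse(String value) {
        Matcher m = SYNTAX.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a sort index definition: " + value);
        }
        return new SortIndexDefinition(m.group(1), m.group(2), m.group(3));
    }

    public static void main(String[] args) {
        SortIndexDefinition def = parse("sorttitle: title(dc.title)");
        System.out.println(def.name + " / " + def.type + " / " + def.field);
    }
}
```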

Specifying a Sort Key (more complex proposal)

I propose not to follow this proposal of mine for now. --Christophe.Dupriez 19:26, 2 January 2008 (EST)

The sort key (somewhat like a Lucene tokenization process) must be specified by defining a sometimes complex "string" expression. From my point of view, EL (JSTL Expression Language) has the required level of expressiveness, especially if an extension mechanism allows new Java String functions to be added. It is not ideal (there is no concatenation operator, so a function is required; it is interpreted, not compiled) but it "fits the job" (better proposals???).

A different way: we can use OGNL. It says "...Most of what you can do in Java is possible in OGNL, plus other extras..." --Bollini 07:20, 2 August 2007 (EDT)

Core EL functions are documented under Sun JCP.

Proposal modeled on search index definitions (search.index.1 = author:dc.contributor.*):

sort.index.1 = dateissued: fn:substring(, 0, 10) #Keeps only 10 first characters of the date

sort.index.2 = firstauthor: fn:toUpperCase(fn:trim(,'--',9999-substring(, 0, 4) #Sort by author and then reverse publication year

sort.index.3 = sorttitle: ds:noLeadingArticle(fn:toUpperCase(fn:trim(dc.title.1.value)),dc.title.1.language) #Titles are sorted without leading article

Core EL functions are not all we need. Fortunately, extension mechanisms exist and can be used (how can this be done in line with the extension mechanisms currently being designed?). In the above example, we suppose that a JSTL library "ds" has been defined and contains a public static Java function String noLeadingArticle(String title, String language).
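The function named above could look like the following sketch. The signature comes from this page; the class name LeadingArticles, the article lists and the language codes are assumptions for illustration only, and a real list would need to be configurable per language.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class LeadingArticles {
    // Illustrative article lists keyed by language code (assumption, not DSpace configuration).
    private static final Map<String, List<String>> ARTICLES = new HashMap<>();
    static {
        ARTICLES.put("en", Arrays.asList("the ", "an ", "a "));
        ARTICLES.put("fr", Arrays.asList("les ", "le ", "la ", "l'", "une ", "un ", "des "));
    }

    // Returns the trimmed title without its leading article, if one is known for the language.
    public static String noLeadingArticle(String title, String language) {
        if (title == null) {
            return "";
        }
        String trimmed = title.trim();
        if (language == null) {
            return trimmed;
        }
        List<String> articles = ARTICLES.get(language.toLowerCase(Locale.ROOT));
        if (articles == null) {
            return trimmed; // unknown language: leave the title unchanged
        }
        for (String article : articles) {
            // Case-insensitive prefix test, so upper-cased titles are handled too.
            if (trimmed.regionMatches(true, 0, article, 0, article.length())) {
                return trimmed.substring(article.length()).trim();
            }
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(noLeadingArticle("The Art of Sorting", "en"));
    }
}
```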

Preprocessing may have to be done to convert the above definitions to real EL:

N.B. Sort indexes and Search indexes would not have the same names: a Sort index should be usable for string searches (not word searches) like those needed for PersistentIdentifiers

Introducing EL in DSpace, could it open other doors?

EL Expressions could:

Complete configuration example

(not using EL, the Expression Language)

We would have 3 kinds of Lucene indexes:

  1. index: value words are indexed (tokenization)
  2. identifier: each value is indexed as one token (as with KeywordTokenizer)
  3. sortindex: only one value is indexed and as is (not tokenized)

Different new configurable options:

  1. A cleanup function (Java class) could be called on each value to normalize dates, remove leading articles, etc. before the value is sent to Lucene for indexing.
  2. A suffix on the DSpace field name (-en is proposed) could specify that an index is built from a specific language version of a value.
  3. The Analyzer (and then the Filters and the Tokenizer) could vary from one field to another.
  4. A configuration option would provide the list of the indexes combined for a "simple" search.

Another would provide those proposed in the Advanced Search "index selection box".
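The cleanup function of option 1 could be a small Java interface, sketched here with one implementation that shortens a date value before it is indexed. The names ValueCleaner and DateCleaner are hypothetical, and how such a class would be named in the configuration is left open.

```java
// Hypothetical contract for a per-value cleanup step run before indexing.
interface ValueCleaner {
    String clean(String value);
}

// Keeps only the leading characters of a date value,
// e.g. 10 for a full day (yyyy-MM-dd) or 7 for a month (yyyy-MM).
class DateCleaner implements ValueCleaner {
    private final int keep;

    DateCleaner(int keep) {
        if (keep < 0) {
            throw new IllegalArgumentException("keep must not be negative");
        }
        this.keep = keep;
    }

    @Override
    public String clean(String value) {
        if (value == null) {
            return "";
        }
        String trimmed = value.trim();
        // Shorter values (e.g. a year alone) are kept as they are.
        return trimmed.length() <= keep ? trimmed : trimmed.substring(0, keep);
    }
}

class DateCleanerDemo {
    public static void main(String[] args) {
        ValueCleaner cleaner = new DateCleaner(10);
        System.out.println(cleaner.clean("2008-01-02T19:21:00Z"));
    }
}
```

Dropping the time part keeps the number of distinct sort values low, which is the concern raised about "Range" searches in the configuration comments.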

# Each index / identifier index / sort index below can be used
# in a search equation.
# The name of each should be defined in
search.analyzer =*,dc.creator.*, dc.description.statementofresponsibility
search.index.abstract=dc.description.abstract, dc.description.tableofcontents
# We replace "index" with "identifier" to indicate
# that each "dc.identifier" field value is one identifier
# (KeywordAnalyzer: no separation of words)
# One may wish to not search "fulltext" in the simple search
# Only untokenized fields (single valued) can be used for sorting:
# Untokenized fields may have to be cleaned up before indexing.
# Here, we remove time (and possibly days) to reduce
# the potential problem with "Range" searches:
# Only the first existing occurrence will be used
# Note "-en" used to indicate the desired language occurrence
# Index differently Arabic texts:
# List of advanced search indexes to propose in the Web UI (no more JSP modification)
# Some prefer not to have full text search when making a "simple" search:
search.combined=title,arabtitle,author,keyword,abstract,sponsor,series, identifier
# Sort type may combine more than one sort field:
# (Note the minus sign for decreasing order applied on one of the sort keys),sorttitle
# Order of the sort options proposed to the user (Advanced search, result browsing)
search.sort=date,journal,author1,title,lc,SCORE (example for any other languages)

# The following would NOT be used anymore:   = R\u00E9sum\u00E9s     = Auteurs         = Identificateur    = Tous les index   = Langue (ISO)     = Collections    = Organismes subventionnaires    = Sujets      = Titres

# The following are general names for defined indexes,
# used for all places needing them for user display
#(including Advanced Search JSP):
search.index.fulltext=Texte intégral
search.index.sorttitle=Titre LC
search.index.mention=Mention de responsabilité
search.index.arabtitle=Titre en Arabe
# Display names for possible sorts date
search.sort.title=par titre
search.sort.author1=par 1er auteur cote LC
search.sort.issn=par périodique
search.sort.SCORE=par intérêt
search.sort.DOC=par clé d'accès
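As a sketch of how the user interface could consume these keys, the following Java code reads the ordered list in search.sort and looks up each display name under search.sort.<name>, falling back to the name itself when no label is defined. The class name SortOptionsSketch and the fallback rule are assumptions; the sample values are a shortened subset of the example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

class SortOptionsSketch {
    // Returns sort name -> display label, in the order given by "search.sort".
    static Map<String, String> sortOptions(Properties config) {
        Map<String, String> options = new LinkedHashMap<>();
        String order = config.getProperty("search.sort", "");
        for (String token : order.split(",")) {
            String name = token.trim();
            if (name.isEmpty()) {
                continue;
            }
            // Fall back to the raw sort name when no display name is configured.
            options.put(name, config.getProperty("search.sort." + name, name));
        }
        return options;
    }

    public static void main(String[] args) throws IOException {
        Properties config = new Properties();
        config.load(new StringReader(
            "search.sort=title,author1,SCORE\n"
            + "search.sort.title=par titre\n"
            + "search.sort.author1=par 1er auteur\n"));
        for (Map.Entry<String, String> option : sortOptions(config).entrySet()) {
            System.out.println(option.getKey() + " -> " + option.getValue());
        }
    }
}
```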

Christophe.Dupriez 19:21, 2 January 2008 (EST)