Sorting Search Results
The current Lucene implementation in DSpace does not specify any sort criteria for searches. I suppose results are then ordered by "relevancy" by default (right?).
This page discusses how other sort criteria could be parameterized.
Google demonstrates the advantage of bringing the most useful pages first. "Usefulness" means different things in different types of repositories. In the PoisonCentre.be case, our users prefer the most recent articles. In other contexts, relevancy, title, first author, affiliation, or even compound keys (e.g. year of publication + title) may be preferred.
Implementing one basic sort, how?
A prototype patch has been submitted and may be examined on SourceForge to understand how this proposal could be implemented: Sorting simple/advanced Search results
Please do not hesitate to contact me for help implementing this patch: Christophe.Dupriez
This development is part of a bigger one I must do for managing links between records from DSpace (and non-DSpace) repositories, a little like links in a Wiki. Preliminary specifications have been written.
Different kinds of Sort
- Relevancy: no sort key needs to be specified to Lucene.
- IssueDate: an UNTOKENIZED field must be added to Lucene, containing the issue timestamp (or just its first 10 characters for date-only indexing). Sorting on an integer is more efficient: the date could be rewritten as YYYYMMDD (8 digits).
- Title: an UNTOKENIZED field must be added to Lucene. Depending on the language, articles may have to be removed from the beginning of the title. Case and punctuation have to be normalized. All this must be done before storing the value in the Lucene field, because an UNTOKENIZED field is indexed as is.
- First author: an UNTOKENIZED field is (by definition) single valued. Lucene is therefore unable to generate a sort result where multiple entries exist for a single record (e.g. one for each author). Punctuation normalization is important for author names.
Fortunately, multiple sort criteria can be specified at search time, so compound keys will not be necessary. It will be possible to sort by the following combination of sort keys at search time: first author - reverse publication year - title. The publication year sort is in reverse order to ensure that more recent publications are listed first.
- Localized sort names will have to be defined in the different Message.properties files.
- A menu of available sort orders could prefix all search results to allow users to choose the one they prefer.
- Browsing: sort indexes could be used for a very efficient implementation of some of the current browsing types: title, date, first author. This could use the Lucene class TermEnum. Other indexes would have to be created for fields with multiple values (TOKENIZED on a delimiter character which separates field occurrences).
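The combination of sort keys mentioned above (first author, then reverse publication year, then title) can be illustrated with plain Java comparators. This is only a sketch of the intended ordering, not of the Lucene call that would produce it at search time; the SortableItem class and its fields are hypothetical and assume the keys were already normalized at indexing time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for the three single-valued sort keys of one item. */
final class SortableItem {
    final String firstAuthor; // assumed already normalized (upper case, punctuation)
    final int year;           // publication year
    final String sortTitle;   // assumed already normalized, no leading article

    SortableItem(String firstAuthor, int year, String sortTitle) {
        this.firstAuthor = firstAuthor;
        this.year = year;
        this.sortTitle = sortTitle;
    }
}

final class CompoundSortSketch {
    /** First author ascending, then year descending, then title ascending. */
    static final Comparator<SortableItem> AUTHOR_YEAR_DESC_TITLE =
        Comparator.comparing((SortableItem i) -> i.firstAuthor)
            .thenComparing(Comparator.comparingInt((SortableItem i) -> i.year).reversed())
            .thenComparing((SortableItem i) -> i.sortTitle);

    /** Returns a sorted copy; the input list is left untouched. */
    static List<SortableItem> sorted(List<SortableItem> items) {
        List<SortableItem> copy = new ArrayList<>(items);
        copy.sort(AUTHOR_YEAR_DESC_TITLE);
        return copy;
    }
}
```

Because each key is compared only when the previous ones are equal, no concatenated compound key is needed, which is the point made above.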
Specifying a Sort Order at search time
This will be a supplementary parameter to the search servlet. The search form may include a drop-down menu for this parameter.
It could be a string made of sort keys (e.g. title, author, issuedate) prefixed by a minus sign if the key must be in reverse order. The sort parameter in the HTML form (JSP) could be written:
by Publication date
Two special sort names would be predefined:
- DOC: Lucene internal document number
- SCORE: relevance score of the document for the search expression
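As a minimal sketch of how the servlet might decode such a parameter, the following plain Java splits the string on commas and treats a leading minus sign as "reverse order". The class names ParsedSortKey and SortParameterParser are hypothetical; mapping each parsed key to a Lucene sort field (and handling the special names DOC and SCORE) is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical result type: one sort key name plus its direction. */
final class ParsedSortKey {
    final String name;
    final boolean reverse;

    ParsedSortKey(String name, boolean reverse) {
        this.name = name;
        this.reverse = reverse;
    }
}

final class SortParameterParser {
    /** Parses e.g. "-dateissued,sorttitle" into an ordered list of keys. */
    static List<ParsedSortKey> parse(String parameter) {
        List<ParsedSortKey> keys = new ArrayList<>();
        if (parameter == null) {
            return keys; // no parameter: caller falls back to relevancy
        }
        for (String raw : parameter.split(",")) {
            String token = raw.trim();
            boolean reverse = token.startsWith("-");
            String name = reverse ? token.substring(1).trim() : token;
            if (!name.isEmpty()) { // silently skip empty entries such as ",,"
                keys.add(new ParsedSortKey(name, reverse));
            }
        }
        return keys;
    }
}
```

An empty result would simply mean that no sort is passed to Lucene, i.e. the current behaviour.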
Specifying a Sort Key (basic proposal)
DSpace often proposes predefined processing for authors, subjects and titles. This basic proposal would be in line with this approach by predefining the essential sort index types:
- date: extract an 8-digit integer YYYYMMDD from a field containing a date as a string
- name: put field content in upper case and normalize punctuation
- title: put field content in upper case, normalize punctuation and remove leading articles
- default: put field content in upper case
Proposal modeled on search index definitions (search.index.1 = author:dc.contributor.*):
sort.index.1 = dateissued: date(dc.date.issued)
sort.index.2 = firstauthor: name(dc.contributor.author)
sort.index.3 = sorttitle: title(dc.title)
This basic implementation can be a step before going to something more complex.
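The four predefined types could be sketched as small Java functions applied to a value before it is handed to Lucene. Everything here is an assumption made for illustration: the class name BasicSortKeys, the rule that "normalizing punctuation" means replacing runs of non-alphanumeric characters by one space, the padding of partial dates with zeros, the language argument of titleKey, and the (deliberately incomplete) article lists.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class BasicSortKeys {
    // Hypothetical, incomplete article lists keyed by language code.
    private static final Map<String, List<String>> ARTICLES = new HashMap<>();
    static {
        ARTICLES.put("en", Arrays.asList("THE", "AN", "A"));
        ARTICLES.put("fr", Arrays.asList("LES", "LE", "LA", "L", "UNE", "UN"));
    }

    /** date: first 8 digits of a zero-padded ISO-like date; "2008" becomes 20080000. */
    static int dateKey(String value) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < value.length() && digits.length() < 8; i++) {
            char c = value.charAt(i);
            if (c >= '0' && c <= '9') {
                digits.append(c);
            }
        }
        if (digits.length() == 0) {
            return 0; // no usable date
        }
        while (digits.length() < 8) {
            digits.append('0'); // assumption: missing month/day sort first
        }
        return Integer.parseInt(digits.toString());
    }

    /** name: upper case, runs of punctuation/whitespace collapsed to one space. */
    static String nameKey(String value) {
        return value.toUpperCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    /** title: like name, then drop one leading article of the given language. */
    static String titleKey(String value, String language) {
        String key = nameKey(value);
        List<String> articles = ARTICLES.get(language);
        if (articles != null) {
            for (String article : articles) {
                if (key.startsWith(article + " ")) {
                    return key.substring(article.length() + 1);
                }
            }
        }
        return key;
    }

    /** default: upper case only. */
    static String defaultKey(String value) {
        return value.toUpperCase(Locale.ROOT);
    }
}
```

With these assumptions, a title such as "The Art of War" would be indexed under the key "ART OF WAR", and an issue timestamp would be reduced to a single integer suitable for an integer sort.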
Specifying a Sort Key (more complex proposal)
I propose not to follow this proposal of mine for now. Christophe.Dupriez 19:26, 2 January 2008 (EST)
The sort key (somewhat like a Lucene tokenization process) must be specified by defining a sometimes complex "string" expression. From my point of view, EL (JSTL Expression Language) has the required level of expressiveness, especially if an expansion mechanism allows new Java String functions to be added. It is not ideal (no concatenation operator: a function is required; it is interpreted, not compiled) but it "fits the job" (better proposals???).
Core EL functions are documented under Sun JCP.
Proposal modeled on search index definitions (search.index.1 = author:dc.contributor.*):
sort.index.1 = dateissued: fn:substring(dc.date.issued.1.value, 0, 10) #Keeps only 10 first characters of the date
sort.index.2 = firstauthor: fn:toUpperCase(fn:trim(dc.contributor.author.1.value)),'--',9999-fn:substring(dc.date.issued.1.value, 0, 4) #Sort by author and then reverse publication year
sort.index.3 = sorttitle: ds:noLeadingArticle(fn:toUpperCase(fn:trim(dc.title.1.value)),dc.title.1.language) #Titles are sorted without leading article
Core EL functions are not all we need. Fortunately, extension mechanisms exist and can be used (how can this be done in line with the extension mechanisms currently being designed?). In the above example, we suppose a JSTL library "ds" has been defined and contains a public static Java function String noLeadingArticle(String title, String language).
Preprocessing may have to be done to convert above definitions to real EL:
- Commas at the first level could indicate concatenation of multiple strings: the definition would be interpreted as a sequence of EL expressions to be concatenated
- dc... : It may be cumbersome to load a TreeMap of TreeMaps to represent the metadata of an Item (For instance: TreeMap dc containing TreeMap date, TreeMap contributor and TreeMap title; TreeMap date containing TreeMap issued; TreeMap issued containing TreeMap 1 containing String value, String language, etc.). Imagine the work for DSIndexer having to rebuild the whole thing! It may be much more interesting to define shorthand notations like dc.date.issued which could be automatically translated to ds:get('dc.date.issued')
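The two preprocessing steps above could look roughly like the following Java sketch, under the assumption that a definition is first split at top-level commas and that each part then has its dc... shorthands rewritten. The class and method names are hypothetical, and the shorthand pattern is naive: it would also rewrite a dc.x.y sequence appearing inside a quoted literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SortExpressionPreprocessor {
    // Naive shorthand pattern: "dc" followed by one or more ".segment" parts.
    private static final Pattern DC_FIELD = Pattern.compile("\\bdc(\\.[A-Za-z0-9]+)+");

    /** Splits at commas that are outside parentheses and single quotes. */
    static List<String> splitTopLevel(String definition) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        boolean quoted = false;
        for (char c : definition.toCharArray()) {
            if (c == '\'') {
                quoted = !quoted;
            }
            if (!quoted) {
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                } else if (c == ',' && depth == 0) {
                    parts.add(current.toString().trim());
                    current.setLength(0);
                    continue; // the separating comma itself is dropped
                }
            }
            current.append(c);
        }
        parts.add(current.toString().trim());
        return parts;
    }

    /** Rewrites dc.date.issued into ds:get('dc.date.issued'). */
    static String expandShorthand(String expression) {
        return DC_FIELD.matcher(expression).replaceAll("ds:get('$0')");
    }
}
```

Each resulting part would then be evaluated as one EL expression and the results concatenated; how the evaluation itself is wired in is not covered here.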
N.B. Sort indexes and Search indexes would not have the same names: a Sort index should be usable for string searches (not word searches) like those needed for PersistentIdentifiers
Introducing EL in DSpace, could it open other doors?
EL Expressions could:
- streamline many aspects of JSP (JSP UI without Java, only JSTL tags and EL ! ) and could help the transition toward Manakin (comments???)
- specify any custom information to display using DSpace data without modifying JSP
- specify individual field display (I will make a presentation of the LinkOut concept I developed for the PoisonCentre)
- help specify pre-processing of metadata before indexing
Complete configuration example
(not using EL (Expression Language) )
We would have 3 kinds of Lucene indexes:
- index: value words are indexed (tokenization)
- identifier: each value is indexed as one token (as with KeywordTokenizer)
- sortindex: only one value is indexed and as is (not tokenized)
Different new configurable options:
- A cleanup function (Java class) could be called on each value to normalize dates, remove leading articles, etc. before sending the value to Lucene for indexing.
- A suffix to the DSpace field name (-en is proposed) could specify that an index is built from a specific language version of a value.
- The Analyzer (and therefore the Filters and the Tokenizer) could vary from one field to another.
- One configuration option would provide the list of indexes combined for a "simple" search. Another would provide those proposed in the Advanced Search "index selection box".
# Each index / identifier index / sort index below can be used
# in a search equation.
# The name of each should be defined in Message.properties
search.analyzer = org.dspace.search.DSAnalyzer
search.index.author=dc.contributor.*,dc.creator.*, dc.description.statementofresponsibility
search.index.title=dc.title.*
search.index.keyword=dc.subject.*
search.index.abstract=dc.description.abstract, dc.description.tableofcontents
search.index.series=dc.relation.ispartofseries
search.index.sponsor=dc.description.sponsorship
# We change "index" by "identifier" to indicate
# that each "dc.identifier" fields values are one identifier
#(KeywordAnalyzer: no separation of words)
search.identifier.mime=dc.format.mimetype
search.identifier.identifier=dc.identifier.*
search.identifier.language=dc.language.iso
# One may wish to not search "fulltext" in the simple search
search.combined=author,title,keyword,abstract,sponsor,identifier,series
# Only untokenized fields (single valued) can be used for sorting:
search.sortindex.dateissued=dc.date.issued
# Untokenized fields may have to be cleaned up before indexing.
# Here, we remove time (and possibly days) to reduce
# the potential problem with "Range" searches:
search.cleanup.dateissued=be.destin.dspace.dateCleanUp
# Only the first existing occurrence will be used
# Note "-en" used to indicate the desired language occurrence
search.sortindex.sorttitle=dc.title-en,dc.title-fr,dc.title
search.cleanup.sorttitle=be.destin.dspace.titleCleanUp
search.sortindex.lc=dc.subject.lc
search.sortindex.mention=dc.description.statementofresponsibility
search.identifier.issn=dc.identifier.issn
search.sortindex.author1=dc.contributor
# Index differently Arabic texts:
search.index.arabtitle=dc.title-ar
search.analyzer.arabtitle=gpl.pierrick.brihaye.aramorph.lucene.ArabicStemAnalyzer
# List of advanced search indexes to proposes in Web UI (no JSP modification anymore)
search.advanced=fulltext,title,arabtitle,author,keyword,abstract,sponsor,series,mime,language,identifier
# Some prefers not to have full text search when making a "simple" search:
search.combined=title,arabtitle,author,keyword,abstract,sponsor,series, identifier
# Sort type may combine more than one sort field:
# (Note the minus sign for decreasing order applied on one of the sort keys)
search.sort.date=-dateissued,sorttitle
search.sort.title=sorttitle,mention,-dateissued
search.sort.journal=issn,dateissued,sorttitle
search.sort.lc=lc,sorttitle
search.sort.author1=author1,-dateissued
# Order of the sort options proposed to the user (Advanced search, result browsing)
search.sort=date,journal,author1,title,lc,SCORE
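The "only the first existing occurrence will be used" rule for a search.sortindex.* entry could be sketched as follows. This is an illustration only: the class SortIndexResolver is hypothetical, and the item metadata is assumed to be available as a simple map from field name (with its optional language suffix, e.g. dc.title-fr) to a single value, which is not how DSpace represents metadata internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;

final class SortIndexResolver {
    private final Properties config = new Properties();

    /** Loads the configuration from text in java.util.Properties syntax. */
    SortIndexResolver(String configText) {
        try {
            config.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns the value of the first candidate field listed in
     * search.sortindex.<indexName> that the item actually has, or null.
     */
    String sortValue(String indexName, Map<String, String> itemMetadata) {
        String candidates = config.getProperty("search.sortindex." + indexName);
        if (candidates == null) {
            return null; // no such sort index configured
        }
        for (String field : candidates.split(",")) {
            String value = itemMetadata.get(field.trim());
            if (value != null && !value.isEmpty()) {
                return value; // first existing occurrence wins
            }
        }
        return null;
    }
}
```

For an item that has only a French title and an unqualified title, the sorttitle index configured above would pick the French one, since dc.title-fr is listed before dc.title. The configured cleanup class would then be applied to the returned value before indexing.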
Message.fr.properties (example for any other languages)
# The following would NOT be used anymore:
jsp.search.advanced.type.abstract = R\u00E9sum\u00E9s
jsp.search.advanced.type.author = Auteurs
jsp.search.advanced.type.id = Identificateur
jsp.search.advanced.type.keyword = Tous les index
jsp.search.advanced.type.language = Langue (ISO)
jsp.search.advanced.type.series = Collections
jsp.search.advanced.type.sponsor = Organismes subventionnaires
jsp.search.advanced.type.subject = Sujets
jsp.search.advanced.type.title = Titres
# The following are general names for defined indexes,
# used for all places needing them for user display
#(including Advanced Search JSP):
search.index.fulltext=Texte intégral
search.index.combined=Partout
search.index.author=Auteurs
search.index.title=Titres
search.index.subject=Sujets
search.index.abstract=Résumés
search.index.series=Collections
search.index.sponsor=Sponsor
search.index.identifier=Identifiant
search.index.language=Langue
search.index.dateissued=Publication
search.index.sorttitle=Titre
search.index.lc=Cote LC
search.index.mention=Mention de responsabilité
search.index.issn=ISSN
search.index.arabtitle=Titre en Arabe
# Display names for possible sorts
search.sort.date=par date
search.sort.title=par titre
search.sort.author1=par 1er auteur
search.sort.lc=par cote LC
search.sort.issn=par périodique
search.sort.SCORE=par intérèt
search.sort.DOC=par clé d'accès
Christophe.Dupriez 19:21, 2 January 2008 (EST)