The current Lucene implementation in DSpace does not specify any sort criteria for searches. I suppose "relevancy" is then used by default (right?)
This page discusses how other sort criteria could be parameterized.
Google demonstrates the advantage of listing the most useful pages first. "Usefulness" differs from one type of repository to another. In the case of PoisonCentre.be, our users prefer the most recent articles. In other contexts, relevancy, title, first author, affiliation, and even compound keys (e.g. year of publication + title) may be preferred.
A prototype patch has been submitted and may be examined on SourceForge to understand how this proposal could be implemented: Sorting simple/advanced Search results
Please do not hesitate to contact me for help implementing this patch: Christophe.Dupriez
This development is part of a bigger one I must do for managing links between records from DSpace (and non-DSpace) repositories, a little bit like links in a Wiki. Preliminary specifications have been written.
Fortunately, multiple sort criteria can be specified at search time, so a compound key will not be necessary. It will be possible to sort by a combination of sort keys at search time, for example: first author - reverse publication year - title. The publication year is sorted in reverse order to ensure that more recent publications are listed first.
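The ordering described above (first author, then reverse publication year, then title) can be illustrated outside Lucene with a plain Java comparator chain. This is only a sketch of the intended result order: the `SortableRecord` class and its fields are hypothetical and are not part of DSpace or Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for the three sort values of one record.
final class SortableRecord {
    final String firstAuthor;
    final int year;
    final String title;

    SortableRecord(String firstAuthor, int year, String title) {
        this.firstAuthor = firstAuthor;
        this.year = year;
        this.title = title;
    }
}

final class MultiKeySortDemo {
    // First author ascending, publication year descending, title ascending.
    static final Comparator<SortableRecord> ORDER =
        Comparator.comparing((SortableRecord r) -> r.firstAuthor)
            .thenComparing(Comparator.comparingInt((SortableRecord r) -> r.year).reversed())
            .thenComparing((SortableRecord r) -> r.title);

    // Returns a sorted copy and leaves the input list untouched.
    static List<SortableRecord> sorted(List<SortableRecord> records) {
        List<SortableRecord> copy = new ArrayList<>(records);
        copy.sort(ORDER);
        return copy;
    }
}
```

With separate keys, the reverse flag applies to the year alone, which is exactly what a single compound string key cannot express without tricks.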
The sort parameter could be a string made of sort keys (e.g. title, author, issuedate), each prefixed by a minus sign if the key must be sorted in reverse order. In the HTML form (JSP), the sort parameter could be written:
by Publication date
by Author
by Title
by Relevance
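A sort parameter of this form (comma-separated keys, with a minus sign for reverse order) could be parsed as sketched here. The class names `SortKey` and `SortParameter` are hypothetical. Each parsed key would presumably then be turned into a Lucene sort field with its reverse flag set, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// One parsed sort key: an index name plus a direction flag.
final class SortKey {
    final String name;
    final boolean reverse;

    SortKey(String name, boolean reverse) {
        this.name = name;
        this.reverse = reverse;
    }
}

final class SortParameter {
    // Parses e.g. "author,-issuedate,title" into an ordered list of sort keys.
    static List<SortKey> parse(String spec) {
        List<SortKey> keys = new ArrayList<>();
        if (spec == null) {
            return keys;
        }
        for (String part : spec.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                continue; // tolerate stray commas
            }
            boolean reverse = token.startsWith("-");
            String name = reverse ? token.substring(1).trim() : token;
            if (name.isEmpty()) {
                throw new IllegalArgumentException("Missing sort key name in: " + spec);
            }
            keys.add(new SortKey(name, reverse));
        }
        return keys;
    }
}
```

An empty or missing parameter yields an empty list, which the caller could treat as "sort by relevance".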
Two special sort names would be predefined:
DSpace often offers predefined processing for authors, subjects and titles. This basic proposal would be in line with that approach by predefining the essential sort index types:
Proposal modeled on search index definitions (search.index.1 = author:dc.contributor.*):
sort.index.1 = dateissued: date(dc.date.issued)
sort.index.2 = firstauthor: name(dc.contributor.author)
sort.index.3 = sorttitle: title(dc.title)
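Reading such a definition means splitting the property value into an index name, a predefined type and a metadata field. A minimal sketch, assuming the value always has the shape `name: type(field)`; the class `SortIndexDefinition` is hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SortIndexDefinition {
    // Expected shape: name: type(metadata.field)
    private static final Pattern SYNTAX =
        Pattern.compile("\\s*(\\w+)\\s*:\\s*(\\w+)\\s*\\(\\s*([\\w.*]+)\\s*\\)\\s*");

    final String name;   // e.g. "dateissued"
    final String type;   // e.g. "date", "name" or "title"
    final String field;  // e.g. "dc.date.issued"

    private SortIndexDefinition(String name, String type, String field) {
        this.name = name;
        this.type = type;
        this.field = field;
    }

    // Parses the right-hand side of a sort.index.N property.
    static SortIndexDefinition parse(String value) {
        Matcher m = SYNTAX.matcher(value);
        if (!m.matches()) {
            throw new IllegalArgumentException("Bad sort index definition: " + value);
        }
        return new SortIndexDefinition(m.group(1), m.group(2), m.group(3));
    }
}
```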
This basic implementation can be a step before going to something more complex.
I propose not to follow this proposal of mine for now. --Christophe.Dupriez 19:26, 2 January 2008 (EST)
The sort key (somewhat like a Lucene tokenization process) must be specified by defining a sometimes complex "string" expression. From my point of view, EL (JSTL Expression Language) has the required level of expressiveness, especially if an extension mechanism allows new Java String functions to be added. It is not ideal (there is no concatenation operator, so a function is required; it is interpreted, not compiled) but it "fits the job" (better proposals???).
A different way: we can use OGNL. It says "...Most of what you can do in Java is possible in OGNL, plus other extras..." --Bollini 07:20, 2 August 2007 (EDT)
Core EL functions are documented under Sun JCP.
Proposal modeled on search index definitions (search.index.1 = author:dc.contributor.*):
sort.index.1 = dateissued: fn:substring(dc.date.issued.1.value, 0, 10) #Keeps only 10 first characters of the date
sort.index.2 = firstauthor: fn:toUpperCase(fn:trim(dc.contributor.author.1.value)),'--',9999-fn:substring(dc.date.issued.1.value, 0, 4) #Sort by author and then reverse publication year
sort.index.3 = sorttitle: ds:noLeadingArticle(fn:toUpperCase(fn:trim(dc.title.1.value)),dc.title.1.language) #Titles are sorted without leading article
Core EL functions are not all we need. Fortunately, extension mechanisms exist and can be used (how can this be done in line with the extension mechanisms currently designed?). In the above example, we suppose that a JSTL library "ds" has been defined and contains a public static Java function String noLeadingArticle(String title, String language).
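One possible shape for the noLeadingArticle function is sketched here. The containing class `SortFunctions` and the article lists are assumptions made for illustration: a real deployment would need fuller, language-specific lists. Registering the function in the "ds" library is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SortFunctions {
    // Illustrative article lists only (longer articles listed before shorter ones).
    private static final Map<String, List<String>> ARTICLES = new HashMap<>();
    static {
        ARTICLES.put("en", Arrays.asList("THE ", "AN ", "A "));
        ARTICLES.put("fr", Arrays.asList("LES ", "LE ", "LA ", "L'", "UNE ", "UN ", "DES "));
    }

    // Returns the trimmed title without its leading article, if one is known
    // for the given language; otherwise returns the trimmed title unchanged.
    public static String noLeadingArticle(String title, String language) {
        if (title == null) {
            return "";
        }
        String trimmed = title.trim();
        if (language == null) {
            return trimmed;
        }
        // Reduce "en_US" or "en-GB" to "en".
        String lang = language.toLowerCase(Locale.ROOT).split("[_-]")[0];
        List<String> articles = ARTICLES.get(lang);
        if (articles == null) {
            return trimmed;
        }
        for (String article : articles) {
            // Case-insensitive prefix test; never strip the whole title.
            if (trimmed.length() > article.length()
                    && trimmed.regionMatches(true, 0, article, 0, article.length())) {
                return trimmed.substring(article.length()).trim();
            }
        }
        return trimmed;
    }
}
```

The prefix test is case-insensitive, so the function works whether or not the title has already been passed through fn:toUpperCase.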
Preprocessing may have to be done to convert the above definitions to real EL:
N.B. Sort indexes and Search indexes would not have the same names: a Sort index should be usable for string searches (not word searches) like those needed for PersistentIdentifiers
EL Expressions could:
(not using EL (Expression Language) )
We would have 3 kinds of Lucene indexes:
Different new configurable options:
Another would provide those proposed in the Advanced Search "index selection box".
# Each index / identifier index / sort index below can be used
# in a search equation.
# The name of each should be defined in Message.properties
search.analyzer = org.dspace.search.DSAnalyzer
search.index.author=dc.contributor.*,dc.creator.*, dc.description.statementofresponsibility
search.index.title=dc.title.*
search.index.keyword=dc.subject.*
search.index.abstract=dc.description.abstract, dc.description.tableofcontents
search.index.series=dc.relation.ispartofseries
search.index.sponsor=dc.description.sponsorship
# We replace "index" with "identifier" to indicate
# that each "dc.identifier" field value is a single identifier
# (KeywordAnalyzer: no separation of words)
search.identifier.mime=dc.format.mimetype
search.identifier.identifier=dc.identifier.*
search.identifier.language=dc.language.iso
# One may wish not to search "fulltext" in the simple search
search.combined=author,title,keyword,abstract,sponsor,identifier,series
# Only untokenized fields (single valued) can be used for sorting:
search.sortindex.dateissued=dc.date.issued
# Untokenized fields may have to be cleaned up before indexing.
# Here, we remove the time (and possibly the days) to reduce
# the potential problem with "Range" searches:
search.cleanup.dateissued=be.destin.dspace.dateCleanUp
# Only the first existing occurrence will be used
# Note "-en" used to indicate the desired language occurrence
search.sortindex.sorttitle=dc.title-en,dc.title-fr,dc.title
search.cleanup.sorttitle=be.destin.dspace.titleCleanUp
search.sortindex.lc=dc.subject.lc
search.sortindex.mention=dc.description.statementofresponsibility
search.identifier.issn=dc.identifier.issn
search.sortindex.author1=dc.contributor
# Index Arabic texts differently:
search.index.arabtitle=dc.title-ar
search.analyzer.arabtitle=gpl.pierrick.brihaye.aramorph.lucene.ArabicStemAnalyzer
# List of advanced search indexes to propose in the Web UI (no JSP modification anymore)
search.advanced=fulltext,title,arabtitle,author,keyword,abstract,sponsor,series,mime,language,identifier
# Some prefer not to have full text search when making a "simple" search:
search.combined=title,arabtitle,author,keyword,abstract,sponsor,series, identifier
# A sort type may combine more than one sort field:
# (Note the minus sign for decreasing order applied on one of the sort keys)
search.sort.date=-dateissued,sorttitle
search.sort.title=sorttitle,mention,-dateissued
search.sort.journal=issn,dateissued,sorttitle
search.sort.lc=lc,sorttitle
search.sort.author1=author1,-dateissued
# Order of the sort options proposed to the user (Advanced search, result browsing)
search.sort=date,journal,author1,title,lc,SCORE
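The "only the first existing occurrence will be used" rule for a sort index such as sorttitle could work as sketched here. The class `SortValueResolver` is hypothetical. The map from "field" or "field-language" to a value is also an assumption that stands in for however DSpace would expose an item's metadata.

```java
import java.util.Map;

final class SortValueResolver {
    // candidates: the configured list, e.g. "dc.title-en,dc.title-fr,dc.title"
    // values: hypothetical map from "field" or "field-language" to its first value
    // Returns the first non-blank value found, or null if no candidate exists.
    static String firstExisting(String candidates, Map<String, String> values) {
        for (String candidate : candidates.split(",")) {
            String value = values.get(candidate.trim());
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return null; // the item has no key for this sort index
    }
}
```

The returned string would then go through the configured clean-up class (for example the one named in search.cleanup.sorttitle) before being indexed as an untokenized field.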
Message.fr.properties (an example that applies to any other language)
# The following would NOT be used anymore:
jsp.search.advanced.type.abstract = R\u00E9sum\u00E9s
jsp.search.advanced.type.author = Auteurs
jsp.search.advanced.type.id = Identificateur
jsp.search.advanced.type.keyword = Tous les index
jsp.search.advanced.type.language = Langue (ISO)
jsp.search.advanced.type.series = Collections
jsp.search.advanced.type.sponsor = Organismes subventionnaires
jsp.search.advanced.type.subject = Sujets
jsp.search.advanced.type.title = Titres
# The following are general names for defined indexes,
# used for all places needing them for user display
# (including Advanced Search JSP):
search.index.fulltext=Texte intégral
search.index.combined=Partout
search.index.author=Auteurs
search.index.title=Titres
search.index.subject=Sujets
search.index.abstract=Résumés
search.index.series=Collections
search.index.sponsor=Sponsor
search.index.identifier=Identifiant
search.index.language=Langue
search.index.dateissued=Publication
search.index.sorttitle=Titre
search.index.lc=Cote LC
search.index.mention=Mention de responsabilité
search.index.issn=ISSN
search.index.arabtitle=Titre en Arabe
# Display names for possible sorts
search.sort.date=par date
search.sort.title=par titre
search.sort.author1=par 1er auteur
search.sort.lc=par cote LC
search.sort.issn=par périodique
search.sort.SCORE=par intérèt
search.sort.DOC=par clé d'accès
Christophe.Dupriez 19:21, 2 January 2008 (EST)