The functionality is largely inspired by the official Solr de-duplication approach: for each item, one or more signatures are computed using pluggable implementations.
A signature is a value that summarizes the information in the item using a pluggable transformation (case-insensitive matching, ASCII transcription, identifier normalisation, etc.). Out-of-the-box implementations are included, based on the normalization of a single metadata field (such as an identifier or the title) or of a combination of metadata fields (such as title + year).
Two items are flagged as potential matches if they share at least one signature.
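The signature idea described above can be illustrated with a minimal Python sketch. It is not the DSpace-CRIS implementation (which is Java and configured via Spring beans); the function names, the metadata keys (title, year, doi) and the sample items are all hypothetical, chosen only to show how normalization produces signatures and how a single shared signature is enough to flag a pair.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(value):
    """Lowercase, transliterate to ASCII and drop non-alphanumeric characters."""
    ascii_value = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]", "", ascii_value.lower())


def identifier_signature(item):
    """Signature built from a single metadata field (hypothetical 'doi' key)."""
    doi = item.get("doi")
    return [("doi", normalize(doi))] if doi else []


def title_year_signature(item):
    """Signature built from a combination of fields: title + year."""
    title, year = item.get("title"), item.get("year")
    if not title or not year:
        return []
    return [("title_year", normalize(title) + str(year))]


# The "pluggable" part: any function returning a list of signatures can be added.
SIGNATURE_FUNCTIONS = [identifier_signature, title_year_signature]


def signatures(item):
    result = set()
    for compute in SIGNATURE_FUNCTIONS:
        result.update(compute(item))
    return result


def potential_matches(items):
    """Return the pairs of item ids that share at least one signature."""
    by_signature = defaultdict(set)
    for item_id, item in items.items():
        for signature in signatures(item):
            by_signature[signature].add(item_id)
    pairs = set()
    for ids in by_signature.values():
        ordered = sorted(ids)
        for index, first in enumerate(ordered):
            for second in ordered[index + 1:]:
                pairs.add((first, second))
    return pairs


items = {
    1: {"title": "Déjà Vu in Repositories", "year": 2020},
    2: {"title": "deja vu in repositories", "year": 2020, "doi": "10.1000/XYZ"},
    3: {"title": "Another paper", "year": 2021, "doi": "10.1000/xyz"},
    4: {"title": "Unrelated", "year": 2019},
}
print(sorted(potential_matches(items)))  # [(1, 2), (2, 3)]
```

Items 1 and 2 match on the title + year signature, items 2 and 3 on the identifier signature; items 1 and 3 share no signature, so they are not flagged as a pair.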
Feedback on potential matches (reject or duplicate flag) is stored in the database table dedup_reject.
Signatures and matched groups are computed when an item is updated and are stored in a dedicated Solr core, which makes checking for potential duplicates extremely fast and lightweight. This Solr core is maintained by DedupEventConsumer. A script, DedupClient, is provided to rebuild the index, or to build it for the first time if you are migrating from a previous version.
Two functionalities have two points of interaction with the users
The detection mechanism for CRIS objects is the same as the one illustrated above for DSpace items. Out of the box, it is possible to configure which metadata are used to identify duplicates within each object type. Custom signature algorithms can be implemented and activated via Spring beans in exactly the same way as for publications (DSpace items).
A batch script is provided to manage potential duplicates among CRIS objects.
usage: org.dspace.app.cris.batch.ScriptListAndRejectDedupObjects
 -c,--compare       compare two objects
 -h,--help          help
 -i,--id <arg>      object id
 -n,--note <arg>    reject note
 -r,--reject        reject two objects
 -t,--type <arg>    object type

USAGE:
List duplicates:
    -t <object type> [-i <object id>]
Compare two objects:
    -c -t <object type> -i <first object id> <second object id>
Reject two duplicate objects:
    -r -t <object type> -i <first object id> <second object id> [-n <reject note>]
So, to list all the groups of potential duplicates for researcher profiles, you need to execute
./dspace dsrun org.dspace.app.cris.batch.ScriptListAndRejectDedupObjects -t 9
Using -t 10 you will get the list of potential duplicates among projects, and with -t 11 among organisations. It is also possible to list potential duplicates of additional dynamic entities such as journals, awards, etc., once the dynamic object type is known (e.g. 1001, 1002, ...).
Info
Please note that the script shows only potential duplicates with status "active" (i.e. the CRIS entity MUST NOT be in the withdrawn state).
To flag a potential duplicate as a false detection, you need to run the script with the -r option, specifying the type of the objects (9 for researcher profiles, etc.) and the IDs of the two objects.
Please note that, contrary to what happens for the rejection of duplicate suggestions among DSpace items, the rejection is only stored in the deduplication Solr core. So if you rebuild the deduplication core using the org.dspace.app.cris.batch.DedupClient script, you can potentially lose this information.
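If the list and reject invocations described above need to be driven from another tool, the command lines can be assembled as in the following Python sketch. The helper names are hypothetical, as are the object IDs 101 and 102 and the note text; the option layout follows the usage text of the script, and the sketch only builds and prints the commands, it does not run them.

```python
import shlex

SCRIPT = "org.dspace.app.cris.batch.ScriptListAndRejectDedupObjects"


def list_command(object_type, object_id=None):
    """Build: -t <object type> [-i <object id>]"""
    argv = ["./dspace", "dsrun", SCRIPT, "-t", str(object_type)]
    if object_id is not None:
        argv += ["-i", str(object_id)]
    return argv


def reject_command(object_type, first_id, second_id, note=None):
    """Build: -r -t <object type> -i <first id> <second id> [-n <reject note>]"""
    argv = ["./dspace", "dsrun", SCRIPT, "-r", "-t", str(object_type),
            "-i", str(first_id), str(second_id)]
    if note:
        argv += ["-n", note]
    return argv


# 9 = researcher profiles; 101 and 102 are made-up object ids.
print(shlex.join(list_command(9)))
print(shlex.join(reject_command(9, 101, 102, note="different people")))
```

Each returned list can be passed to subprocess.run(argv, check=True) from the DSpace bin directory; passing a list rather than a single string avoids quoting problems with the reject note.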
The org.dspace.app.cris.batch.DedupClient script has been extended to support the -t parameter as well, so as to allow reindexing of specific object types.
A batch script is provided to merge different instances of a CRIS object into a single one. The script works on any kind of entity (researcher profiles, organisation units, projects, etc.) with the following rules
usage: ScriptMergeCrisObject
 -d,--delete                  delete merged objects, the default (without the -d option) is to disable them
 -h,--help                    help
 -m,--merge <arg>             CRIS ID(s) to merge into the target (use multiple m if needed - merge occurs respecting the order from left to right)
 -p,--replace_notempty <arg>  properties to override in the target with the values from the merged objects IF NOT EMPTY
 -r,--replace <arg>           properties to override in the target with the values from the merged objects
 -s,--skip                    properties to ignore during the merge
 -t,--target <arg>            CRIS ID to retain (merge target)
 -x,--exclude                 Don't merge properties, only move link from the merged object to the target

USAGE:
ScriptMergeCrisObjects -t <crisID> -m <toMergeCRIS-ID1> m <toMergeCRIS-ID2> .. m <toMergeCRIS-IDn> [-r propR1 -r propR2... -r propRN] [-p prop1 -p prop2... -p propN] [-s]
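The difference between the -r (replace) and -p (replace_notempty) options can be sketched in Python as follows. This is only an illustration of the option descriptions above, not the script's actual code: the property names and values are hypothetical, it assumes that with several merged objects the later ones win (one reading of "left to right"), it assumes that -r overwrites the target even when the merged value is empty, and it does not model what happens to properties that are not listed.

```python
def merge_properties(target, merged_objects, replace=(), replace_notempty=()):
    """Apply -r / -p style overrides from the merged objects onto a copy of the target."""
    result = dict(target)
    for merged in merged_objects:  # processed in the order given (left to right)
        for prop in replace:
            # -r: always take the merged object's value (assumed: even if empty)
            result[prop] = merged.get(prop)
        for prop in replace_notempty:
            # -p: take the merged object's value only IF NOT EMPTY
            value = merged.get(prop)
            if value:
                result[prop] = value
    return result


# Hypothetical researcher-profile properties.
target = {"fullName": "J. Smith", "email": "js@example.org", "phone": None}
merged = [{"fullName": "John Smith", "email": "", "phone": "555-0100"}]

print(merge_properties(
    target,
    merged,
    replace=["fullName"],
    replace_notempty=["email", "phone"],
))
```

Here fullName is overwritten unconditionally, email keeps the target's value because the merged value is empty, and phone is filled in from the merged object.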