One of the major goals of this event-based indexing approach is to reduce the impact of indexing on core repository functionality. The repository just creates a JMS event (containing only the resource identifier and the event type, which are already in memory), and does not need to do any extra work for indexing before moving on to its next task. When repository updates happen at a faster rate than the indexer can match, JMS events can wait in the queue until the indexer catches up, and the updates can continue without waiting. When processing large batches of updates, you can even disable the indexer.
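The decoupling described above can be illustrated with a minimal in-memory sketch. It uses a java.util.concurrent.BlockingQueue as a stand-in for the JMS queue. The class name, the RepositoryEvent type, the event type strings, and the resource paths are all hypothetical names chosen for illustration and are not taken from the Fedora code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-memory stand-in for the JMS queue; a real deployment would use a JMS broker.
class QueueDecouplingSketch {

    // Hypothetical event shape: only the resource identifier and the event type.
    static final class RepositoryEvent {
        final String resourceId;
        final String eventType;

        RepositoryEvent(String resourceId, String eventType) {
            this.resourceId = resourceId;
            this.eventType = eventType;
        }
    }

    private final BlockingQueue<RepositoryEvent> queue = new LinkedBlockingQueue<>();

    // Repository side: enqueue the event and return; no indexing work happens here.
    void publish(String resourceId, String eventType) {
        queue.add(new RepositoryEvent(resourceId, eventType));
    }

    // Number of events still waiting for the indexer.
    int pending() {
        return queue.size();
    }

    // Indexer side: take everything that has accumulated, in arrival order.
    List<RepositoryEvent> drain() {
        List<RepositoryEvent> batch = new ArrayList<>();
        queue.drainTo(batch);
        return batch;
    }

    public static void main(String[] args) {
        QueueDecouplingSketch sketch = new QueueDecouplingSketch();
        // The repository keeps accepting updates while the indexer is busy or disabled.
        sketch.publish("/objects/foo", "CREATED");
        sketch.publish("/objects/foo", "UPDATED");
        sketch.publish("/objects/bar", "DELETED");
        System.out.println("pending before drain: " + sketch.pending());
        for (RepositoryEvent e : sketch.drain()) {
            System.out.println(e.eventType + " " + e.resourceId);
        }
        System.out.println("pending after drain: " + sketch.pending());
    }
}
```

The publisher never blocks on the consumer here, which is the property the event-based design relies on: events simply accumulate until the indexer drains them.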
The indexer can have any number of workers configured to process the events. The main indexer process retrieves the resource RDF from the repository, and that content is reused by every worker. If you want to process the events in several ways (triplestore, Solr, archive to disk, update remote repository, etc.), the metadata is therefore retrieved from the repository only once each time the resource is updated.
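The fetch-once, fan-out behaviour described above might look roughly like the following sketch. The Worker interface, the FanOutSketch class, and the fetcher function are hypothetical stand-ins for illustration; the real indexer wires its workers through Spring and uses its own types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class FanOutSketch {

    // Hypothetical worker contract; not the actual org.fcrepo.indexer API.
    interface Worker {
        void index(String resourceId, String rdf);
    }

    private final Function<String, String> fetcher;
    private final List<Worker> workers;
    private int fetchCount = 0;

    FanOutSketch(Function<String, String> fetcher, List<Worker> workers) {
        this.fetcher = fetcher;
        this.workers = workers;
    }

    // Handle one update event: retrieve the RDF once, then hand it to every worker.
    void onUpdate(String resourceId) {
        String rdf = fetcher.apply(resourceId);
        fetchCount++;
        for (Worker worker : workers) {
            worker.index(resourceId, rdf);
        }
    }

    // How many times the repository was asked for content.
    int fetchCount() {
        return fetchCount;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<Worker> workers = new ArrayList<>();
        workers.add((id, rdf) -> log.add("triplestore:" + id));
        workers.add((id, rdf) -> log.add("disk:" + id));

        // Fake fetcher standing in for an HTTP request to the repository.
        Function<String, String> fetcher = id -> "<" + id + "> a <http://example.org/Thing> .";

        FanOutSketch dispatcher = new FanOutSketch(fetcher, workers);
        dispatcher.onUpdate("http://localhost:8080/rest/objects/foo");
        System.out.println(log + " fetches=" + dispatcher.fetchCount());
    }
}
```

Adding a third worker to the list would not change the number of repository requests per event, which is the point of sharing the retrieved content.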
Several different indexer modules exist for syncing with different systems:
- ElasticIndexer - syncs with Elasticsearch
- FileSerializer - saves Solr document format to disk
- JcrXmlPersistenceIndexer - saves JCR/XML to disk
- RdfPersistenceIndexer - saves RDF to disk
- SolrIndexer - syncs with Solr
- SparqlIndexer - syncs with triplestores using SPARQL Update
The indexer is configured using Spring. Here is a sample configuration fragment showing three workers (saving RDF to disk, persisting JCR/XML, and syncing to a Jena Fuseki triplestore) and the framework for listening to events and connecting them with the workers:
To use another triplestore, change the SparqlIndexer bean configuration. Here is the bean configuration to use with Sesame running on port 8081:
Extending the Indexer
To implement a new kind of indexer:
- Implement the indexing functionality using the org.fcrepo.indexer.Indexer interface, which consists of only two methods (one to handle new/updated records, and another to handle deleted records). Any configuration required should be done using Java bean setter methods.
- Update the Spring configuration to add a bean referencing the new class and providing the configuration properties needed.
- Add the bean to the list of workers invoked by the indexer.
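The steps above can be sketched as follows. Because the exact signatures of org.fcrepo.indexer.Indexer are not reproduced in this page, the sketch declares its own two-method SimpleIndexer interface as a hypothetical stand-in; the method names, parameter types, and the InMemoryIndexer class are assumptions for illustration only. A real implementation would implement the actual interface and be registered as a Spring bean.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CustomIndexerSketch {

    // Hypothetical stand-in for org.fcrepo.indexer.Indexer; real signatures may differ.
    interface SimpleIndexer {
        void update(URI id, String content); // new or updated record
        void remove(URI id);                 // deleted record
    }

    // Toy indexer that keeps records in memory instead of an external system.
    static class InMemoryIndexer implements SimpleIndexer {
        private final Map<URI, String> index = new ConcurrentHashMap<>();
        private String label = "default";

        // Bean-style setter so a Spring configuration could inject the property.
        public void setLabel(String label) {
            this.label = label;
        }

        public String getLabel() {
            return label;
        }

        @Override
        public void update(URI id, String content) {
            index.put(id, content);
        }

        @Override
        public void remove(URI id) {
            index.remove(id);
        }

        public int size() {
            return index.size();
        }
    }

    public static void main(String[] args) {
        InMemoryIndexer indexer = new InMemoryIndexer();
        indexer.setLabel("demo"); // what a Spring <property> element would do
        URI foo = URI.create("http://localhost:8080/rest/objects/foo");
        indexer.update(foo, "<> a <http://example.org/Thing> .");
        System.out.println(indexer.getLabel() + " size after update: " + indexer.size());
        indexer.remove(foo);
        System.out.println(indexer.getLabel() + " size after remove: " + indexer.size());
    }
}
```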
Trying Out the Indexer
To get hands-on experience with the indexer and see updates synced with an external triplestore, you need three components. Each component will potentially run in its own application container. The three components are:
- Triplestore (Fuseki or Sesame)
- Fedora 4 Repository
- JMS event listener/indexer
The triplestore and Fedora 4 do not need to be aware of each other or of the JMS listener. However, the event listener needs to know the web endpoints of both the triplestore and Fedora 4. It is therefore important that you start the three components on different ports.
Instructions on how to start up and configure the three components follow:
1. Triplestore
- The easiest to set up is Jena Fuseki (Fuseki setup instructions).
- Alternatively, you can set up Sesame (Sesame setup instructions).
2. Fedora Repository
You can deploy Fedora 4 either by downloading the latest WAR file and dropping it into an application container (e.g. Tomcat 7), or by cloning the fcrepo4 Git project and running fcrepo-webapp directly within the code base.
See the following pages for details on either approach:
3. JMS Event Indexer
You can deploy the JMS event listener/indexer either by downloading the latest WAR file and dropping it into an application container (e.g. Tomcat 7), or by cloning the fcrepo-message-consumer project and running fcrepo-message-consumer-pluggable directly within the code base. Building the project from source will likely make it easier to configure the JMS event listener/indexer.
You can specify the connection to either Fuseki or Sesame in the following configuration file.
- By default, Fuseki is expected.
- To connect to Sesame instead, comment out the "queryBase", "updateBase", and "formUpdates" XML elements associated with Fuseki, and uncomment the corresponding Sesame XML elements in the configuration file mentioned above.
To configure the JMS indexer to connect to the Fedora Repository, you can set the following system variables:
To configure the JMS indexer to connect to the triplestore, you can set the following system variables:
... or if you are using Sesame:
Finally, you may need to set the output directory for the FileSerializer (a testing class that shows what is being indexed):
Below is an example of how to download, build, and start the JMS indexer.
If the Fedora Repository is running at http://localhost:8080/rest/, you can create, update, and delete resources using your browser or the REST API (see SPARQL Recipes). Each event will trigger the indexer and be synced to Fuseki (or Sesame), which you can access at http://localhost:3030/ (if Fuseki is running on its default port).
If you have a repository with existing content that you want to index, or have changed your indexing logic and want to reindex content, you can use the reindex REST API call in the indexer webapp.
To reindex the resource http://localhost:8080/rest/objects/ and all of its children:
To reindex just the resource http://localhost:8080/rest/objects/foo/, but not recursively reindex its children, add the
Indexing Multiple Repositories to a Single Triplestore
In some situations it is desirable to have multiple Fedora repositories all feeding into a single external triplestore. To accomplish this, install and set up the three components (triplestore, Fedora 4 Repository, and JMS event listener/indexer) as follows:
Follow the instructions above to install the triplestore (Fuseki or Sesame) on one machine and start it.
Follow the instructions above to install two or more Fedora 4 Repositories on different machines and start them.
Install the JMS event listener/indexer (https://github.com/fcrepo4/fcrepo-message-consumer) for each Fedora 4 repository installation and start the indexer with the following command:
To make a resource indexable in the triplestore, the resource needs to include the indexable mixin type, which can be inserted through a SPARQL insert:
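As a rough illustration of what such a SPARQL insert could look like, the sketch below builds a SPARQL Update string that adds an rdf:type statement to the target resource (the empty relative IRI `<>`). The type URI used in main is a placeholder under http://example.org/, not the real mixin; substitute the indexable mixin type documented for your Fedora version. The class and method names are hypothetical, and sending the update to the repository is left out.

```java
class IndexableMixinSketch {

    // Build a SPARQL Update that adds the given rdf:type to the resource it is sent to.
    static String insertTypeUpdate(String typeUri) {
        return "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
             + "INSERT { <> rdf:type <" + typeUri + "> } WHERE { }";
    }

    public static void main(String[] args) {
        // Placeholder URI (assumption): replace with the actual indexable mixin type.
        String placeholderType = "http://example.org/ns#Indexable";
        System.out.println(insertTypeUpdate(placeholderType));
    }
}
```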
- Start the triplestore first. If the triplestore is restarted, then the JMS event listener/indexer needs to be restarted, too.