
DSpace Log Converter

With the release of DSpace 1.6, a new statistics software component was added. DSpace's use of SOLR for statistics makes it possible to keep a database of usage statistics. With this in mind, sites need a way to make use of their older log files. The following command converts the existing log files and imports them for SOLR use. This only needs to be done once.

...

Using this option is strongly recommended. Run this script daily (from crontab or your system's scheduler) to prevent your servlet container from running out of memory.

...

SOLR Sharding By Year

Command used:

[dspace]/bin/dspace stats-util

Java class:

org.dspace.statistics.util.StatisticsClient

Arguments (short and long forms):

Description

-s or shard-solr-index

Splits the data in the main core into a separate Solr core for each year, which improves Solr performance.

Notes:

Yearly Solr sharding is a routine that can drastically improve the performance of your DSpace SOLR statistics. It was introduced in DSpace 3.0 and is not backwards compatible. The routine decreases the load created by the logging of new usage events by reducing the size of the SOLR core in which new usage data are being logged. By running the script, you effectively split your current SOLR core, containing all of your usage events, into different SOLR cores that each contain the data for one year. If your DSpace has been logging usage events for less than one year, you will see no notable performance improvements until you run the script after the start of a new year. Both writing new usage events and read operations should perform better over several smaller SOLR shards than over one monolithic core.

It is recommended that you execute this script once at the start of every year. To ensure this is not forgotten, you can include it in your crontab or other system scheduling software.

Technical implementation details:

After sharding, the SOLR data cores are located in the [dspace.dir]/solr directory. There is no need to define the location of each individual core in solr.xml because they are automatically retrieved at runtime. This retrieval happens in the static method located in the org.dspace.statistics.SolrLogger class. These cores are stored in the statisticYearCores list. Each time a query is made to Solr, these cores are added as shards by the addAdditionalSolrYearCores method. The cores share a common configuration copied from your original statistics core, so subsequent ant updates should not cause issues.
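As a rough illustration of what adding cores as shards means, the sketch below builds the comma-separated value for Solr's `shards` request parameter from a list of per-year cores. It is written in Python for brevity and is not the DSpace code: the function name `add_year_cores`, the host and port, and the `statistics-YYYY` core names are assumptions made for this example.

```python
def add_year_cores(params, base_core, year_cores):
    """Return a copy of params with a Solr `shards` value listing every core.

    base_core and year_cores are host:port/path strings (hypothetical values
    in this sketch). With no per-year cores, the query is left untouched.
    """
    if not year_cores:
        return dict(params)
    # Solr expects the shard addresses as one comma-separated string.
    shards = ",".join([base_core, *year_cores])
    return {**params, "shards": shards}


# Hypothetical example: one main core plus two per-year cores.
query = add_year_cores(
    {"q": "*:*", "rows": "0"},
    "localhost:8080/solr/statistics",
    [
        "localhost:8080/solr/statistics-2012",
        "localhost:8080/solr/statistics-2011",
    ],
)
```

The main core is listed first so that a query still covers the events that have not been moved into a per-year core.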

The actual sharding of the original Solr core into individual cores by year is done in the shardSolrIndex method in the org.dspace.statistics.SolrLogger class. The sharding is done by first running a facet on the time field to get the facets split by year. Once we have the years from the logs, we query the main Solr data server for all information on each year and download it as CSV files. When we have all data for one year, we upload it to the newly created core of that year by using the update csvhandler. Once all data of one year has been uploaded, that data is removed from the main Solr core (done this way, if Solr crashes we do not need to start from scratch).
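The order of operations described above (facet by year, export, upload, then delete) can be sketched with in-memory lists standing in for Solr cores. This is a simplified Python model, not the shardSolrIndex implementation: the helper names, the `statistics-YYYY` core names, and the two-field records are assumptions for illustration, and the sketch moves every year it finds.

```python
import csv
import io
from collections import Counter


def facet_years(core):
    """Count records per year, mimicking a facet on the `time` field."""
    return Counter(doc["time"][:4] for doc in core)


def export_csv(docs, fields):
    """Serialize records to CSV text, standing in for the CSV download."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue()


def import_csv(text):
    """Parse CSV text into records, standing in for the CSV update handler."""
    return list(csv.DictReader(io.StringIO(text)))


def shard_by_year(main_core, fields=("time", "id")):
    """Move records from main_core into one in-memory core per year."""
    year_cores = {}
    for year in sorted(facet_years(main_core)):
        docs = [d for d in main_core if d["time"].startswith(year)]
        # Upload first and delete afterwards: if the process stops between
        # the two steps, the year's data is still present in the main core.
        year_cores[f"statistics-{year}"] = import_csv(export_csv(docs, list(fields)))
        main_core[:] = [d for d in main_core if not d["time"].startswith(year)]
    return year_cores


# Hypothetical usage events; real records carry many more fields.
main = [
    {"time": "2011-05-01T10:00:00Z", "id": "a"},
    {"time": "2012-01-15T08:30:00Z", "id": "b"},
    {"time": "2011-12-31T23:59:59Z", "id": "c"},
]
cores = shard_by_year(main)
```

Deleting a year from the main core only after its upload has finished is what makes the routine resumable: a rerun simply finds the remaining years in the facet and continues from there.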