Solr Core Management

Overview

The Solr Core Management script is a utility developed to simplify and standardize the process of exporting data from, and importing data into, the Solr cores used by DSpace (such as statistics or audit).

This tool is particularly useful for maintenance, data migration, and backup/restore operations, as it provides a consistent way to handle large Solr indexes without requiring direct Solr administrative access.

In production environments, Solr cores can grow significantly over time. Full exports or imports of these cores can be time-consuming and resource-intensive. To address this, the script includes incremental export capabilities and multi-threaded processing, allowing administrators to manage data more efficiently.

...

Why This Script Is Used

DSpace installations often rely heavily on Solr for usage statistics and audit tracking. However, managing Solr cores manually through Solr’s admin interface or API can be problematic, especially when dealing with large indexes or multi-year data retention.

This script was designed to:

  • Automate routine Solr maintenance tasks, such as periodic exports for backup or archival.

  • Enable partial exports/imports based on time increments (e.g., by week, month, or year), making it possible to handle data in smaller, more manageable chunks.

  • Support parallel processing to improve performance during heavy operations or when handling large amounts of data.

  • Standardize import/export formats (CSV or JSON).

In short, this tool allows administrators to safely back up, migrate, or rebuild Solr cores without disrupting DSpace operations or requiring direct low-level Solr commands.
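
For example, routine weekly backups can be scheduled with a standard cron entry. The schedule, installation path, and backup directory below are illustrative and should be adapted to the local installation:

0 2 * * 0 /dspace/bin/dspace solr-core-management --mode export --core statistics --directory /backups/solr/statistics --format csv --increment WEEK

This would run a weekly CSV export of the statistics core every Sunday at 02:00.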

Usage

The script can be executed through the DSpace command-line interface (it is also available from the UI, but administrative permission is required to run it):

Parameters

./dspace solr-core-management [options]
         -m <mode:{export|import}>
         -c <core:{audit|statistics|...}>
         -d <directory>
         [-f <format:{csv|json}>]
         [-t <threads:integer>=1]
         [-s <start-date:yyyy-MM-dd>]
         [-e <end-date:yyyy-MM-dd>]
         [-i <increment:{WEEK|MONTH|YEAR}>]
         [-h]


Parameter Description

 

Parameter         Required  Description
-m, --mode        Yes       Operation mode: either export or import.
-c, --core        Yes       Name of the Solr core to manage (e.g., statistics, authority, audit).
-d, --directory   Yes       Directory where exported data will be stored or imported from.
-f, --format      No        File format for export/import. Supported formats: csv (default) or json.
-t, --threads     No        Number of threads used for parallel processing (default: 1).
-s, --start-date  No        Start date (in yyyy-MM-dd format) for time-based filtering during export.
-e, --end-date    No        End date (in yyyy-MM-dd format) for time-based filtering during export.
-i, --increment   No        Split the export into time-based chunks: WEEK, MONTH, or YEAR (default: MONTH). Useful for very large datasets.
-h, --help        No        Displays help and usage information.


...

Examples

Export Example

./dspace solr-core-management --mode export --core audit --directory /tmp/export --format csv --threads 4 --increment WEEK

This command exports the content of the audit core into the directory /tmp/export, splitting data by weekly increments.
The export is performed in CSV format, using 4 parallel threads for faster processing.

Incremental export is useful when the Solr core contains a large volume of records: exporting weekly chunks, for example, avoids producing a single massive file and makes it possible to resume the operation after a partial failure.
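
For instance, if a long export fails partway through, the affected period can be re-exported on its own by combining the date-range options with an increment (the dates below are illustrative):

./dspace solr-core-management --mode export --core audit --directory /tmp/export --format csv --threads 4 --increment WEEK --start-date 2024-01-01 --end-date 2024-03-31

This restricts the export to the given range, so periods that were already exported successfully do not need to be processed again.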

...

Import Example

./dspace solr-core-management --mode import --core audit --directory /tmp/export --format csv --threads 2

This command imports previously exported data (from /tmp/export) back into the audit Solr core.
It uses 2 threads to parallelize document ingestion and supports the same format used during export (csv or json).

This operation is typically used when:

  • Rebuilding a Solr core after corruption or for reindexing purposes.

  • Migrating Solr data between environments (e.g., production → test).
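
For the migration case, a typical workflow is to export on the source system, transfer the files, and import on the target. The hostnames and paths below are hypothetical:

# On the production server: export the core
./dspace solr-core-management --mode export --core audit --directory /tmp/solr-migration --format csv --threads 4

# Transfer the exported files to the test server
rsync -av /tmp/solr-migration/ test-server:/tmp/solr-migration/

# On the test server: import into the same core
./dspace solr-core-management --mode import --core audit --directory /tmp/solr-migration --format csv --threads 2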

...

Best Practices

  • It is best to stop DSpace activity before performing imports, to avoid inconsistencies
    (how critical this is depends on the core and the data involved).

  • Run exports with multiple threads when working with large datasets to reduce execution time.
    Be aware that multi-threaded execution can place a significantly higher workload on the Solr installation.

...

Note

The export process is designed to operate over date ranges rather than a single continuous dataset.
This approach serves two purposes:

  1. It makes data more manageable and modular, allowing administrators to back up or transfer only specific time periods (e.g., weekly or monthly exports).

  2. It avoids the need for deep pagination over very large result sets, which would require Solr to maintain an explicit sort order (sort=<sort-field>), significantly increasing memory usage and query time.

By splitting exports into smaller date-based ranges, the process minimizes Solr load, reduces the risk of timeouts, and ensures that each export segment can be completed efficiently even on heavily populated cores.
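
As a rough illustration of this idea (not necessarily the exact request the script issues, and assuming the core stores document timestamps in a time field, as the DSpace statistics core does), a one-month chunk corresponds to a simple bounded range query:

curl -g 'http://localhost:8983/solr/statistics/select?q=time:[2024-01-01T00:00:00Z%20TO%202024-02-01T00:00:00Z]&rows=1000&wt=json'

Because every range is bounded, pagination within each chunk stays shallow, and Solr never has to sort and skip past millions of preceding documents to reach the requested page.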