Overview

Info: Only Applied to DSpace pre-1.4

<?xml version="1.0" encoding="utf-8"?>
<html>

...

The conclusions of the following analysis have been applied in the

...

1.4.2 of DSpace. Large performance improvements can be obtained by running PostgreSQL VACUUM/ANALYZE after a large batch import.
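As a rough illustration of the VACUUM/ANALYZE step, the following Python sketch issues the command through a database connection once the import has been committed. It assumes a psycopg2-style connection object (one with an `autocommit` attribute); the helper name `vacuum_analyze` and the connection details in the usage comment are hypothetical and not part of DSpace.

```python
def vacuum_analyze(conn):
    """Run VACUUM ANALYZE on the whole database behind `conn`.

    Assumption: `conn` is a psycopg2-style connection. PostgreSQL refuses
    to run VACUUM inside a transaction block, so autocommit is switched on
    for the call and restored afterwards. Call this only after the batch
    import has been committed.
    """
    previous = conn.autocommit
    conn.autocommit = True
    try:
        cur = conn.cursor()
        try:
            # Reclaims dead rows and refreshes the planner statistics.
            cur.execute("VACUUM ANALYZE")
        finally:
            cur.close()
    finally:
        conn.autocommit = previous


# Hypothetical usage after a batch import (placeholder connection details):
#   import psycopg2
#   conn = psycopg2.connect(dbname="dspace", user="dspace")
#   vacuum_analyze(conn)
```

Running the command without a table name covers every table in the database, which is the simplest choice when the import touches many tables.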

Over the course of the AIHT project, we noticed that as a DSpace repository grew to very large sizes, ingestion time increased dramatically, resulting in poor performance during the batch import of large numbers of objects. I was given the task of analyzing DSpace's SQL usage, in particular to find places where SQL queries were inappropriately slow. Because of the large, distributed scope of the DSpace project, reconstructing and directly analyzing the SQL queries made by the repository is a slow and inexact science: ingesting a single object alone invokes around three hundred queries. On the other hand, because DSpace makes such heavy use of the database during batch ingestion, profiling combined with amortized analysis is a useful strategy for learning which kinds of SQL queries need special attention.
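The profiling approach described above can be sketched in Python as follows. This is a minimal illustration, not the tooling used in the original analysis: it assumes the query log has already been parsed into `(sql, elapsed_ms)` pairs, and the function names and the sample table names are hypothetical. Queries that differ only in their literal values are grouped under one template, and templates are ranked by total time so that cheap but very frequent queries surface alongside slow ones.

```python
import re
from collections import defaultdict

# Matches quoted string literals and standalone numbers. Digits embedded in
# identifiers (for example item2bundle) are left untouched.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def normalize(sql):
    """Replace literals with '?' and collapse whitespace."""
    return " ".join(_LITERAL.sub("?", sql).split())


def profile(records):
    """Aggregate (sql, elapsed_ms) pairs into per-template totals.

    Returns a list of (template, calls, total_ms, mean_ms) tuples, with the
    template that consumed the most total time first.
    """
    stats = defaultdict(lambda: [0, 0.0])
    for sql, elapsed_ms in records:
        entry = stats[normalize(sql)]
        entry[0] += 1
        entry[1] += elapsed_ms
    rows = [(t, n, total, total / n) for t, (n, total) in stats.items()]
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ("SELECT * FROM item WHERE item_id = 17", 0.5),
        ("SELECT * FROM item WHERE item_id = 42", 1.5),
        ("SELECT * FROM bitstream WHERE name = 'a.pdf'", 12.0),
    ]
    for template, calls, total_ms, mean_ms in profile(sample):
        print(f"{total_ms:8.1f} ms  {calls:5d} calls  {mean_ms:6.2f} ms/call  {template}")
```

Sorting by total rather than mean time reflects the amortized view: with hundreds of queries per ingested object, a fast query issued many times can cost more overall than a slow one issued once.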

...

As benchmarking reveals, simply adding these indices yields roughly a 9.8-fold improvement in a repository of 45 thousand items. This speedup will clearly vary with the size of the repository, having little effect on small repositories but a substantial one on large-scale deployments. It is unlikely that further indexing would benefit DSpace ingest times; the most expensive call is now "SELECT 1", which is used extensively to validate connections and typically returns in under a millisecond. Extracting further speedup from DSpace ingestion would thus require the much more tedious task of refactoring the codebase so that it dispatches fewer queries.</html>