...

As our DSpace deployments get bigger and bigger, we inevitably run into scaling issues. Please add any you find here, or discuss how to address them.

Hint: When trying to find what's causing a problem (e.g. long processing time), you can find out a lot by raising your DSpace logging level to DEBUG. This increases the amount of information saved to dspace/log/dspace.log. See TechnicalFaq - "Setting logging level up to DEBUG".
See also: HowToPerformanceTuneForDspace.


...

object instantiated loads in logo bitstream, any metadata template (for new submissions), and associated e-person groups (workflow steps, submitters, admins).

From DSpace@Cambridge

Our DSpace@Cambridge archive reached over 100,000 items last week, and
we're running into serious performance problems (actually, we've seen
performance degrade since we had a couple of tens of thousands of items).
I'll give an overview of some of the issues, and where possible, the
workarounds we devised for them.
A lot of these problems stem from the design of the database; some seem to
result from leaks in the code or rather inefficient database queries being
run. We've run PostgreSQL on bigger databases than this, and are quite
certain that it isn't the cause.
(I'd already run an analysis of the database queries performed by DSpace a
while back, and all indices that might help have already been created.)

Memory/DB pool leak

System running out of memory and/or the Commons DBCP database pool
being exhausted - this typically happens while being indexed by search engine
crawlers from Google/MSN/Yahoo/etc.
This happened to us well before we reached 100,000 items. The reason seems
to be in the design of the browse pages: we suspect that there's a
database pool connection leak here, but these pages also perform a lot of
queries, and/or are inherently uncacheable. For example, if you look at an
item, then go to the "browse by author" page, the author of the item you
last looked at will be highlighted in the browse page. To search engines,
that means that these browse pages are different from the ones they've
cached when following the "browse by author" links from the DSpace front
page, essentially creating a new set of browse pages for every item. As
you can imagine, this causes a huge load on the system.
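
To make that multiplication concrete, here is a minimal Python sketch. The URL layout and the `highlight` query parameter are hypothetical stand-ins, not DSpace's actual URL scheme; the only point is that a per-item marker carried in the browse link makes every browse page look distinct to a crawler, and that dropping the marker collapses them again.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name; DSpace's real browse URLs may differ.
HIGHLIGHT_PARAM = "highlight"


def browse_url(page, last_item=None):
    """Build a browse-by-author URL, optionally marking the last-viewed item."""
    query = {"page": page}
    if last_item is not None:
        query[HIGHLIGHT_PARAM] = last_item
    return "/browse-author?" + urlencode(query)


def canonical(url):
    """Drop the highlight marker so equivalent pages share one URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != HIGHLIGHT_PARAM]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def distinct_urls(num_items, num_pages):
    """URLs a crawler could reach by following browse links from every item page."""
    return {
        browse_url(page, item)
        for item in range(num_items)
        for page in range(num_pages)
    }


crawled = distinct_urls(num_items=1000, num_pages=50)
print(len(crawled))                          # 50000 distinct URLs
print(len({canonical(u) for u in crawled}))  # 50 once the marker is dropped
```

With 100,000 items instead of the 1,000 assumed here, the same reasoning gives a hundred times as many crawlable URLs for the same underlying browse pages.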

...

" dumped with some html around
the result rows will tend to be Good Enough to do the trick).

Notes on this issue

This may be out of date. DSpace 1.4.2 does not exhibit obvious database leaks.

OAI harvester

OAI-PMH virtually hangs the system when asked anything. We're
pretty sure this is caused by the database design, which lacks the
structure for the database engine to efficiently perform joins etc. For
now we've disabled the OAI harvesting interface completely.

Batch imports

Batch imports have become excruciatingly slow, and large batches
(several thousand items) can't be run at all, because somewhere there's a
memory leak. We've pretty much stopped importing items for now, since the
importer takes over 8 seconds per item, and that's unworkable (we have at
least 200,000 more items lined up to be imported).
No workaround (yet).
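
As a rough back-of-the-envelope check of why this rate is unworkable, the figures quoted above can be multiplied out. This is only arithmetic on the two numbers in the text (both are lower bounds, "over 8 seconds" and "at least 200,000"), and it assumes the per-item time stays constant, which the memory leak suggests it may not.

```python
SECONDS_PER_ITEM = 8       # "over 8 seconds per item" (lower bound from the text)
PENDING_ITEMS = 200_000    # "at least 200,000 more items" (lower bound)
SECONDS_PER_DAY = 86_400


def import_days(items, seconds_per_item):
    """Days of continuous importing, assuming a constant per-item cost."""
    return items * seconds_per_item / SECONDS_PER_DAY


days = import_days(PENDING_ITEMS, SECONDS_PER_ITEM)
print(f"{days:.1f} days")  # 18.5 days
```

That is more than two and a half weeks of uninterrupted importing as a best case, before accounting for batches that fail partway through.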

Browse pages

Browse pages have become very slow - in part because the database
spends quite a bit of time calculating the "showing items 1-20 of FOO"
text at the top (see earlier notes about database design).
No workaround (yet).
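
For illustration, the two pieces of work behind such a page header can be sketched as follows. This uses an in-memory SQLite database and a hypothetical `item` table, not DSpace's actual schema or PostgreSQL; it only shows that, besides fetching the 20 visible rows, every page view also asks for an exact count of the whole result set. In PostgreSQL an exact `COUNT(*)` generally has to visit the matching rows, so that part grows with the archive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration; not DSpace's actual schema.
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO item (title) VALUES (?)",
    ((f"Item {n:06d}",) for n in range(1, 1001)),
)


def browse_page(conn, offset, size=20):
    """Return the header text and the rows for one browse page."""
    # Fetch just the rows shown on this page.
    rows = conn.execute(
        "SELECT title FROM item ORDER BY title LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()
    # Count every row, on every request, to fill in the "of FOO" part.
    (total,) = conn.execute("SELECT COUNT(*) FROM item").fetchone()
    header = f"showing items {offset + 1}-{offset + len(rows)} of {total}"
    return header, rows


header, rows = browse_page(conn, 0)
print(header)  # showing items 1-20 of 1000
```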

Notes on this issue

As of DSpace 1.5 there have been significant improvements to the end user experience for browsing large datasets. At Imperial College we have seen smooth performance of the front-end browse using this code for up to 122,000 records. The downside is that the time taken by the indexing process grows extremely non-linearly with archive size, and indexing may become unusable as the archive expands.

From DSpace 1.41 @ PoisonCentre.be

42 thousand items imported from PubMed: 175 thousand references to authors, one million references to subjects.
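
To put those totals in per-item terms, here is a small calculation using only the three rounded figures quoted above, so the results are approximate.

```python
ITEMS = 42_000
AUTHOR_REFS = 175_000
SUBJECT_REFS = 1_000_000

authors_per_item = AUTHOR_REFS / ITEMS
subjects_per_item = SUBJECT_REFS / ITEMS
refs_per_item = (AUTHOR_REFS + SUBJECT_REFS) / ITEMS

print(f"{authors_per_item:.1f} author references per item")    # 4.2
print(f"{subjects_per_item:.1f} subject references per item")  # 23.8
print(f"{refs_per_item:.1f} references per item in total")     # 28.0
```

So each imported record carries roughly 28 author and subject references on average, which is why the reference counts dwarf the item count.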

...