Title (goal)
High Volume of Concurrent Ingests
Primary Actor
Submitter
Scope 
Level 
Story

We need to be able to reliably load submission packages at large scale from local network drives. We ingest many thousands of objects as part of a single batch submission, and we would like to be able to ingest the objects in a batch in parallel for higher throughput. The batches will not be part of a single Fedora transaction with rollback. (Our tools will sometimes pause, re-prioritize, and then resume a batch ingest job.)
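Because each item in a batch is independent (no single transaction with rollback is expected), parallel ingest with pause/re-prioritize/resume can be handled client-side. The following is a minimal sketch only, assuming a Fedora 4-style REST endpoint where POSTing content to a container creates a new resource; the container URL, manifest file, and done-log file names are all hypothetical.

# Parallel, resumable batch ingest sketch (assumptions noted above).
import concurrent.futures
import pathlib

import requests

FEDORA_CONTAINER = "http://localhost:8080/rest/batch-001"  # hypothetical container URI
MANIFEST = pathlib.Path("manifest.txt")   # one source file path per line
DONE_LOG = pathlib.Path("done.txt")       # paths already ingested (enables resume)
WORKERS = 8                               # number of parallel ingest threads


def ingest_one(path: str) -> str:
    # POST one file to the container; the repository mints a new child resource.
    with open(path, "rb") as src:
        resp = requests.post(
            FEDORA_CONTAINER,
            data=src,
            headers={"Content-Type": "application/octet-stream"},
            timeout=600,
        )
    resp.raise_for_status()
    return path


def main() -> None:
    done = set(DONE_LOG.read_text().splitlines()) if DONE_LOG.exists() else set()
    todo = [p for p in MANIFEST.read_text().splitlines() if p and p not in done]

    with open(DONE_LOG, "a") as log, \
            concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for finished in pool.map(ingest_one, todo):
            log.write(finished + "\n")  # record progress so a paused job can resume
            log.flush()


if __name__ == "__main__":
    main()

Stopping the process and re-running it later simply re-reads the done log and skips completed items, which matches the pause/re-prioritize/resume behaviour described above.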

To support ongoing collection work, an individual batch submission should take no longer than 2 days to ingest, given sufficient I/O and cluster resources. We anticipate approximately 10 submissions per month, each having 10k items and totaling 1TB. That gives a base ingest load of 100k items and 10TB per month, where each item is one object plus one primary datastream and around 4 smaller datastreams.
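As a quick sanity check (decimal units, 1TB = 10^12 bytes), the 2-day window per submission implies the following sustained rates:

# Sustained rates required to ingest one submission (10k items, 1TB) in 2 days.
ITEMS = 10_000
BYTES = 1e12                # 1TB, decimal
WINDOW_S = 2 * 24 * 3600    # 2 days in seconds

print(f"items per minute: {ITEMS / WINDOW_S * 60:.1f}")   # ~3.5
print(f"MB/s sustained:   {BYTES / WINDOW_S / 1e6:.1f}")  # ~5.8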

Scaling out to meet occasional load:

Our largest anticipated collection next year is a 10TB collection containing many video files. It would likely arrive as perhaps 10 submission packages of 1TB each, each containing hundreds of files. We'd like to be able to scale out our cluster to handle such an oversized collection without disrupting the base ingest load.

That gives a maximum ingest load of 200k items totaling 20TB per month.
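Averaged over a 30-day month, that peak load works out to roughly the following, i.e. about double the base-load rates:

# Average rates implied by the peak month (200k items, 20TB).
ITEMS = 200_000
BYTES = 20e12               # 20TB, decimal
MONTH_S = 30 * 24 * 3600

print(f"items per minute: {ITEMS / MONTH_S * 60:.1f}")    # ~4.6
print(f"MB/s sustained:   {BYTES / MONTH_S / 1e6:.1f}")   # ~7.7
print(f"GB per day:       {BYTES / 30 / 1e9:.0f}")        # ~667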

Notes

Individual files in the video collection example will include HD video files that are several hours long, so some files will reach approximately 250GB (a rough estimate for 4 hours of JP2 at 1080p, 8-bit color, 25fps).
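For reference, the 250GB / 4-hour estimate corresponds to the following average rates (decimal units), i.e. roughly 0.7MB per JP2 frame at 1080p, which is consistent with the rough estimate above:

# Average rates behind the ~250GB estimate for a 4-hour HD video file.
SIZE_B = 250e9              # 250GB, decimal
DURATION_S = 4 * 3600
FPS = 25

print(f"MB/s:     {SIZE_B / DURATION_S / 1e6:.1f}")             # ~17.4
print(f"Mbit/s:   {SIZE_B * 8 / DURATION_S / 1e6:.0f}")         # ~139
print(f"MB/frame: {SIZE_B / (DURATION_S * FPS) / 1e6:.2f}")     # ~0.69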

Comments

  1. We already have some ingest performance data, which could be extrapolated out to compare with the performance targets in this ticket.

  2. Greg Jansen, the "ingest test matrix" rates achieved against the alpha-3 release (linked above by David Wilcox) indicate the following:

    • Note: I am using the numbers from the UNC and UCSD test rows.

    1. Extrapolation based on size:
      100 * 50-MB datastreams (i.e. 5-GB) of content ingested in ~140 seconds.
      This extrapolates into 1-TB in 28,000 seconds, or ~8 hours.
    2. Extrapolation based on number of objects:
      100 objects (or nodes) ingested in ~140 seconds
      This extrapolates into 60,000 nodes (1 item = 1 object + 5 datastreams) in 84,000 seconds, or ~24 hours.

    In either case, if the linear-scaling assumption behind the extrapolation holds, a basic single-server configuration would likely meet the requirement of performing the 60,000-node (or 1-TB) ingest in less than two days.
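    Both extrapolations can be reproduced directly from the measured figures; the short script below (assuming strictly linear scaling, as above) lands both well under the 48-hour target.

# Reproduce the two linear extrapolations from the ~140-second test runs.
RUN_S = 140

# 1. By size: 100 x 50-MB datastreams = 5-GB per run.
gb_per_run = 100 * 50 / 1000
hours_per_tb = (1000 / gb_per_run) * RUN_S / 3600
print(f"1-TB of content: {hours_per_tb:.1f} hours")      # ~7.8

# 2. By node count: 100 nodes per run; 10k items x (1 object + 5 datastreams).
nodes = 10_000 * 6
hours_for_nodes = (nodes / 100) * RUN_S / 3600
print(f"{nodes} nodes:    {hours_for_nodes:.1f} hours")  # ~23.3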

    Would you be able to perform an "at scale" test of your scenario to see if the calculations hold true?