This documentation covers an old release of Fedora.


 

Data Overview

Stanford has a collection of publications (Saltworks) consisting of page images, metadata, and arrangement. It contains 16,712 objects, 655,237 items, and 273 GB of data, with the following distribution:

 



 

 

 

In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system and are associated back to the object by some mechanism (TBD).

Objects contain the following datastreams:

Datastream ID        MIME Type
DC                   text/xml
RELS-EXT             text/xml
extracted_entities   application/xml
location             text/xml
zotero               application/rdf+xml

 

The filesystem holds a variety of files for each object (possibly including some duplicates of data in Fedora), e.g.:

  • 4.0K DC
  •  20K Feigenbaum_00013946-METS.xml
  • 4.0K Feigenbaum_00013946-TEXT.xml
  • 4.0K RELS-EXT
  •  44K bd826tf2716.pdf
  • 4.0K bd826tf2716.txt
  •  72K bd826tf2716_00001.jp2
  • 8.0K bd826tf2716_00001.xml
  •  64K bd826tf2716_00002.jp2
  • 8.0K bd826tf2716_00002.xml
  •  68K bd826tf2716_000BW.jp2
  • 4.0K checksum
  • 4.0K descMetadata
  • 4.0K extracted_entities.xml
  • 4.0K flipbook.json
  • 4.0K flipbook.old
  •    0 location
  •    0 properties
  •    0 stories
  • 4.0K thumb.jpg
  • 4.0K zotero.xml

 

Test 1: Simple Ingest into Fedora 3

For a first test, we're going to ingest all the data from the filesystem into a clean fcrepo3 repository, using the filename as the datastream name.
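As a rough illustration of "filename as the datastream name", the sketch below builds the list of Fedora 3 REST requests for one object directory without sending them. The base URL, the `druid:` PID in the usage note, the managed control group (`controlGroup=M`), and MIME-type guessing from the file extension are all assumptions for illustration; the original test scripts are not shown here and may have worked differently.

```python
import mimetypes
from pathlib import Path
from urllib.parse import quote

# Assumed base URL of the Fedora 3 REST API; adjust for the real deployment.
FEDORA3_BASE = "http://localhost:8080/fedora"


def plan_object_ingest(object_dir, pid, base=FEDORA3_BASE):
    """Return (method, url, file_path) tuples for ingesting one object.

    The first request creates the object; each later request adds one
    datastream whose ID is the file name.
    """
    object_dir = Path(object_dir)
    pid_part = quote(pid, safe=":")
    requests = [("POST", f"{base}/objects/{pid_part}", None)]
    for path in sorted(p for p in object_dir.iterdir() if p.is_file()):
        # Guess a MIME type from the extension; fall back to a generic type.
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        url = (
            f"{base}/objects/{pid_part}/datastreams/{quote(path.name)}"
            f"?controlGroup=M&mimeType={quote(mime, safe='')}"
        )
        requests.append(("POST", url, path))
    return requests
```

Calling `plan_object_ingest("/data/bd826tf2716", "druid:bd826tf2716")` (hypothetical path and PID) would yield one object-creation request followed by one request per file. Keeping the planning step separate from the HTTP calls makes it easy to inspect what would be sent before loading a repository.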

Setup: a clean install of Fedora 3.7.1 with these properties:


 

Tomcat is proxied through an Apache HTTPD server.

 

Using bash:


 

Test 1a: Single-threaded ingest


Ingest rate: 0.2597 objects per second.
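To put that rate in context, the arithmetic below projects how long the full collection would take, assuming (and this is only an assumption) that the measured rate holds constant across all 16,712 objects. Under that assumption the total comes to roughly 17.9 hours.

```python
TOTAL_OBJECTS = 16712          # object count from the data overview
OBJECTS_PER_SECOND = 0.2597    # measured single-threaded ingest rate


def projected_hours(total_objects, objects_per_second):
    """Hours needed to ingest total_objects at a constant rate."""
    return total_objects / objects_per_second / 3600


hours = projected_hours(TOTAL_OBJECTS, OBJECTS_PER_SECOND)
print(f"Projected single-threaded ingest time: {hours:.1f} hours")
```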

Test 1b: Single-threaded iteration

 

Retrieve object profile
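A minimal sketch of such an iteration, assuming the Fedora 3 REST endpoint for object profiles (`GET /objects/{pid}?format=xml`) and an assumed base URL. The `fetch` parameter is a hypothetical hook so the loop can be exercised without a running repository; the original test may have measured things differently.

```python
import time
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the Fedora 3 REST API.
FEDORA3_BASE = "http://localhost:8080/fedora"


def profile_url(pid, base=FEDORA3_BASE):
    """URL of the XML object profile for one PID."""
    return f"{base}/objects/{quote(pid, safe=':')}?format=xml"


def http_get(url):
    """Fetch a URL and return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def iterate_profiles(pids, fetch=http_get):
    """Fetch each profile in turn; return (count, objects_per_second)."""
    start = time.perf_counter()
    count = 0
    for pid in pids:
        fetch(profile_url(pid))
        count += 1
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate
```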


 

Test 1c: 8-thread ingest test
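One way to run an ingest across 8 threads is a fixed-size thread pool, sketched below. The `ingest_one` callable is a hypothetical placeholder for whatever performs a single object's ingest; the original test harness is not shown, so this illustrates the general approach rather than the script actually used.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ingest_all(object_ids, ingest_one, workers=8):
    """Run ingest_one over object_ids on a thread pool.

    Returns (results, objects_per_second); results keep the input order.
    """
    object_ids = list(object_ids)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() raises the first exception from any worker when consumed.
        results = list(pool.map(ingest_one, object_ids))
    elapsed = time.perf_counter() - start
    rate = len(results) / elapsed if elapsed > 0 else float("inf")
    return results, rate
```

Because the work is dominated by HTTP and disk I/O rather than CPU, threads are usually sufficient here; comparing the returned rate against the single-threaded figure shows how much the repository benefits from concurrent clients.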

 


 

 

Test 1d: Multi-threaded iteration test


 

Test 2: Simple Ingest into Fedora 4

Ingest all the data into fcrepo4 as binaries on containers.

Using jgroups configuration at https://gist.github.com/cbeer/fd3997e40fe014eab071

Using curl:

Test 2a: Ingest all the data as containers and binaries, one at a time

 

Test 2b: Ingest all the data as containers arranged in a druid tree
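The exact tree layout used in this test is not recorded here. The sketch below assumes the common Stanford druid-tree convention, in which a druid such as `bd826tf2716` (two letters, three digits, two letters, four digits) is split into four nested path segments. The `/rest` root is also an assumption.

```python
import re

# Assumed druid shape: 2 letters, 3 digits, 2 letters, 4 digits.
DRUID_PATTERN = re.compile(r"^([a-z]{2})(\d{3})([a-z]{2})(\d{4})$")


def druid_tree_path(druid, root="/rest"):
    """Map a druid to a nested container path, e.g. /rest/bd/826/tf/2716."""
    bare = druid.split(":", 1)[-1]  # tolerate an optional "druid:" prefix
    match = DRUID_PATTERN.match(bare)
    if match is None:
        raise ValueError(f"not a druid: {druid!r}")
    return "/".join([root.rstrip("/"), *match.groups()])
```

Spreading objects across intermediate containers this way keeps any single container from accumulating thousands of direct children, which is the usual motivation for a tree layout over a flat one.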

 


 

Ingest speed over time

Test 2c: Ingest all the data as containers in a druid tree AND use fcr:batch


 

Test 2d: Use a 4-node cluster to do a druid-tree ingest



Test 3: Realistic Ingest into Fedora 3

Ingest all the data into fcrepo3 making reasonable content modeling assumptions:

 - each page as an object

 - ? 

Using ActiveFedora:

Test 4: Realistic Ingest into Fedora 4

  • Add RDF as properties on containers
  • Each page as an ordered same-name sibling on a container

 

Using ldp-client:

 

 
