Data Overview
Stanford has a collection of publications consisting of page images, metadata, and arrangement (Saltworks), containing 16712 objects/655237 items/273GB of data with the following distribution:
> quantile(file_sizes$size, c(0, .5, .7, .9, 1)) 0% 50% 70% 90% 100% 0 43447 195719 1835010 288032768 > quantile(object_size$V3, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 51302 4680240 10743517 39806094 72375144 221829327 1230705287
> quantile(file_counts$X1, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 7.00 22.00 29.00 62.00 99.40 280.08 1478.00
In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system and (somehow associated back to the object.. TBD).
Objects contain the following datastreams:
Datastream ID | MIME Type | |
| text/xml | |
| text/xml | |
| application/xml | |
| text/xml | |
| application/rdf+xml |
On the filesystem are a variety of files (including some duplicates of data in fedora?), e.g.:
- 4.0K DC
- 20K Feigenbaum_00013946-METS.xml
- 4.0K Feigenbaum_00013946-TEXT.xml
- 4.0K RELS-EXT
- 44K bd826tf2716.pdf
- 4.0K bd826tf2716.txt
- 72K bd826tf2716_00001.jp2
- 8.0K bd826tf2716_00001.xml
- 64K bd826tf2716_00002.jp2
- 8.0K bd826tf2716_00002.xml
- 68K bd826tf2716_000BW.jp2
- 4.0K checksum
- 4.0K descMetadata
- 4.0K extracted_entities.xml
- 4.0K flipbook.json
- 4.0K flipbook.old
- 0 location
- 0 properties
- 0 stories
- 4.0K thumb.jpg
- 4.0K zotero.xml
Test 1: Simple Ingest into Fedora 3
For a first test, we're going to ingest all the data from the filesystem into a clean fcrepo3 repository, using the filename as the datastream name.
Using Fedora 3.7.1, clean install, using these properties:
database=mysql database.driver=included database.jdbcDriverClass=com.mysql.jdbc.Driver database.mysql.jdbcDriverClass=com.mysql.jdbc.Driver database.mysql.driver=included database.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true database.mysql.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true database.username=fedora database.password=redacted install.type=custom deploy.local.services=false install.tomcat=false servlet.engine=existingTomcat fedora.home=/home/lyberadmin/apps/fedora/home fedora.serverHost=sul-fedora-dev-a.stanford.edu fedora.serverContext=fedora tomcat.http.port=8080 tomcat.shutdown.port=8005 ssl.available=true tomcat.ssl.port=8443 tomcat.home=/usr/share/tomcat6 ri.enabled=true messaging.enabled=false messaging.uri= apim.ssl.required=false apia.ssl.required=false apia.auth.required=false fesl.authz.enabled=false fesl.authn.enabled=true xacml.enabled=false keystore.file=included
Tomcat is proxied through an Apache HTTPD server.
Using bash:
#!/bin/bash base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora" RuntimePrint() { duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)"|bc|sed 's/^\./0./') echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec" echo -e "${objectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time } CreateObject() { pid="druid:$1" curl -X POST "$base_url/objects/$pid" &> /dev/null cd /data-ro/assets/$1 for f in $( ls ); do datastreams=$[$datastreams+1] size=$[$size+`stat -c "%s" $f`] curl -X POST --data-binary @$f "$base_url/objects/$pid/datastreams/$f?controlGroup=M" &> /dev/null done cd /data } BenchmarkObject() { objectId=$1 if [ -d /data-ro/assets/$objectId ]; then m1t=$(date +%s%N); m1l=$LINENO CreateObject $objectId m2t=$(date +%s%N); m2l=$LINENO; RuntimePrint fi } export -f BenchmarkObject export -f CreateObject export -f RuntimePrint export base_url cat - | parallel -P $THREADS --env _ BenchmarkObject
Test 1a: Single-threaded ingest
> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.32300 1.32700 2.16100 6.30100 11.10600 36.88524 338.28700
Test 1b: Single-threaded iteration
Retrieve object profile
> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.002 0.054 0.062 0.077 0.089 0.123 3.580
Test 1c: 8-thread ingest test
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.71400 3.46600 5.09440 14.32520 23.55560 72.11128 632.71200 2206.24user 5396.22system 4:29:58elapsed 46%CPU (0avgtext+0avgdata 1133728maxresident)k 584337920inputs+3216976outputs (67566major+823430581minor)pagefaults 0swaps Tue Nov 19 19:09:08 PST 2013 : 16693 objects 1.031 objects/s (objects per second)
Test 1d: Multi-threaded iteration test
4 threads: > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.011 0.013 0.014 0.017 0.020 0.031 0.073 Tue Nov 19 14:08:44 PST 2013 : retrieving all objects 160.65user 247.90system 2:42.00elapsed 252%CPU (0avgtext+0avgdata 40736maxresident)k 0inputs+267608outputs (0major+63796227minor)pagefaults 0swaps Tue Nov 19 14:11:26 PST 2013 : 16693 objects Tue Nov 19 14:11:26 PST 2013 : done 103 objects/s (objects per second) 8 threads: > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.011 0.022 0.025 0.031 0.034 0.045 0.093 Tue Nov 19 14:05:10 PST 2013 : retrieving all objects 159.54user 251.87system 2:28.86elapsed 276%CPU (0avgtext+0avgdata 40880maxresident)k 0inputs+267608outputs (0major+63891890minor)pagefaults 0swaps Tue Nov 19 14:07:39 PST 2013 : 16693 objects Tue Nov 19 14:07:39 PST 2013 : done 112.1 objects/s (objects per second) 16 threads: > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.012 0.024 0.028 0.035 0.040 0.052 0.149 Tue Nov 19 14:11:50 PST 2013 : retrieving all objects 161.64user 264.68system 2:30.56elapsed 283%CPU (0avgtext+0avgdata 41104maxresident)k 0inputs+267608outputs (0major+64217330minor)pagefaults 0swaps Tue Nov 19 14:14:20 PST 2013 : 16693 objects Tue Nov 19 14:14:20 PST 2013 : done 110.9 objects/s (objects per second)
Test 2: Simple Ingest into Fedora 4
Ingest all the data into fcrepo4 as datastreams on objects.
Using jgroups configuration at https://gist.github.com/cbeer/fd3997e40fe014eab071
Using curl:
Test 3: Realistic Ingest into Fedora 3
Ingest all the data into fcrepo3 making reasonable content modeling assumptions:
- each page as an object
- ?
Using ActiveFedora:
Test 4: Realistic Ingest into Fedora 4
- add RDF as properties on objects
- Each page as a ordered same-name sibling on an object
Using ldp-client: