Data Overview
Stanford has a collection of publications (Saltworks) consisting of page images, metadata, and arrangement. It contains 16712 objects, 655237 items, and 273 GB of data, with the following distribution:
...
Code Block |
---|
> quantile(file_sizes$size, c(0, .5, .7, .9, 1))
0% 50% 70% 90% 100%
0 43447 195719 1835010 288032768
> quantile(object_size$V3, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
51302 4680240 10743517 39806094 72375144 221829327 1230705287 |
Code Block |
---|
> quantile(file_counts$X1, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 100%
1 22 28 62 1478
95% 99% 100%
7.00 22.00 29.00 62.00 99.40 280.08 1478.00 |
In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system and associated back to the object by a mechanism that is still to be determined (TBD).
...
- 4.0K DC
- 20K Feigenbaum_00013946-METS.xml
- 4.0K Feigenbaum_00013946-TEXT.xml
- 4.0K RELS-EXT
- 44K bd826tf2716.pdf
- 4.0K bd826tf2716.txt
- 72K bd826tf2716_00001.jp2
- 8.0K bd826tf2716_00001.xml
- 64K bd826tf2716_00002.jp2
- 8.0K bd826tf2716_00002.xml
- 68K bd826tf2716_000BW.jp2
- 4.0K checksum
- 4.0K descMetadata
- 4.0K extracted_entities.xml
- 4.0K flipbook.json
- 4.0K flipbook.old
- 0 location
- 0 properties
- 0 stories
- 4.0K thumb.jpg
- 4.0K zotero.xml
Test 1: Simple Ingest into Fedora 3
For a first test, we're going to ingest all the data from the filesystem into a clean fcrepo3 repository, using the filename as the datastream name.
(Using Fedora 3.7.1, clean install, using these properties:
Code Block |
---|
database=mysql
database.driver=included
database.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.driver=included
database.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.mysql.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.username=fedora
database.password=redacted
install.type=custom
deploy.local.services=false
install.tomcat=false
servlet.engine=existingTomcat
fedora.home=/home/lyberadmin/apps/fedora/home
fedora.serverHost=sul-fedora-dev-a.stanford.edu
fedora.serverContext=fedora
tomcat.http.port=8080
tomcat.shutdown.port=8005
ssl.available=true
tomcat.ssl.port=8443
tomcat.home=/usr/share/tomcat6
ri.enabled=true
messaging.enabled=false
messaging.uri=
apim.ssl.required=false
apia.ssl.required=false
apia.auth.required=false
fesl.authz.enabled=false
fesl.authn.enabled=true
xacml.enabled=false
keystore.file=included |
)
Tomcat is proxied through an Apache HTTPD server.
Using bash (single-threaded) with curl:
Code Block |
---|
#!/bin/bash
base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora"

# Print one result row (object id, datastream count, bytes, seconds) and append it to the log.
RuntimePrint() {
  duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)"|bc|sed 's/^\./0./')
  echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec"
  echo -e "${objectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time
}

# Create the object, then add every file in its asset directory as a managed datastream
# named after the file.
CreateObject() {
  pid="druid:$1"
  curl -X POST "$base_url/objects/$pid" &> /dev/null
  cd /data-ro/assets/$1
  for f in $( ls ); do
    datastreams=$[$datastreams+1]
    size=$[$size+`stat -c "%s" $f`]
    curl -X POST --data-binary @$f "$base_url/objects/$pid/datastreams/$f?controlGroup=M" &> /dev/null
  done
  cd /data
}

# Time the creation of a single object, in nanoseconds, if its asset directory exists.
BenchmarkObject() {
  objectId=$1
  if [ -d /data-ro/assets/$objectId ]; then
    m1t=$(date +%s%N); m1l=$LINENO
    CreateObject $objectId
    m2t=$(date +%s%N); m2l=$LINENO;
    RuntimePrint
  fi
}

export -f BenchmarkObject
export -f CreateObject
export -f RuntimePrint
export base_url

# Object ids are read from stdin, one per line; THREADS sets the number of parallel jobs.
cat - | parallel -P $THREADS --env _ BenchmarkObject |
Test 1a: Single-threaded ingest
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.39300 1.46300 2.31500 6.64700 11.65780 36.77232 353.48700 |
0.2597 objects/s (objects per second)
Test 1b: Single-threaded iteration
Retrieve object profile
Code Block |
---|
> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.002 0.054 0.062 0.077 0.089 0.123 3.580 |
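A request like the one timed above might be issued as follows. This is only a minimal sketch: it assumes the object profile was fetched with curl through the Fedora 3 REST API (`GET /objects/{pid}`), and that `base_url` has the same value as in the ingest script. The function names `profile_url` and `time_profile` are hypothetical and do not come from the original test harness.

```bash
#!/bin/bash
# Assumption: same base URL as the ingest script in Test 1.
base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora"

# Build the object profile URL for a bare druid (without the "druid:" prefix).
profile_url() {
  echo "$base_url/objects/druid:$1?format=xml"
}

# Fetch the profile, discard the body, and print "<objectId> <seconds>".
# curl's %{time_total} write-out variable reports the total request time.
time_profile() {
  local t
  t=$(curl -s -o /dev/null -w '%{time_total}' "$(profile_url "$1")")
  echo "$1 $t"
}
```

Calling `time_profile bd826tf2716` once per object id and collecting the output would yield a two-column file whose second column holds the per-request time, which is the shape the `quantile(data$V2, ...)` call expects.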
...
Test 1c: 8-thread ingest test
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.71400 3.46600 5.09440 14.32520 23.55560 72.11128 632.71200
2206.24user 5396.22system 4:29:58elapsed 46%CPU (0avgtext+0avgdata 1133728maxresident)k
584337920inputs+3216976outputs (67566major+823430581minor)pagefaults 0swaps
Tue Nov 19 19:09:08 PST 2013 : 16693 objects
1.031 objects/s (objects per second) |
Test 1d: Multi-threaded iteration test
Code Block |
---|
4 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.011 0.013 0.014 0.017 0.020 0.031 0.073
Tue Nov 19 14:08:44 PST 2013 : retrieving all objects
160.65user 247.90system 2:42.00elapsed 252%CPU (0avgtext+0avgdata 40736maxresident)k
0inputs+267608outputs (0major+63796227minor)pagefaults 0swaps
Tue Nov 19 14:11:26 PST 2013 : 16693 objects
Tue Nov 19 14:11:26 PST 2013 : done
103 objects/s (objects per second)
8 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.011 0.022 0.025 0.031 0.034 0.045 0.093
Tue Nov 19 14:05:10 PST 2013 : retrieving all objects
159.54user 251.87system 2:28.86elapsed 276%CPU (0avgtext+0avgdata 40880maxresident)k
0inputs+267608outputs (0major+63891890minor)pagefaults 0swaps
Tue Nov 19 14:07:39 PST 2013 : 16693 objects
Tue Nov 19 14:07:39 PST 2013 : done
112.1 objects/s (objects per second)
16 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.012 0.024 0.028 0.035 0.040 0.052 0.149
Tue Nov 19 14:11:50 PST 2013 : retrieving all objects
161.64user 264.68system 2:30.56elapsed 283%CPU (0avgtext+0avgdata 41104maxresident)k
0inputs+267608outputs (0major+64217330minor)pagefaults 0swaps
Tue Nov 19 14:14:20 PST 2013 : 16693 objects
Tue Nov 19 14:14:20 PST 2013 : done
110.9 objects/s (objects per second) |
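The objects/s figures in Tests 1c and 1d are consistent with dividing the object count by the elapsed wall-clock time reported by `time`. The sketch below reproduces that arithmetic. The helper name `throughput` is hypothetical, and it assumes the elapsed value is in the `[h:]m:s` form shown in the output above.

```bash
#!/bin/bash
# throughput <object count> <elapsed as [h:]m:s>
# Converts the elapsed time to seconds and prints objects per second.
throughput() {
  awk -v n="$1" -v t="$2" 'BEGIN {
    k = split(t, p, ":")                      # "4:29:58" -> 4, 29, 58
    s = 0
    for (i = 1; i <= k; i++) s = s * 60 + p[i]
    printf "%.3f\n", n / s
  }'
}

throughput 16693 4:29:58   # Test 1c, 8-thread ingest: prints 1.031
throughput 16693 2:42.00   # Test 1d, 4-thread read:   prints 103.043
```

For example, 4:29:58 is 16198 seconds, and 16693 / 16198 ≈ 1.031 objects/s, which matches the value reported for the 8-thread ingest.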
Test 2: Simple Ingest into Fedora 4
Ingest all the data into fcrepo4 as binaries on containers (the Fedora 4 counterparts of datastreams on objects).
Using jgroups configuration at https://gist.github.com/cbeer/fd3997e40fe014eab071
Using curl:
Test 2a: Ingest all the data as containers and binaries, one at a time
Test 2b: Ingest all the data as containers arranged in a druid tree
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.6590 6.1800 9.1844 24.7972 44.7202 226.7644 1094.8120 |
Ingest speed over time
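As an illustration of the druid tree arrangement used in Test 2b, the sketch below splits a druid into nested path segments. It assumes the usual Stanford convention of 2 letters / 3 digits / 2 letters / 4 digits. The function name `druid_tree` is hypothetical, the validation is deliberately loose, and the exact layout used in the test may differ (for example, it may add a final segment containing the full druid).

```bash
#!/bin/bash
# druid_tree <druid>
# Prints the tree path for a druid, with or without the "druid:" prefix.
druid_tree() {
  local id="${1#druid:}"
  # Loose shape check: two letters, three digits, two letters, four digits.
  if [[ ! "$id" =~ ^[a-z]{2}[0-9]{3}[a-z]{2}[0-9]{4}$ ]]; then
    echo "not a druid: $1" >&2
    return 1
  fi
  echo "${id:0:2}/${id:2:3}/${id:5:2}/${id:7:4}"
}

druid_tree bd826tf2716   # prints bd/826/tf/2716
```

Spreading containers across a tree like this keeps the number of children under any single parent small, compared with placing all objects directly under one node.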
Test 2c: Ingest all the data as containers in a druid tree AND use fcr:batch
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.4670 5.2440 7.8554 20.3558 33.8186 101.2618 711.0130 |
Test 2d: Use a 4-node cluster to do a druid-tree ingest
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.9050 9.7360 13.9442 29.6692 48.0332 146.0208 1109.1760 |
Using curl:
Test 3: Realistic Ingest into Fedora 3
Ingest all the data into fcrepo3 making reasonable content modeling assumptions:
- each page as an object
- ?
Using ActiveFedora:
Test 4: Realistic Ingest into Fedora 4
- Add RDF as properties on object resources
- Each page as an ordered same-name sibling on an object container
Using ldp-client: