Stanford has a collection of publications consisting of page images, metadata, and arrangement (Saltworks), containing 16712 objects/655237 items/273GB of data with the following distribution:
> quantile(file_sizes$size, c(0, .5, .7, .9, 1)) 0% 50% 70% 90% 100% 0 43447 195719 1835010 288032768 > quantile(object_size$V3, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 51302 4680240 10743517 39806094 72375144 221829327 1230705287 |
> quantile(file_counts$X1, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 7.00 22.00 29.00 62.00 99.40 280.08 1478.00 |
In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system and (somehow associated back to the object.. TBD).
Objects contain the following datastreams:
Datastream ID | MIME Type | |
| text/xml | |
| text/xml | |
| application/xml | |
| text/xml | |
| application/rdf+xml |
On the filesystem are a variety of files (including some duplicates of data in fedora?), e.g.:
For a first test, we're going to ingest all the data from the filesystem into a clean fcrepo3 repository, using the filename as the datastream name.
Using Fedora 3.7.1, clean install, using these properties:
database=mysql database.driver=included database.jdbcDriverClass=com.mysql.jdbc.Driver database.mysql.jdbcDriverClass=com.mysql.jdbc.Driver database.mysql.driver=included database.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true database.mysql.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true database.username=fedora database.password=redacted install.type=custom deploy.local.services=false install.tomcat=false servlet.engine=existingTomcat fedora.home=/home/lyberadmin/apps/fedora/home fedora.serverHost=sul-fedora-dev-a.stanford.edu fedora.serverContext=fedora tomcat.http.port=8080 tomcat.shutdown.port=8005 ssl.available=true tomcat.ssl.port=8443 tomcat.home=/usr/share/tomcat6 ri.enabled=true messaging.enabled=false messaging.uri= apim.ssl.required=false apia.ssl.required=false apia.auth.required=false fesl.authz.enabled=false fesl.authn.enabled=true xacml.enabled=false keystore.file=included |
Tomcat is proxied through an Apache HTTPD server.
Using bash:
#!/bin/bash base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora" RuntimePrint() { duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)"|bc|sed 's/^\./0./') echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec" echo -e "${objectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time } CreateObject() { pid="druid:$1" curl -X POST "$base_url/objects/$pid" &> /dev/null cd /data-ro/assets/$1 for f in $( ls ); do datastreams=$[$datastreams+1] size=$[$size+`stat -c "%s" $f`] curl -X POST --data-binary @$f "$base_url/objects/$pid/datastreams/$f?controlGroup=M" &> /dev/null done cd /data } BenchmarkObject() { objectId=$1 if [ -d /data-ro/assets/$objectId ]; then m1t=$(date +%s%N); m1l=$LINENO CreateObject $objectId m2t=$(date +%s%N); m2l=$LINENO; RuntimePrint fi } export -f BenchmarkObject export -f CreateObject export -f RuntimePrint export base_url cat - | parallel -P $THREADS --env _ BenchmarkObject |
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.39300 1.46300 2.31500 6.64700 11.65780 36.77232 353.48700 |
0.2597 objects/s (objects per second)
Retrieve object profile
> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.002 0.054 0.062 0.077 0.089 0.123 3.580 |
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.71400 3.46600 5.09440 14.32520 23.55560 72.11128 632.71200 2206.24user 5396.22system 4:29:58elapsed 46%CPU (0avgtext+0avgdata 1133728maxresident)k 584337920inputs+3216976outputs (67566major+823430581minor)pagefaults 0swaps Tue Nov 19 19:09:08 PST 2013 : 16693 objects 1.031 objects/s (objects per second) |
4 threads: > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.011 0.013 0.014 0.017 0.020 0.031 0.073 Tue Nov 19 14:08:44 PST 2013 : retrieving all objects 160.65user 247.90system 2:42.00elapsed 252%CPU (0avgtext+0avgdata 40736maxresident)k 0inputs+267608outputs (0major+63796227minor)pagefaults 0swaps Tue Nov 19 14:11:26 PST 2013 : 16693 objects Tue Nov 19 14:11:26 PST 2013 : done 103 objects/s (objects per second) 8 threads: > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.011 0.022 0.025 0.031 0.034 0.045 0.093 Tue Nov 19 14:05:10 PST 2013 : retrieving all objects 159.54user 251.87system 2:28.86elapsed 276%CPU (0avgtext+0avgdata 40880maxresident)k 0inputs+267608outputs (0major+63891890minor)pagefaults 0swaps Tue Nov 19 14:07:39 PST 2013 : 16693 objects Tue Nov 19 14:07:39 PST 2013 : done 112.1 objects/s (objects per second) 16 threads: > quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.012 0.024 0.028 0.035 0.040 0.052 0.149 Tue Nov 19 14:11:50 PST 2013 : retrieving all objects 161.64user 264.68system 2:30.56elapsed 283%CPU (0avgtext+0avgdata 41104maxresident)k 0inputs+267608outputs (0major+64217330minor)pagefaults 0swaps Tue Nov 19 14:14:20 PST 2013 : 16693 objects Tue Nov 19 14:14:20 PST 2013 : done 110.9 objects/s (objects per second) |
Ingest all the data into fcrepo4 as binaries on containers.
Using jgroups configuration at https://gist.github.com/cbeer/fd3997e40fe014eab071
Using curl:
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.6590 6.1800 9.1844 24.7972 44.7202 226.7644 1094.8120 |
Ingest speed over time
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.4670 5.2440 7.8554 20.3558 33.8186 101.2618 711.0130 |
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.9050 9.7360 13.9442 29.6692 48.0332 146.0208 1109.1760 |
Ingest all the data into fcrepo3 making reasonable content modeling assumptions:
- each page as an object
- ?
Using ActiveFedora:
Using ldp-client: