...
Stanford has a collection of publications consisting of page images, metadata, and arrangement (Saltworks), containing 16712 objects/655237 items/273GB of data with the following distribution:
Code Block |
---|
...
> quantile(file_sizes$size, c(0, .5, .7, .9, 1))
0% 50% 70% 90% 100%
0 43447 195719 1835010 288032768
> quantile(object_size$V3, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
51302 4680240 10743517 39806094 72375144 221829327 1230705287 |
Code Block |
---|
> quantile(file_counts$X1, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
7.00 22.00 29.00 62.00 99.40 280.08 1478.00 |
In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system and (somehow associated back to the object.. TBD).
Objects contain the following datastreams:
Datastream ID | MIME Type | |
| text/xml | |
| text/xml | |
| application/xml | |
| text/xml | |
| application/rdf+xml |
On the filesystem are a variety of files (including some duplicates of data in fedora?), e.g.:
...
Using Fedora 3.7.1, clean install, using these properties:
Code Block |
---|
database=mysql
database.driver=included
database.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.driver=included
database.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.mysql.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.username=fedora
database.password=redacted
install.type=custom
deploy.local.services=false
install.tomcat=false
servlet.engine=existingTomcat
fedora.home=/home/lyberadmin/apps/fedora/home
fedora.serverHost=sul-fedora-dev-a.stanford.edu
fedora.serverContext=fedora
tomcat.http.port=8080
tomcat.shutdown.port=8005
ssl.available=true
tomcat.ssl.port=8443
tomcat.home=/usr/share/tomcat6
ri.enabled=true
messaging.enabled=false
messaging.uri=
apim.ssl.required=false
apia.ssl.required=false
apia.auth.required=false
fesl.authz.enabled=false
fesl.authn.enabled=true
xacml.enabled=false
keystore.file=included |
Tomcat is proxied through an Apache HTTPD server.
Using bash:
Code Block |
---|
#!/bin/bash
base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora"
RuntimePrint()
{
duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)"|bc|sed 's/^\./0./')
echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec"
echo -e "${objectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time
}
CreateObject() {
pid="druid:$1"
curl -X POST "$base_url/objects/$pid" &> /dev/null
cd /data-ro/assets/$1
for f in $( ls ); do
datastreams=$[$datastreams+1]
size=$[$size+`stat -c "%s" $f`]
curl -X POST --data-binary @$f "$base_url/objects/$pid/datastreams/$f?controlGroup=M" &> /dev/null
done
cd /data
}
BenchmarkObject() {
objectId=$1
if [ -d /data-ro/assets/$objectId ]; then
m1t=$(date +%s%N); m1l=$LINENO
CreateObject $objectId
m2t=$(date +%s%N); m2l=$LINENO; RuntimePrint
fi
}
export -f BenchmarkObject
export -f CreateObject
export -f RuntimePrint
export base_url
cat - | parallel -P $THREADS --env _ BenchmarkObject |
Test 1a: Single-threaded ingest
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1)) 0% 50% 70% 90% 95% 99% 100% 0.39300 1.46300 2.31500 6.64700 11.65780 36.77232 353.48700 |
0.2597 objects/s (objects per second)
...
Retrieve object profile
Code Block |
---|
> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.002 0.054 0.062 0.077 0.089 0.123 3.580 |
Test 1c: 8-thread ingest test
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.71400 3.46600 5.09440 14.32520 23.55560 72.11128 632.71200
2206.24user 5396.22system 4:29:58elapsed 46%CPU (0avgtext+0avgdata 1133728maxresident)k
584337920inputs+3216976outputs (67566major+823430581minor)pagefaults 0swaps
Tue Nov 19 19:09:08 PST 2013 : 16693 objects
1.031 objects/s (objects per second) |
Test 1d: Multi-threaded iteration test
Code Block |
---|
4 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.011 0.013 0.014 0.017 0.020 0.031 0.073
Tue Nov 19 14:08:44 PST 2013 : retrieving all objects
160.65user 247.90system 2:42.00elapsed 252%CPU (0avgtext+0avgdata 40736maxresident)k
0inputs+267608outputs (0major+63796227minor)pagefaults 0swaps
Tue Nov 19 14:11:26 PST 2013 : 16693 objects
Tue Nov 19 14:11:26 PST 2013 : done
103 objects/s (objects per second)
8 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.011 0.022 0.025 0.031 0.034 0.045 0.093
Tue Nov 19 14:05:10 PST 2013 : retrieving all objects
159.54user 251.87system 2:28.86elapsed 276%CPU (0avgtext+0avgdata 40880maxresident)k
0inputs+267608outputs (0major+63891890minor)pagefaults 0swaps
Tue Nov 19 14:07:39 PST 2013 : 16693 objects
Tue Nov 19 14:07:39 PST 2013 : done
112.1 objects/s (objects per second)
16 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.012 0.024 0.028 0.035 0.040 0.052 0.149
Tue Nov 19 14:11:50 PST 2013 : retrieving all objects
161.64user 264.68system 2:30.56elapsed 283%CPU (0avgtext+0avgdata 41104maxresident)k
0inputs+267608outputs (0major+64217330minor)pagefaults 0swaps
Tue Nov 19 14:14:20 PST 2013 : 16693 objects
Tue Nov 19 14:14:20 PST 2013 : done
110.9 objects/s (objects per second) |
Test 2: Simple Ingest into Fedora 4
...
Test 2b: Ingest all the data as containers arranged in a druid tree
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.6590 6.1800 9.1844 24.7972 44.7202 226.7644 1094.8120 |
Ingest speed over time
Test 2c: Ingest all the data as containers in a druid tree AND use fcr:batch
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.4670 5.2440 7.8554 20.3558 33.8186 101.2618 711.0130 |
Test 2d: Use a 4-node cluster to do a druid-tree ingest
Code Block |
---|
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
0% 50% 70% 90% 95% 99% 100%
0.9050 9.7360 13.9442 29.6692 48.0332 146.0208 1109.1760 |
Test 3: Realistic Ingest into Fedora 3
...