Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

 

Data Overview

Stanford has a collection of publications consisting of page images, metadata, and arrangement (Saltworks), containing 16712 objects/655237 items/273GB of data with the following distribution:

 

Image Added

Code Block
> quantile(file_sizes$size, c(0, .5, .7, .9, 1))
0% 50% 70% 90% 100%
0 43447 195719 1835010 288032768
 
> quantile(object_size$V3, c(0, .5, .7, .9, .95, .99, 1))
        0%        50%        70%        90%        95%        99%       100% 
     51302    4680240   10743517   39806094   72375144  221829327 1230705287 

Image Added

Code Block
> quantile(file_counts$X1,  c(0, .5, .7, .9, .95, .99, 1))
     0%     50%     70%     90%     95%     99%    100% 
   7.00   22.00   29.00   62.00   99.40  280.08 1478.00

 

 

 

In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system and (somehow associated back to the object.. TBD).

...

  • 4.0K DC
  •  20K Feigenbaum_00013946-METS.xml
  • 4.0K Feigenbaum_00013946-TEXT.xml
  • 4.0K RELS-EXT
  •  44K bd826tf2716.pdf
  • 4.0K bd826tf2716.txt
  •  72K bd826tf2716_00001.jp2
  • 8.0K bd826tf2716_00001.xml
  •  64K bd826tf2716_00002.jp2
  • 8.0K bd826tf2716_00002.xml
  •  68K bd826tf2716_000BW.jp2
  • 4.0K checksum
  • 4.0K descMetadata
  • 4.0K extracted_entities.xml
  • 4.0K flipbook.json
  • 4.0K flipbook.old
  •    0 location
  •    0 properties
  •    0 stories
  • 4.0K thumb.jpg
  • 4.0K zotero.xml

 

Test 1: Simple Ingest into Fedora 3

For a first test, we're going to ingest all the data from the filesystem into a clean fcrepo3 repository, using the filename as the datastream name.

(Using Fedora 3.7.1, clean install, using these properties:

Code Block
database=mysql
database.driver=included
database.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.driver=included
database.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.mysql.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.username=fedora
database.password=redacted
install.type=custom
deploy.local.services=false
install.tomcat=false
servlet.engine=existingTomcat
fedora.home=/home/lyberadmin/apps/fedora/home
fedora.serverHost=sul-fedora-dev-a.stanford.edu
fedora.serverContext=fedora
tomcat.http.port=8080
tomcat.shutdown.port=8005
ssl.available=true
tomcat.ssl.port=8443
tomcat.home=/usr/share/tomcat6
ri.enabled=true
messaging.enabled=false
messaging.uri=
apim.ssl.required=false
apia.ssl.required=false
apia.auth.required=false
fesl.authz.enabled=false
fesl.authn.enabled=true
xacml.enabled=false
keystore.file=included

 

)

Tomcat is proxied through an Apache HTTPD server.

 

Using bash(single-threaded) Using curl:

Code Block
#!/bin/bash
base_url="http://fedoraAdmin:fedoraAdmin@localhost:8080/fedora"
num=0

RuntimePrint()
{
 duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)"|bc|sed 's/^\./0./')
 echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec"
 echo -e "${iobjectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time
}

if [ ! -d /data/tmp ]; then
  mkdir /data/tmp
fi

cd /data/tmp

for iCreateObject() {
    pid="druid:$1"
    curl -X POST "$base_url/objects/$pid" &> /dev/null
    cd /data-ro/assets/$1

    for f in $( cat /data/object_ids ls ); do
 
     numdatastreams=$[$num $datastreams+1]
      echo "($num): $i"
  mkdir /data/tmp/$i

  scp lyberadmin@salt-data-prod:/data/assets/$i/* /data/tmp/$isize=$[$size+`stat -c "%s" $f`]
      curl -X POST --data-binary @$f "$base_url/objects/$pid/datastreams/$f?controlGroup=M"  &> /dev/null
    done
    cd /data
}

BenchmarkObject() {
  objectId=$1
  if [ -d /data-ro/assets/$objectId ]; then
    m1t=$(date +%s%N); m1l=$LINENO
    curl -X POST "$base_url/objects/druid:$i" &> /dev/null
    cd /data/tmp/$i

    for f in $( ls ); do
      ate-datastream-time curl -X POST --data-binary @$f "$base_url/objects/druid:$i/datastreams/$f?controlGroup=M"  &> /dev/null
    done

  m2t=$(date +%s%N); m2l=$LINENO; RuntimePrint
    rm -rf /data/tmp/$i
doneCreateObject $objectId
    m2t=$(date +%s%N); m2l=$LINENO; RuntimePrint
  fi
}

export -f BenchmarkObject
export -f CreateObject
export -f RuntimePrint
export base_url

cat - | parallel -P $THREADS --env _ BenchmarkObject

 

Test 1a: Single-threaded ingest

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
  0.39300   1.46300   2.31500   6.64700  11.65780  36.77232 353.48700 
0.2597 objects/s  (objects per second)

Test 1b: Single-threaded iteration

 

Retrieve object profile

Code Block
> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.002 0.054 0.062 0.077 0.089 0.123 3.580 

 

Test 1c: 8-thread ingest test

 

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
  0.71400   3.46600   5.09440  14.32520  23.55560  72.11128 632.71200 
2206.24user 5396.22system 4:29:58elapsed 46%CPU (0avgtext+0avgdata 1133728maxresident)k
584337920inputs+3216976outputs (67566major+823430581minor)pagefaults 0swaps
Tue Nov 19 19:09:08 PST 2013 : 16693 objects
 
1.031 objects/s  (objects per second)

 

Image Added

Image AddedImage Added

 

Test 1d: Multi-threaded iteration test

Code Block
4 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.011 0.013 0.014 0.017 0.020 0.031 0.073 
Tue Nov 19 14:08:44 PST 2013 : retrieving all objects
160.65user 247.90system 2:42.00elapsed 252%CPU (0avgtext+0avgdata 40736maxresident)k
0inputs+267608outputs (0major+63796227minor)pagefaults 0swaps
Tue Nov 19 14:11:26 PST 2013 : 16693 objects
Tue Nov 19 14:11:26 PST 2013 : done
103 objects/s  (objects per second)
 
8 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.011 0.022 0.025 0.031 0.034 0.045 0.093
Tue Nov 19 14:05:10 PST 2013 : retrieving all objects
159.54user 251.87system 2:28.86elapsed 276%CPU (0avgtext+0avgdata 40880maxresident)k
0inputs+267608outputs (0major+63891890minor)pagefaults 0swaps
Tue Nov 19 14:07:39 PST 2013 : 16693 objects
Tue Nov 19 14:07:39 PST 2013 : done
112.1 objects/s  (objects per second)
 
16 threads:
> quantile(read$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.012 0.024 0.028 0.035 0.040 0.052 0.149 
Tue Nov 19 14:11:50 PST 2013 : retrieving all objects
161.64user 264.68system 2:30.56elapsed 283%CPU (0avgtext+0avgdata 41104maxresident)k
0inputs+267608outputs (0major+64217330minor)pagefaults 0swaps
Tue Nov 19 14:14:20 PST 2013 : 16693 objects
Tue Nov 19 14:14:20 PST 2013 : done
110.9 objects/s  (objects per second)

 

Test 2: Simple Ingest into Fedora 4

Ingest all the data into fcrepo4 as datastreams binaries on objects containers.

Using jgroups configuration at https://gist.github.com/cbeer/fd3997e40fe014eab071

Using curl:

Test 2a: Ingest all the data as containers and binaries, one at a time

 

Test 2b: Ingest all the data as containers arranged in a druid tree

 

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
   0.6590    6.1800    9.1844   24.7972   44.7202  226.7644 1094.8120 

 

Ingest speed over time

Image Added

Image AddedImage Added

Test 2c: Ingest all the data as containers in a druid tree AND use fcr:batch

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
      0%      50%      70%      90%      95%      99%     100% 
  0.4670   5.2440   7.8554  20.3558  33.8186 101.2618 711.0130 

 

Image AddedImage AddedImage Added

Test 2d: Use a 4-node cluster to do a druid-tree ingest

Code Block
> quantile(create$V4, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
   0.9050    9.7360   13.9442   29.6692   48.0332  146.0208 1109.1760 


Image AddedImage AddedImage Added

Test 3: Realistic Ingest into Fedora 3

Ingest all the data into fcrepo3 making reasonable content modeling assumptions:

 - each page as an object

 - ? 

Using ActiveFedora:

Test 4: Realistic Ingest into Fedora 4

  • add RDF as properties on objects resources
  • Each page as a ordered same-name sibling on an object container 

 

Using ldp-client: