
 

Data Overview

Stanford has a collection of publications (Saltworks) consisting of page images, metadata, and arrangement information. It contains 16,712 objects, 655,237 items, and 273 GB of data, with the following distributions of file sizes and per-object file counts:

 

> quantile(file_sizes$size, c(0, .5, .7, .9, 1))
0% 50% 70% 90% 100%
0 43447 195719 1835010 288032768

> quantile(file_counts$X1, c(0, .5, .7, .9, 1))
0% 50% 70% 90% 100% 
1 22 28 62 1478
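The raw numbers behind these quantiles could be gathered with a short script. The following is a minimal sketch, not the method actually used for this page: it assumes one directory per object under an assets root, GNU `stat` (as in the ingest script further down), and sizes measured in bytes. The function names `file_sizes` and `file_counts` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers that emit the raw values a quantile() call would summarize.
# Assumes one directory per object under the assets root passed as $1.

# Print the size in bytes of every regular file below the root, one per line.
file_sizes() {
  find "$1" -type f -exec stat -c '%s' {} +
}

# Print the number of regular files in each object directory, one per line.
file_counts() {
  local dir
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue
    # tr strips the padding some wc implementations add
    find "$dir" -type f | wc -l | tr -d ' '
  done
}

# Example (paths are assumptions):
#   file_sizes  /data-ro/assets > file_sizes.txt
#   file_counts /data-ro/assets > file_counts.txt
```

Each output file holds one number per line, which is easy to load as a single-column table for summarizing.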

 

 

 

In production, the object metadata is stored in Fedora, but the page images and other assets are stored on the file system. How those files are associated back to the object is still to be determined.

Objects contain the following datastreams:

Datastream ID       MIME Type
DC                  text/xml
RELS-EXT            text/xml
extracted_entities  application/xml
location            text/xml
zotero              application/rdf+xml

 

Each object's directory on the filesystem holds a variety of files (possibly including some duplicates of data in Fedora), e.g.:

  • 4.0K DC
  •  20K Feigenbaum_00013946-METS.xml
  • 4.0K Feigenbaum_00013946-TEXT.xml
  • 4.0K RELS-EXT
  •  44K bd826tf2716.pdf
  • 4.0K bd826tf2716.txt
  •  72K bd826tf2716_00001.jp2
  • 8.0K bd826tf2716_00001.xml
  •  64K bd826tf2716_00002.jp2
  • 8.0K bd826tf2716_00002.xml
  •  68K bd826tf2716_000BW.jp2
  • 4.0K checksum
  • 4.0K descMetadata
  • 4.0K extracted_entities.xml
  • 4.0K flipbook.json
  • 4.0K flipbook.old
  •    0 location
  •    0 properties
  •    0 stories
  • 4.0K thumb.jpg
  • 4.0K zotero.xml

 

Test 1: Simple Ingest into Fedora 3

For a first test, we're going to ingest all the data from the filesystem into a clean fcrepo3 repository, using the filename as the datastream name.

Fedora 3.7.1, clean install, using these properties:

database=mysql
database.driver=included
database.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.jdbcDriverClass=com.mysql.jdbc.Driver
database.mysql.driver=included
database.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.mysql.jdbcURL=jdbc\:mysql\://localhost/fedora?useUnicode\=true
database.username=fedora
database.password=redacted
install.type=custom
deploy.local.services=false
install.tomcat=false
servlet.engine=existingTomcat
fedora.home=/home/lyberadmin/apps/fedora/home
fedora.serverHost=sul-fedora-dev-a.stanford.edu
fedora.serverContext=fedora
tomcat.http.port=8080
tomcat.shutdown.port=8005
ssl.available=true
tomcat.ssl.port=8443
tomcat.home=/usr/share/tomcat6
ri.enabled=true
messaging.enabled=false
messaging.uri=
apim.ssl.required=false
apia.ssl.required=false
apia.auth.required=false
fesl.authz.enabled=false
fesl.authn.enabled=true
xacml.enabled=false
keystore.file=included

 

Using bash:

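#!/bin/bash
# Ingest object directories from /data-ro/assets into Fedora 3 and time each object.
# Object IDs are read from stdin, one per line; THREADS sets the number of parallel jobs.
base_url="http://fedoraAdmin:fedoraAdmin@localhost/fedora"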

RuntimePrint()
{
 duration=$(echo "scale=3;(${m2t}-${m1t})/(1*10^09)"|bc|sed 's/^\./0./')
 echo -e "${objectId} ${datastreams} ${size} ${duration}\tsec"
 echo -e "${objectId} ${datastreams} ${size} ${duration}" >> /data/fcrepo3-total-create-object-time
}

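# Create the object, then add every file in its directory as a managed datastream.
CreateObject() {
    pid="druid:$1"
    datastreams=0
    size=0
    curl -X POST "$base_url/objects/$pid" &> /dev/null
    cd "/data-ro/assets/$1" || return 1

    # Glob instead of parsing ls, and quote names so unusual filenames survive
    for f in *; do
      datastreams=$((datastreams + 1))
      size=$((size + $(stat -c "%s" "$f")))
      curl -X POST --data-binary "@$f" "$base_url/objects/$pid/datastreams/$f?controlGroup=M"  &> /dev/null
    done
    cd /data
}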

BenchmarkObject() {
  objectId=$1
  if [ -d /data-ro/assets/$objectId ]; then
    m1t=$(date +%s%N); m1l=$LINENO
    CreateObject $objectId
    m2t=$(date +%s%N); m2l=$LINENO; RuntimePrint
  fi
}

export -f BenchmarkObject
export -f CreateObject
export -f RuntimePrint
export base_url

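# "--env _" relies on "parallel --record-env" having been run beforehand in a clean shell.
# THREADS defaults to 1 if unset; object IDs arrive on stdin.
parallel -P "${THREADS:-1}" --env _ BenchmarkObject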
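Because the script reads object IDs from standard input, it needs a list of asset directory names to work through. A minimal sketch of one way to produce that list, assuming one directory per object under the assets root; the function name `list_object_ids` and the script name `ingest.sh` are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: print one object ID (directory name) per line.
list_object_ids() {
  local root=$1 dir
  for dir in "$root"/*/; do
    # Skip the unexpanded pattern when the root has no subdirectories
    [ -d "$dir" ] || continue
    basename "$dir"
  done
}

# Example invocations (script name is an assumption):
#   list_object_ids /data-ro/assets | THREADS=1 ./ingest.sh   # single-threaded
#   list_object_ids /data-ro/assets | THREADS=8 ./ingest.sh   # eight parallel jobs
```

Plain files in the assets root are ignored, so only directories are passed on as object IDs.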

 

Test 1a: Single-threaded ingest

> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1))
       0%       50%       70%       90%       95%       99%      100% 
  0.32300   1.32700   2.16100   6.30100  11.10600  36.88524 338.28700 

 

Test 1b: Single-threaded iteration

 

Retrieve object profile

> quantile(data$V2, c(0, .5, .7, .9, .95, .99, 1))
   0%   50%   70%   90%   95%   99%  100% 
0.002 0.054 0.062 0.077 0.089 0.123 3.580 
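The script used for this iteration test is not recorded on this page. As a rough sketch of how a per-object retrieval time could be measured, the following uses the Fedora 3 REST call for an object profile (`GET /objects/{pid}?format=xml`) and curl's `%{time_total}` write-out variable. It assumes the same base URL and `druid:` PID prefix as the ingest script; the function names `profile_url` and `TimeProfile` are hypothetical.

```shell
#!/bin/bash
# Assumption: same base URL as the ingest script unless overridden by the caller.
base_url="${base_url:-http://fedoraAdmin:fedoraAdmin@localhost/fedora}"

# Build the Fedora 3 REST URL for an object's profile.
profile_url() {
  echo "$base_url/objects/druid:$1?format=xml"
}

# Fetch the profile, discard the body, and print "<objectId> <seconds>".
TimeProfile() {
  local seconds
  seconds=$(curl -s -o /dev/null -w '%{time_total}' "$(profile_url "$1")")
  echo "$1 $seconds"
}

# Example: time every ID read from stdin
#   while read -r id; do TimeProfile "$id"; done < ids.txt
```

Letting curl report the elapsed time avoids the `date` and `bc` arithmetic needed in the ingest script, at the cost of measuring only the HTTP request itself.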

 

Test 1c: 8-thread test

 

 

Test 1d: 8-thread iteration test

 

Test 2: Simple Ingest into Fedora 4

Ingest all the data into fcrepo4 as datastreams on objects.

Using curl:
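The script for this test is not recorded on this page. The following is only a sketch of what an equivalent of `CreateObject` might look like against Fedora 4, assuming a release-era REST endpoint at `/rest` where `PUT` creates a resource at the given path (early alpha builds used different paths). The endpoint URL, the `DRY_RUN` switch, the generic content type, and the names `run` and `CreateObject4` are all assumptions.

```shell
#!/bin/bash
# Assumption: Fedora 4 REST endpoint; override f4_url for a real deployment.
f4_url="${f4_url:-http://localhost:8080/rest}"

# Run a command, or just print it when DRY_RUN=1 (hypothetical switch for testing).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@" > /dev/null 2>&1
  fi
}

# Create a container for the object, then PUT each file beneath it as a binary.
CreateObject4() {
  local id=$1 dir=${2:-/data-ro/assets/$1} f
  run curl -X PUT "$f4_url/$id"
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # A generic content type is assumed; real MIME types would be better
    run curl -X PUT -H "Content-Type: application/octet-stream" \
      --data-binary "@$f" "$f4_url/$id/$(basename "$f")"
  done
}
```

Running with `DRY_RUN=1` prints the curl commands instead of executing them, which makes it possible to check the generated paths before touching a repository.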

Test 3: Realistic Ingest into Fedora 3

Ingest all the data into fcrepo3 making reasonable content modeling assumptions:

 - each page as an object

 - ? 

Using ActiveFedora:

Test 4: Realistic Ingest into Fedora 4

  • Add RDF as properties on objects
  • Each page as an ordered same-name sibling on an object

 

Using ldp-client:

 

 
