NLM

Observations

  1. External datastreams.  Most of our binaries are of type E (external).  The migration tool migrates the Fedora objects, but not the type E external binaries (as expected).  Thus we are left with object structure, metadata and RDF in OCFL format, but not the actual binaries themselves.  If, how, when and where to migrate external binaries to an OCFL structure is TBD, but it is a major consideration for us in adopting OCFL.
  2. Speed. The tool migrates objects at the rate of 15K-40K objects per hour. This should be manageable for our purpose.
  3. For the "citations" repository, it consistently takes 30 minutes to build the datastream index before starting the migration. This server has 3.8M managed datastreams (1 per object). The option to cache this index when resuming migrations is helpful.
  4. CPU time. Consumes about 30%.
  5. Layout. In flat and pairtree migrations the PID is used to form the path; for example, PID nlm:nlmuid-101588995-bk (stored FOXML file name nlm_nlmuid-101588995-bk) becomes /ocfl/nl/m+/nl/mu/id/-1/01/58/89/95/-b/k/5-bk. Characters such as – are problematic in Linux.  See FCREPO-3180 (DuraSpace JIRA).
  6. It would be nice to be able to declare another field, or an input map, that dictates the value used for layout path generation. For example, it may be nice to use 101588995_bk to generate a path for PID nlm:nlmuid-101588995-bk.  Also included in FCREPO-3180 (DuraSpace JIRA).
  7. Migrated datastreams have no file extension. It would be nice if migrated datastreams had a file extension inferred from the MIME type; e.g. DC.xml instead of just DC, and OCR.txt instead of just OCR. This should particularly help with in-line XML datastreams.  See FCREPO-3181 (DuraSpace JIRA).
  8. OCFL versions appear to be created based on datastream timestamps. Each unique timestamp creates a new OCFL version, even if they were part of the same Fedora version in the AUDIT trail and differed only by milliseconds.
  9. Add XML declarations for migrated in-line datastreams.  See FCREPO-3197 (DuraSpace JIRA).
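
The path mapping described in observation 5 can be sketched in a few lines of Python. This is an illustration only, not the migration tool's code: the function name and parameters are hypothetical, the character substitutions are the single-character ones from the Pairtree specification (the spec's hex-encoding of other special characters is omitted), and the final directory being the last four characters of the cleaned identifier is an assumption inferred from the one example above.

```python
# Single-character substitutions from the Pairtree spec's identifier cleaning.
# (The spec also hex-encodes some other characters; that step is left out here.)
PAIRTREE_SUBSTITUTIONS = {"/": "=", ":": "+", ".": ","}


def pairtree_path(pid: str, segment_length: int = 2, leaf_length: int = 4) -> str:
    """Return a pairtree-style relative path for a Fedora PID (hypothetical helper)."""
    cleaned = "".join(PAIRTREE_SUBSTITUTIONS.get(ch, ch) for ch in pid)
    # Split the cleaned identifier into fixed-width directory segments.
    segments = [
        cleaned[i:i + segment_length]
        for i in range(0, len(cleaned), segment_length)
    ]
    # Assumption: the object directory is named after the last `leaf_length`
    # characters of the cleaned identifier, as in the example above.
    leaf = cleaned[-leaf_length:]
    return "/".join(segments + [leaf])


print(pairtree_path("nlm:nlmuid-101588995-bk"))
```

For the PID above this yields the same relative path as the example, without the /ocfl/ storage root prefix.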

...

| Number of objects | Execution time | Source layout | Dest. layout | Migration tool version | Notes |
|---|---|---|---|---|---|
| 1000 | 4 min | legacy | pairtree | 11/26/19 | 1K fedora items produced 42K+ files |
| 1000 | 3 min | legacy | truncated | 11/26/19 | 1K fedora items produced 43K+ files |
| 100,000 | 6.5 hours | legacy | flat | 11/26/19 | |
| 1 million | ~3 days | legacy | pairtree | 11/26/19 | Execution crashed twice for "unable to delete staging" file issues; the resume option ran without issues |
| full run (4,656,669 items) | 7 days | legacy | pairtree | 2/4/20 | No issues observed for successful full migration run.  Required deployment of new filesystem with large inode limit. |
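
As a rough cross-check of the "15K-40K objects per hour" observation, the run times above can be converted to throughput. The sketch below uses only the figures reported for this repository; treating "~3 days" as exactly 72 hours is an assumption, and the function name is illustrative.

```python
def objects_per_hour(object_count: int, hours: float) -> float:
    """Throughput of a migration run in objects per hour."""
    return object_count / hours


# (label, number of objects, execution time in hours) taken from the runs above.
runs = [
    ("1000 objects, pairtree", 1000, 4 / 60),
    ("1000 objects, truncated", 1000, 3 / 60),
    ("100,000 objects, flat", 100_000, 6.5),
    ("1 million objects, pairtree", 1_000_000, 3 * 24),  # "~3 days" taken as 72 h
    ("full run, pairtree", 4_656_669, 7 * 24),
]

for label, count, hours in runs:
    print(f"{label}: {objects_per_hour(count, hours):,.0f} objects/hour")
```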


Repository environment #2: "citations". 

...

| Number of objects | Execution time | Source layout | Dest. layout | Migration tool version | Notes |
|---|---|---|---|---|---|
| 2000 | 32 min | akubra | pairtree | 11/26/19 | Includes 30 min to build the index.  Hung on completion: could not delete index. |
| 10,000 | 42 min | akubra | flat | 11/26/19 | Includes 30 min to build the index.  Hung on completion: could not delete index. |
| 554,695 | 13 hours | akubra | flat | 11/26/19 | Attempted to migrate 1M records.  Includes 30 min to build the index.  Crashed due to UnrecognizedPropertyException. |
| full run (3,830,777 items) | 5 days | akubra | truncated | 2/4/20 | No issues observed for successful full migration run. |


Brown Univ


University of Wisconsin - Madison

...

  1. Storage environment:  for the purposes of this test (and for our real migration), we are migrating from one CIFS-mounted remote filesystem to another CIFS-mounted remote filesystem.
  2. Speed: TBD
  3. Datastream index:  takes about 1h10m to build, and occupies 327MB of disk space.
  4. CPU time. Consumes about 15%.
  5. Source layout.  Akubra hash storage, using the pattern "#/##/##" for both datastreams and objects.  OCFL storage:  Pairtree.  It will be good when the OCFL storage profile specification is set and incorporated into migration-utils, so that we can define the OCFL layout, similar to how we can specify the Akubra filesystem layout.
  6. Average seconds per object is calculated based on the difference between the time the first object is processed (after the datastream index has been generated) and the time the last object is processed.
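
The calculation in item 6 can be sketched as follows. The function name is hypothetical, and dividing by the total number of objects is an assumption about how the average was derived. The example timestamps are made up so that they span 3d20h16m, the OCFL execution time reported for the 100,000-object run; that reported time may not coincide exactly with the first-to-last interval.

```python
from datetime import datetime, timedelta


def average_seconds_per_object(first_processed: datetime,
                               last_processed: datetime,
                               object_count: int) -> float:
    """Elapsed time between the first and last processed object, per object."""
    if object_count <= 0:
        raise ValueError("object_count must be positive")
    elapsed = last_processed - first_processed
    return elapsed.total_seconds() / object_count


# Hypothetical start time; only the elapsed interval matters here.
first = datetime(2020, 1, 1, 0, 0, 0)
last = first + timedelta(days=3, hours=20, minutes=16)

print(f"{average_seconds_per_object(first, last, 100_000):.1f} sec per object")
```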

Issues

Migration Tests

UW Digital Collections Center Production Repository

Fedora 3: Approx. 561,000 objects (559GB): mostly books, pages and still images, with some audio, video, and PDF resources.  Approximately 3.36 million datastreams (10.3TB). Content objects have one binary datastream and 5 XML metadata datastreams.  Container objects have ~5 XML metadata datastreams.  All datastreams are either inline or managed (no external or redirect datastreams).

Fedora 3.8.1.  Migration run on VM with 4 cores, 8 GB RAM.  CentOS Linux release 8.2.2004 (Core), Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz

Command run:

UW Madison migration-util command line (bash):
$ java -jar target/migration-utils-4.4.1-SNAPSHOT-driver.jar --migration-type=FEDORA_OCFL --source-type=akubra --datastreams-dir=/fedora3-prod/fedora/datastreams --objects-dir=/fedora3-prod/fedora/objects --target-dir=/fedora-migration-test --layout=pairtree --index-dir=/var/tmp/datastream-index
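
The batch runs reported for this repository reuse this command with an added --pid-file parameter. A minimal sketch of assembling the argument list per batch is shown below; the helper name is hypothetical, every flag and path is copied from the command above or from the run notes, and nothing here invokes the tool itself.

```python
import shlex
from typing import List, Optional

# Arguments copied from the UW Madison command line above.
BASE_COMMAND = [
    "java", "-jar", "target/migration-utils-4.4.1-SNAPSHOT-driver.jar",
    "--migration-type=FEDORA_OCFL",
    "--source-type=akubra",
    "--datastreams-dir=/fedora3-prod/fedora/datastreams",
    "--objects-dir=/fedora3-prod/fedora/objects",
    "--target-dir=/fedora-migration-test",
    "--layout=pairtree",
    "--index-dir=/var/tmp/datastream-index",
]


def build_migration_command(pid_file: Optional[str] = None) -> List[str]:
    """Return the argument list, optionally restricted to the PIDs in pid_file."""
    command = list(BASE_COMMAND)
    if pid_file:
        command.append(f"--pid-file={pid_file}")
    return command


# Print the command for the 1000-object batch; pass the list to
# subprocess.run(...) to actually execute it.
print(shlex.join(build_migration_command("1000pids.txt")))
```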


| Number of objects | Execution time | Average seconds per object | OCFL repository size | Migration tool version | Notes |
|---|---|---|---|---|---|
| 1000 | Datastream index: 1h17m; OCFL repo: 4h36m | 2.9 sec | 133GB | (81586bf) | with param --pid-file=1000pids.txt; datastream index cleared after run |
| 10,000 | Datastream index: 1h5m; OCFL repo: 11h48m | 4.3 sec | 147GB | (81586bf) | with param --pid-file=10000pids.txt; datastream index cleared after run.  Most objects are XML docs in this batch. |
| 100,000 | Datastream index: 1h9m; OCFL repo: 3d20h16m | 3.3 sec | 1.6TB | (4a9f19c) | with param --pid-file=100000pids.txt; datastream index cleared after run |
| All 561,000 | Datastream index: 1h10m; OCFL repo: 20d21h12m | 3.2 sec | 9TB | (43b7bae) | all pids |