
New ModeShape FileSystemConnector option to compute hash from path rather than full content

 

The rate-limiting step when using the FileSystemConnector with large files is computing the checksum that serves as each file's binary key. The checksum is computed when the file is accessed, which can take on the order of hours for files in the 100 GB range. Several alternatives were assessed: using the openssl library to create the checksums, using a faster checksum algorithm (MD5), caching the computation, and computing a checksum from only the beginning and ending bytes of a large file. Ultimately it was decided to add a new option to the ModeShape FileSystemConnector. As of ModeShape 3.6.0, the contentBasedSha1 option, set on an external source in repository.json, controls how the hash is computed: true hashes the full content of the datastream, and false hashes the path string instead. The default is true.

repository.json

 

"externalSources" : {
    "federated-directory" : {
        "classname" : "org.modeshape.connector.filesystem.FileSystemConnector",
        "directoryPath" : "a/path/here",
        "projections" : [ "default:/federated-directory => /" ],
        "extraPropertiesStorage" : "none",
        "contentBasedSha1" : "false"
    } ,
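To make the difference between the two settings concrete, the following minimal Python sketch contrasts hashing a path string with hashing a file's full content. It is an illustration only, not ModeShape's implementation: the function names are hypothetical, and encoding the path as UTF-8 before hashing is an assumption, since the exact bytes ModeShape feeds into the hash are not described here.

```python
import hashlib
import os
import tempfile


def path_based_sha1(path: str) -> str:
    """Hash only the path string: the cost does not depend on file size."""
    # Assumption: the path is encoded as UTF-8 before hashing.
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def content_based_sha1(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the full file content: every byte of the file must be read."""
    digest = hashlib.sha1()
    with open(path, "rb") as stream:
        # Read in fixed-size chunks so a large file never sits fully in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo() -> dict:
    """Write a small file, then compute both kinds of hash for it."""
    with tempfile.TemporaryDirectory() as tmp:
        file_path = os.path.join(tmp, "example.bin")
        with open(file_path, "wb") as out:
            out.write(b"abc")
        return {
            "path": path_based_sha1(file_path),
            "content": content_based_sha1(file_path),
        }


if __name__ == "__main__":
    print(demo())
```

Note that the path-based value changes if the file is moved or renamed, while the content-based value changes only if the bytes of the file change.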

 

 

The tradeoff is that with contentBasedSha1=false the datastream has no SHA-1 hash of its content. If the full content hash is needed but performance is an issue, consider using LargeFileSystemConnector, which computes the hash lazily when getHash() or getHexHash() is called on the BinaryValue object. It stores the hash but does not use it as the binary key.
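The lazy approach can be sketched in Python as follows. This is a hedged illustration of the general pattern, not the LargeFileSystemConnector code: the class LazyHashedBinary and its methods get_hash() and get_hex_hash() are hypothetical names that loosely mirror getHash() and getHexHash(), and keeping the computed hash in memory is an assumption, since where the connector stores the hash is not described here.

```python
import hashlib
import os
import tempfile


class LazyHashedBinary:
    """Hypothetical stand-in for a binary value with a lazily computed hash."""

    def __init__(self, path: str):
        self.path = path
        # The key is derived from the path, so construction reads no content.
        self.key = hashlib.sha1(path.encode("utf-8")).hexdigest()
        self._content_hash = None   # filled in on the first request
        self.hash_computations = 0  # counts full reads, for demonstration only

    def get_hash(self) -> bytes:
        """Return the content SHA-1, computing it only on the first call."""
        if self._content_hash is None:
            digest = hashlib.sha1()
            with open(self.path, "rb") as stream:
                for chunk in iter(lambda: stream.read(1024 * 1024), b""):
                    digest.update(chunk)
            self._content_hash = digest.digest()
            self.hash_computations += 1
        return self._content_hash

    def get_hex_hash(self) -> str:
        """Return the content SHA-1 as a hexadecimal string."""
        return self.get_hash().hex()


def demo():
    """Show that the file is hashed once, and only when the hash is requested."""
    with tempfile.TemporaryDirectory() as tmp:
        file_path = os.path.join(tmp, "large.bin")
        with open(file_path, "wb") as out:
            out.write(b"abc")
        binary = LazyHashedBinary(file_path)
        before = binary.hash_computations
        first = binary.get_hex_hash()
        second = binary.get_hex_hash()
        return before, first, second, binary.hash_computations, binary.key
```

In this sketch the expensive read is deferred until a caller actually asks for the hash, and repeated calls reuse the stored value, while the key used to identify the binary never depends on the content.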

 

The ModeShape File System Connector can project one or more file/folder hierarchies from the file system into the repository.

 

Within your fcrepo4 repository you'll find this file: fcrepo-webapp/src/main/resources/spring/repo.xml
Look for the repositoryConfiguration property; it determines which repository.json file is used to configure the repository.
 
All external sources must be defined in that repository.json file before startup; they cannot be added once Fedora is running.
 
This example configuration shows a single source pointing to the /tmp directory, which is projected into the default workspace at /projection.
 
"externalSources" : {
    "system-tmp" : {
        "classname" : "org.modeshape.connector.filesystem.FileSystemConnector",
        "directoryPath" : "/tmp",
        "projections" : [ "default:/projection => /" ]
    }
}
 
You'll find all the options for this configuration here: https://docs.jboss.org/author/display/MODE/File+system+connector
 
The important ones are directoryPath, which defines the file or folder that serves as the external source, and projections, which defines all the projections for the given external source (more can be added at run time).
 
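The syntax of a projection is "workspace:/path/within/workspace => /path/within/external/source". The path on the right is resolved relative to the source's directoryPath, so "/" refers to the configured directory itself.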
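As a rough illustration of how a projection expression breaks down, here is a small Python sketch that splits one into its three parts. It is not ModeShape's parser: the Projection type and parse_projection function are hypothetical, and the sketch assumes that both paths begin with "/" and that the workspace name contains no colon.

```python
from typing import NamedTuple


class Projection(NamedTuple):
    workspace: str       # workspace name, left of the colon
    workspace_path: str  # node path inside the workspace
    source_path: str     # path on the right-hand side of "=>"


def parse_projection(expression: str) -> Projection:
    """Split 'workspace:/workspace/path => /source/path' into its parts."""
    left, arrow, right = expression.partition("=>")
    if not arrow:
        raise ValueError(f"missing '=>' in projection: {expression!r}")
    workspace, colon, workspace_path = left.strip().partition(":")
    if not colon or not workspace.strip():
        raise ValueError(f"missing workspace name in projection: {expression!r}")
    workspace_path = workspace_path.strip()
    source_path = right.strip()
    # Assumption: both sides are absolute paths.
    if not workspace_path.startswith("/") or not source_path.startswith("/"):
        raise ValueError(f"paths must start with '/': {expression!r}")
    return Projection(workspace.strip(), workspace_path, source_path)


if __name__ == "__main__":
    print(parse_projection("default:/projection => /"))
```

Applied to the example configuration, the expression "default:/projection => /" yields the workspace default, the workspace path /projection, and the right-hand path /.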
 
Once you have added your external sources, you can start Fedora, and your /tmp directory should be available at /projection in the default workspace of the repository.