
Introduction

The Sync Tool is a utility that provides a simple way to move files from a local file system to DuraCloud and then keep the files in DuraCloud synchronized with those on the local system.

...

Code Block
java -Xmx1g -jar duracloudsync-{version}.jar

Alternatively, you can set the system environment variable JAVA_TOOL_OPTIONS to a value like "-Xmx1g", which will be picked up by the SyncTool on startup, meaning that you can start up the SyncTool UI as usual.

To run the SyncTool in command-line mode with 1 GB of heap memory space, download the Jar version of the SyncTool and execute the above command followed by the command-line parameter values.

Large Files

When the SyncTool encounters a large file (by default, 1 GB or more; this threshold can be raised to as much as 5 GB via the --max-file-size parameter), it will "chunk" that file prior to transfer to DuraCloud. This means that the file will land in DuraCloud as multiple components, with an associated manifest file that lists the component files and the checksum of each. As part of this process, the SyncTool creates a local temporary file for each chunk prior to transfer; this allows the tool to generate the checksum for that chunk and to retry on failure. These temporary files are stored in the default Java temp directory.

The number and size of temp files which may be created depends on the number of threads and the max chunk size settings. Each thread can create one temp file at a time, and each temp file can be as large as the max chunk size. So multiplying the number of threads by the max chunk size gives the maximum number of GB that may be consumed on local storage at one time. The SyncTool removes temp files as transfers complete, but if the tool is terminated abruptly, some of those temp files can be orphaned (and may require manual cleanup).
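The worst-case figure described above can be worked out with a little shell arithmetic. This is a minimal sketch that assumes the max chunk size is the value given to --max-file-size; the function name max_temp_gb is made up for illustration and is not part of the SyncTool.

```shell
# Worst-case local temp space, in GB = number of threads x max chunk size.
# Assumption: the max chunk size equals the --max-file-size setting.
max_temp_gb() {
  threads=$1       # value passed to --threads (default 3)
  max_file_gb=$2   # value passed to --max-file-size (default 1, at most 5)
  echo $((threads * max_file_gb))
}

max_temp_gb 3 1   # default settings: prints 3
max_temp_gb 8 5   # 8 threads with 5 GB chunks: prints 40
```

With the default settings, the temp directory should therefore have at least 3 GB free before a sync of large files begins.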

To change where temp files are stored, set the "java.io.tmpdir" system property. On the command line, add "-Djava.io.tmpdir=/yourpath" after "java". Alternatively, you can set the system environment variable JAVA_TOOL_OPTIONS to this value ("-Djava.io.tmpdir=/yourpath") and it will be picked up as the tool starts.
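As a sketch of the environment-variable approach, the heap size and the temp directory can be set together, since JAVA_TOOL_OPTIONS accepts several JVM options separated by spaces. The directory /tmp/duracloud-sync-tmp is a hypothetical location chosen for illustration; substitute a path on a volume with enough free space.

```shell
# Hypothetical temp location (an assumption for this example, not a SyncTool default).
SYNC_TMP=/tmp/duracloud-sync-tmp
mkdir -p "$SYNC_TMP"

# The JVM reads JAVA_TOOL_OPTIONS at startup; options are separated by spaces.
JAVA_TOOL_OPTIONS="-Xmx1g -Djava.io.tmpdir=$SYNC_TMP"
export JAVA_TOOL_OPTIONS

echo "$JAVA_TOOL_OPTIONS"
```

Any Java program started from the same shell session will pick up these options, so it may be preferable to set the variable only in the script that launches the SyncTool.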

Prerequisites

Info

As of DuraCloud version 7.0.0, the Sync Tool requires Java 11 to run. The latest version of Java can be downloaded from here.

  • You must have Java version 11 or above installed on your local system. If Java is not installed, or if a previous version is installed, you will need to download and install Java. To determine if the correct version of Java is installed, open a terminal or command prompt and enter

    Code Block
    java -version

    The version displayed should be 11.0.0 or above. If running this command generates an error, Java is likely not installed.

  • You must have downloaded the Sync Tool. It is available as a link near the top of this page.
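The Java version check from the prerequisites can also be scripted. The sketch below assumes a minimum major version of 11 and the usual layout of the java -version output, where the version string appears in double quotes. The helper name java_major_version is made up for illustration.

```shell
# Extract the major version from a Java version string.
# Legacy strings look like "1.8.0_292" (major 8); newer ones like "11.0.2" (major 11).
java_major_version() {
  version=$1
  case "$version" in
    1.*) version=${version#1.} ;;   # drop the legacy "1." prefix
  esac
  echo "${version%%[!0-9]*}"        # keep only the leading digits
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the quoted part of the "version" line.
  raw=$(java -version 2>&1 | grep ' version "' | head -n 1 | cut -d '"' -f 2)
  major=$(java_major_version "$raw")
  if [ "${major:-0}" -ge 11 ]; then
    echo "Java $raw is new enough"
  else
    echo "Java $raw is too old; version 11 or above is required"
  fi
else
  echo "Java does not appear to be installed"
fi
```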

...

  • To run the Sync Tool, open a terminal or command prompt and navigate to the directory where the Sync Tool is located
  • To display the help for the Sync Tool, run

    Code Block
     java -jar duracloudsync-{version}.jar --help


  • When running the Sync Tool for the first time, you will need to use these options:

    Short Option

    Long Option

    Argument Expected

    Required

    Description

    Default Value (if optional)

    -h

    --host

    Yes

    Yes

    The host address of the DuraCloud DuraStore application 


    -r

    --port

    Yes

    No

    The port of the DuraCloud DuraStore application

    443

    -i

    --store-id

    Yes

    No

    The Store ID for the DuraCloud storage provider

    The primary storage provider is used

    -s

    --space-id

    Yes

    Yes

    The ID of the DuraCloud space where content will be stored 


    -u

    --username

    Yes

    Yes

    The username necessary to perform writes to DuraStore

     


    -p

    --password

    Yes

    No

    The password necessary to perform writes to DuraStore. If not specified, the sync tool will first check whether an environment variable named "DURACLOUD_PASSWORD" exists. If it does, the sync tool will use its value as the password; otherwise you will be prompted to enter the password. Please note that when using the environment variable or the -p parameter, you must escape your password according to the conventions of your command-line shell. If you're using bash, for example, any dollar ($) or backslash (\) characters must be escaped with a backslash, so the password p$ssw\rd would need to be entered as p\$ssw\\rd. There are many other special characters that need to be escaped. Here is a list of bash special characters for your reference.

    Not set

    -c

    --content-dirs

    Yes

    Yes

    A list of the directory paths to monitor and sync with DuraCloud. If multiple directories are included in this list, they should be separated by a space.

     


    -j

    --jump-start

    No

    No

    This option indicates that the sync tool should not attempt to check if content to be synchronized is already in DuraCloud, but should instead transfer all content. This option is best used for new data sets.

    Not set

    -a

    --prefix

    Yes

    No

    A prefix to be added to the content IDs of all files in the content directories as they are added to DuraCloud. The same prefix applies to all files in all content directories. For example, if a content directory is C:/users/bob/pictures with one file in it, C:/users/bob/pictures/001.jpg, and the prefix value is "bobs-pictures/", the file would be given a DuraCloud content ID of bobs-pictures/001.jpg. Note that the name of the content directory is not included in the path, so if you would like for it to appear as part of the content ID, you will need to include it in the prefix. Also note that the prefix does not need to be a directory name, it can be any value. If, however, you would like for it to appear as a directory path, do not forget to end the prefix with a "/" character.

    Not set

    -w

    --work-dir

    Yes

    No

    The state of the sync tool is persisted to this directory. If not specified, this value will default to a directory named duracloud-sync-work in the user's home directory.

    duracloud-sync-work

    -f

    --poll-frequency

    Yes

    No

    The time (in ms) to wait between each poll of the content directories

    10000 (10 seconds)

    -t

    --threads

    Yes

    No

    The number of threads in the pool used to manage file transfers

    3

    -m

    --max-file-size

    Yes

    No

    The maximum size of a stored file in GB (value must be between 1 and 5); larger files will be split into pieces

    1

    -n

    --rename-updates <suffix>

    No

    No

    Indicates that when a local file is changed, the original copy of the file in DuraCloud should be renamed prior to the new local version being transferred to DuraCloud. The newest version of the file will retain the original file name while older versions will have a suffix value along with a date appended to the name. For example, a local file named "myfile.txt" is copied to DuraCloud by the SyncTool. The local file is updated, and the SyncTool is run again with this parameter enabled. The result is that DuraCloud will contain "myfile.txt", which is the updated version of the file, and "myfile.txt.orig.<date>" (with <date> replaced by the date on which the file was updated) which is the original version of the file. If "myfile.txt" is updated again, another version file will be created.

    Specify an optional suffix to override the default ("orig"). To prevent updates altogether, see option -o. (Note that this option cannot be used together with either the -o or the -d option.)

    orig

    -o

    --no-update

    No

    No

    Indicates that changed files should not be updated. In order to perform updates without overwriting, see option -n.

     

    -d

    --sync-deletes

    No

    No

    Indicates that deletes performed on files within the content directories should also be performed on those files in DuraCloud; if this option is not included all deletes are ignored

    Not set

    -x

    --exit-on-completion

    No

    No

    Indicates that the sync tool should exit once it has completed a scan of the content directories and synced all files; if this option is included, the sync tool will not continue to monitor the content dirs

    Not set

    -l

    --clean-start

    No

    No

    Indicates that the sync tool should perform a clean start, ensuring that all files in all content directories are checked against DuraCloud, even if those files have not changed locally since the last run of the sync tool

    Not set

    -e

    --exclude

    Yes

    No

    The full path to a file which specifies a set of exclusion rules. The purpose of the exclusion rules is to indicate that certain files or directories should not be transferred to DuraCloud. The rules must be listed one per line in the file. The rules will match only on the name of a file or directory, not an entire path, so path separators should not be included in rules. Rules are not case sensitive (so a rule "test.log" will match a file "test.LOG"). The rules may include wildcard characters ? and *. The ? matches a single character, while * matches 0 or more characters.

    Examples of valid rules:
    test.txt     :  Will match a file named "test.txt"
    test         :  Will match a file or directory named "test"
    test.*       :  Will match files like "test.jpg", "test.txt", "test.doc", etc.
    *.log        :  Will match files named "test.log" or "daily-01-01-2050.log" as well as a directory named ".log"
    backup-19??  :  Will match a directory named "backup-1999" but not "backup-190000" or "backup-2000"

    Not set
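The matching behavior described for the exclusion rules (-e) can be approximated in the shell to try a rule out before adding it to an exclusion file. This is only a sketch of the documented semantics, not the Sync Tool's own implementation: the function name matches_rule is made up, and shell patterns also treat "[" as special, which the rules described above do not mention.

```shell
# Returns 0 when a file or directory name matches an exclusion rule.
# Both sides are lowercased so that matching is not case sensitive.
matches_rule() {
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  rule=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    $rule) return 0 ;;   # left unquoted so that ? and * act as wildcards
    *)     return 1 ;;
  esac
}

matches_rule "test.LOG" "test.log" && echo "test.LOG is excluded by test.log"
matches_rule "backup-1999" "backup-19??" && echo "backup-1999 is excluded by backup-19??"
matches_rule "backup-2000" "backup-19??" || echo "backup-2000 is not excluded by backup-19??"
```

Because the rules match on names only, the function is meant to be called with a bare file or directory name rather than a full path.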


...

Running the Sync Tool in exit-on-completion mode works best when the tool is run on a scheduled basis. A popular choice for handling this type of task is the cron utility. To run daily using cron, a script should be placed in /etc/cron.daily. The script would look something like:


Code Block
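#!/bin/bash

# Skip this run if a Sync Tool process is already active.
if ps -ef | grep -v grep | grep duracloudsync ; then
  echo 'DuraCloud Sync is Running'
  exit 0
else
  echo 'Starting DuraCloud Sync'
  # Start the tool in the background and append its output to a log file.
  java -jar duracloudsync-{version}.jar -x [parameters] >> ~/synctool-output.log 2>&1 &
  exit 0
fi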

The -x parameter is included here to ensure the Sync Tool exits after completing its run. This script also includes a check to ensure that the tool is not already running before starting.
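When the tool is started by cron there is no terminal at which to answer a password prompt, so the password would typically come from the DURACLOUD_PASSWORD environment variable or the -p option. As a sketch, assuming the variable is set in the script before the java command runs, single quotes avoid the backslash escaping otherwise needed in bash. The password p$ssw\rd is the sample value from the options table. A script that contains a password should be readable only by its owner.

```shell
# Inside single quotes every character is literal, so $ and \ need no escaping.
DURACLOUD_PASSWORD='p$ssw\rd'
export DURACLOUD_PASSWORD

# The same value written without quotes, using backslash escapes.
escaped=p\$ssw\\rd

[ "$DURACLOUD_PASSWORD" = "$escaped" ] && echo "both forms yield the same password"
```

A single quote character cannot appear inside a single-quoted string, so a password containing one still needs a different quoting style.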