...
In the end, it sounds like our best option may be to ask the DuraCloud Team to create a way to turn off the "auto-delete" option of the DuraCloud Sync Tool. Bill Branan seemed open to this idea and is investigating it (from the sounds of it, he thought it would be a worthwhile and hopefully not too difficult change).
Here's a basic example of a future DSpace/DuraCloud interaction workflow that may allow us to perform a "trickle" synchronization of content:
...
- If "auto-delete" is turned OFF, DuraCloud will never remove any content until you explicitly tell it to. This means we may need a "cleanup" script, or an "audit" script, that can remove files which still exist in DuraCloud but have been removed from DSpace.
- DuraCloud will accept updates to files. If you place a file into your sync folder with the name "ITEM-123454678-1.zip", and DuraCloud already has a file of that name in its storage space, DuraCloud will compare the files (via checksum). If they are different, the new file will overwrite the old file.
- BIG ISSUE: If you accidentally switch the "auto-delete" option back ON, DuraCloud may auto-delete all/most of your content. Bill Branan & I discussed this as a major concern that we need to resolve in some way. Maybe the DuraCloud Sync Tool needs to default to not auto-deleting content? Or, at the very least, it should explicitly WARN you if you try to turn ON "auto-delete".
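As a rough illustration of the "cleanup" idea above, the core of such a script could be a simple set comparison between the AIP file names DSpace currently exports and the names stored in DuraCloud. This is only a sketch: the function names are hypothetical, and how the two name lists are obtained (a DSpace export listing, a DuraCloud space listing) is assumed to happen elsewhere.

```python
def find_orphans(dspace_names, duracloud_names):
    """Names still stored in DuraCloud that DSpace no longer has.

    These are the candidates a cleanup script might delete
    (after a human has reviewed the list).
    """
    return sorted(set(duracloud_names) - set(dspace_names))


def find_missing(dspace_names, duracloud_names):
    """Names DSpace has that have not yet reached DuraCloud."""
    return sorted(set(dspace_names) - set(duracloud_names))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    in_dspace = ["ITEM-1.zip", "ITEM-2.zip"]
    in_duracloud = ["ITEM-1.zip", "ITEM-3.zip"]
    print("Orphaned in DuraCloud:", find_orphans(in_dspace, in_duracloud))
    print("Not yet synced:", find_missing(in_dspace, in_duracloud))
```

Reporting the orphans rather than deleting them immediately seems the safer default, given the concern about accidental mass deletion noted above.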
Auditing Functionality
Once content is in DuraCloud, we need a way to audit that content and compare it to what is currently in DSpace.
...
- Export an AIP for a random DSpace Object (or a chosen one) to local filesystem
- Generate a local checksum of the exported AIP
- Using the DuraCloud REST API, compare that local checksum with the checksum for that item as stored in DuraCloud
- If the checksums match, then the content is identical (successful audit)
- If they don't match, then you know one or the other is out of sync
- Repeat as necessary for some/all objects in DSpace
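The audit steps above could be sketched roughly as follows. This assumes the checksum is an MD5 hex digest (the checksum algorithm is not specified here), and it leaves the DuraCloud lookup as a caller-supplied function, `lookup_remote_checksum`, which is a hypothetical placeholder for whatever REST API call returns the stored checksum. The function names are likewise illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_aip(local_path, remote_checksum):
    """Return True if the exported AIP matches the remote checksum."""
    return md5_of_file(local_path) == remote_checksum.strip().lower()


def audit_all(local_paths, lookup_remote_checksum):
    """Audit several exported AIPs.

    local_paths maps a content name to the path of its exported AIP.
    lookup_remote_checksum is a hypothetical callable that takes a
    content name and returns the checksum stored in DuraCloud.
    Returns the sorted names whose checksums do not match.
    """
    out_of_sync = []
    for name, path in sorted(local_paths.items()):
        if not audit_aip(path, lookup_remote_checksum(name)):
            out_of_sync.append(name)
    return out_of_sync
```

A mismatch only says that the two copies differ, not which side is stale, so the returned names would still need to be investigated by hand.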