In the current DSpace design, database transactions are in most cases relatively long: they run from Context creation to the moment the Context is completed. During batch processing in particular, that transaction can become very long. The new Hibernate-based data access layer introduced in DSpace 6 has built-in cache and auto-update mechanisms, but these mechanisms do not work well with long transactions and can even have an exponentially adverse effect on performance.

Therefore we added a new method enableBatchMode() to the DSpace Context class that tells our database connection that we are going to do some batch processing. The database connection (Hibernate in our case) can then optimize itself to deal with a large number of inserts, updates and deletes. Hibernate will then no longer postpone update statements, which is better for batch processing. The method isBatchModeEnabled() lets you check whether the current Context is in "batch mode".

When dealing with a lot of records, it is also important to manage the size of the (Hibernate) cache. A large cache can lead to decreased performance and eventually to "out of memory" exceptions. To help developers manage the cache, a method getCacheSize() was added to the DSpace Context class; it returns the number of database records currently cached by the database connection. Another new method, clearCache(), clears the cache and frees up (heap) memory. It is recommended that you clear the cache when its size is greater than 2000 records (in batch mode). Besides the clearCache() method, the commit() method in the DSpace Context class will also clear the cache; in addition, it flushes all pending changes to the database and commits the current database transaction. The database changes will then be visible to other threads.

BUT clearCache() and commit() come at a price. After calling either method, all previously fetched entities (Hibernate terminology for database records) are "detached" (pending changes are not tracked anymore) and cannot be combined with "attached" entities. If you change a value in a detached entity, Hibernate will not automatically push that change to the database. If you still want to change a value of a detached entity, or if you want to use that entity in combination with attached entities (e.g. adding a bitstream to an item) after you have cleared the cache, you first have to reload that entity. Reloading means asking the database connection to re-add the entity from the database to the cache and return a new object reference to the required entity. From then on, it is important that you use that new object reference. To simplify the process of reloading detached entities, we've added a reloadEntity(ReloadableEntity entity) method to the DSpace Context class, together with a new interface ReloadableEntity. This method gives the caller a new "attached" reference to the requested entity. All DSpace Objects and some extra classes implement the ReloadableEntity interface so that they can be easily reloaded.
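
The difference between a detached and a reloaded reference can be illustrated with a toy model. The sketch below does not use DSpace or Hibernate: ToySession, Record, insert(), stored(), load(), reload() and flush() are hypothetical names invented for this illustration, and the real cache behaves in a more sophisticated way. It only shows why a change made through an old reference is lost after the cache is cleared, and why the reloaded reference must be used instead.

```java
import java.util.HashMap;
import java.util.Map;

public class ReloadSketch {

    /** Hypothetical entity with an ID, standing in for a ReloadableEntity. */
    static class Record {
        final int id;
        String name;

        Record(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Toy session: a "database" map plus a cache of attached objects. */
    static class ToySession {
        private final Map<Integer, String> database = new HashMap<>();
        private final Map<Integer, Record> cache = new HashMap<>();

        void insert(int id, String name) {
            database.put(id, name);
        }

        String stored(int id) {
            return database.get(id);
        }

        /** Returns the attached object for this ID, creating it from the database if needed. */
        Record load(int id) {
            return cache.computeIfAbsent(id, key -> new Record(key, database.get(key)));
        }

        boolean isAttached(Record record) {
            return cache.get(record.id) == record;
        }

        /** Clearing the cache detaches every previously loaded object. */
        void clearCache() {
            cache.clear();
        }

        /** Reloading returns a NEW attached reference for the same ID. */
        Record reload(Record record) {
            return load(record.id);
        }

        /** Only attached objects have their changes written back. */
        void flush() {
            for (Record record : cache.values()) {
                database.put(record.id, record.name);
            }
        }
    }

    public static void main(String[] args) {
        ToySession session = new ToySession();
        session.insert(1, "old");

        Record record = session.load(1);
        session.clearCache();            // record is now detached
        record.name = "lost";
        session.flush();
        System.out.println(session.stored(1)); // change on the detached object was not saved

        record = session.reload(record); // keep using the new reference from here on
        record.name = "kept";
        session.flush();
        System.out.println(session.stored(1)); // change on the attached object was saved
    }
}
```

In this toy model the first flush leaves the stored value untouched because the modified object is no longer in the cache; only the change made through the reloaded reference reaches the "database".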

...

  1. You put the Context into batch processing mode using the method: 

    Code Block
    boolean originalMode = context.isBatchModeEnabled(); 
    context.enableBatchMode(true);
  2. You keep an eye on the cache size and clear it when it gets too big. Alternatively, you can also commit the context. Either way, you have to reload the entities you still want to work with: 

    Code Block
    Collection targetCollection = ...
    ...
     
    if (context.getCacheSize() > 2000) {
        context.clearCache();
        targetCollection = context.reloadEntity(targetCollection);
    }
  3. When you're finished processing the records, you put the context back into its original mode: 

    Code Block
    context.enableBatchMode(originalMode);
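
The three steps above can be combined into a single loop. The following is a minimal sketch that runs without DSpace: BatchContext is a hypothetical interface that only mirrors the four Context method names used on this page, and FakeContext, touchRecord() and processAll() are invented stand-ins for the real Context and the real record processing. Wrapping the loop in try/finally is a design choice of this sketch, so that the original mode is restored even if processing fails part-way.

```java
public class BatchModeSketch {

    /** Hypothetical stand-in for the Context methods named on this page. */
    interface BatchContext {
        boolean isBatchModeEnabled();
        void enableBatchMode(boolean enabled);
        long getCacheSize();
        void clearCache();
    }

    /** Toy in-memory implementation so the sketch runs without DSpace. */
    static class FakeContext implements BatchContext {
        private boolean batchMode;
        private long cached;
        int clearCount;

        public boolean isBatchModeEnabled() { return batchMode; }
        public void enableBatchMode(boolean enabled) { batchMode = enabled; }
        public long getCacheSize() { return cached; }

        public void clearCache() {
            cached = 0;
            clearCount++;
        }

        /** Stands in for loading or updating one database record. */
        void touchRecord() { cached++; }
    }

    /** Threshold recommended in the text for batch mode. */
    static final long CACHE_LIMIT = 2000;

    static int processAll(FakeContext context, int recordCount) {
        // Step 1: remember the original mode and switch to batch mode.
        boolean originalMode = context.isBatchModeEnabled();
        context.enableBatchMode(true);
        int processed = 0;
        try {
            for (int i = 0; i < recordCount; i++) {
                context.touchRecord();
                processed++;
                // Step 2: clear the cache once it grows past the limit.
                if (context.getCacheSize() > CACHE_LIMIT) {
                    context.clearCache();
                    // In real code, reload any entities that are still needed here.
                }
            }
        } finally {
            // Step 3: restore the original mode, even if processing failed.
            context.enableBatchMode(originalMode);
        }
        return processed;
    }

    public static void main(String[] args) {
        FakeContext context = new FakeContext();
        int processed = processAll(context, 5000);
        System.out.println(processed + " records processed, cache cleared "
                + context.clearCount + " times");
    }
}
```

With real DSpace code, the reload step inside the if block matters: any entity fetched before the cache was cleared has to be reloaded before it is used again.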

...