Date & Time

  • April 11th 15:00 UTC/GMT - 11:00 ET

Dial-in

We will use the international conference call dial-in. Please follow directions below.

  • U.S.A/Canada toll free: 866-740-1260, participant code: 2257295
  • International toll free: http://www.readytalk.com/intl 
    • Use the above link and input 2257295 and the country you are calling from to get your country's toll-free dial in #
    • Once on the call, enter participant code 2257295

Agenda: Community Forum Call: DSpace Performance

Open discussion on DSpace performance challenges, exchanging best practices for analysing and resolving performance problems.
How to involve users & repository managers in adequately reporting performance issues.

Preparing for the call

In preparation for the call, you could do the following:

  • List any performance problems you may have with DSpace, and make clear which version you are using
  • List any specific performance improvements or hacks you have made
  • List any monitoring tools/diagnostics you have experience with

Meeting notes

DSpace 5 vs. DSpace 6 comparison

In the DSpace 6.0 release, the performance-enhancing efforts were not entirely successful. These problems should be fixed in DSpace 6.1, making DSpace 6 generally more performant than DSpace 5.

To test this statement, it would be good to set up two identical server environments, one running DSpace 5 and the other DSpace 6. If both repositories are then populated with exactly the same content, we can make an objective comparison of the performance of DSpace 5 and 6.

Multiple collections issue

In DSpace 6.0 JSPUI, a repository with many communities and collections can run into a performance issue: during the collection selection step of the item submission process, the collection list takes a long time to load. This issue is currently under investigation.

During the call, some related issues were reported. For example, one participant whose repository has many communities and collections saw performance decrease when upgrading to newer DSpace versions. This attendee also noticed performance issues when indexing repositories with many items.

The fact that these issues were not detected during the testing phase of DSpace 6.0 reflects a more general issue with DSpace performance testing. This testing is currently done on the DuraSpace Demo repository (demo.dspace.org). This repository, however, is usually populated with only a limited number of communities, collections, and items. At this point we are not testing DSpace's performance on large repositories. It would be good to set up such a testing environment for future releases.

Monitoring infrastructure for early signs of performance issues

One popular proprietary tool for server monitoring is New Relic. It can detect significant changes in the use of resources and send alerts when this happens. It also lets you know at which time an issue occurs. New Relic is also capable of pinpointing lines of code which may have caused the performance issue.

A low-tech way of running a basic test of your repository's performance is to use the in-browser developer tools included in many modern browsers. In most cases you can access these tools by right-clicking in your browser and selecting an option such as 'inspect' or 'developer tools', which should open a pane at the bottom of your browser window. This pane will likely have a network tab, in which you can monitor the loading times of pages in DSpace while you are testing features. This gives you hard numbers you can use to compare your performance over time.
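If you want to record such numbers outside the browser, the same idea can be scripted. The following is a minimal sketch, not a replacement for the network tab: it times only the download of a single document (no CSS, JavaScript, or rendering), and the names `fetch`, `time_page`, and `summarize` are made up for this illustration.

```python
import time
import urllib.request
from statistics import mean


def fetch(url, timeout=30):
    """Download a page fully so the timing includes the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def time_page(url, runs=5, fetcher=fetch):
    """Return the wall-clock duration in seconds of each of `runs` requests."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fetcher(url)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / mean / max for comparison over time."""
    return {"min": min(durations), "mean": mean(durations), "max": max(durations)}


# Example use (hypothetical URL, replace with a page of your own repository):
#   print(summarize(time_page("https://repository.example.org/xmlui/")))
```

Logging the summary for the same few pages at regular intervals gives a simple baseline against which later measurements can be compared.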

Configuration

There are several configuration settings which may impact your repository's performance.

Apache Tomcat

One Tomcat configuration setting you can use to increase performance is the Crawler Session Manager Valve, which can restrict the number of sessions for a crawler user agent. If bot traffic generates performance issues, limiting the maximum number of sessions for those bots may help.

Database

The standard PostgreSQL settings are not ideal for repositories with a lot of traffic. For these repositories it is better to increase the maximum number of database connections.

During the call, it was also unclear why the default PostgreSQL settings allow an unlimited number of idle connections.

Apache Solr

Solr is memory intensive and runs alongside DSpace in the Tomcat application server. This means it has to share the available memory with DSpace.

As Solr records all DSpace usage events (item page views, bitstream downloads, search queries), its memory usage is related to the usage of the repository. Heavily used repositories may therefore require more memory for Solr.

One way of limiting the memory usage of Solr is to not write any robot traffic to the Solr core.

Load testing

One tool which can be used for load testing is loadimpact.com; the free tier should already suffice for most repositories. Be cautious when using this tool, as increasing the load on your DSpace may eventually lead to a failure.

Another tool used by a call attendee is Apache JMeter (http://jmeter.apache.org/). This tool is free and has the capability of capturing browser settings.
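To illustrate what such tools do at their core, here is a very small sketch that sends a batch of requests concurrently and counts failures. It is a stand-in written for these notes, not a substitute for loadimpact.com or JMeter, and the names `fetch`, `timed_call`, and `run_load` are assumptions of this example. The same caution applies: run it only against a test instance.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=30):
    """Download a page fully; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def timed_call(fetcher, url):
    """Run one request and return (succeeded, duration in seconds)."""
    start = time.perf_counter()
    try:
        fetcher(url)
        succeeded = True
    except Exception:
        succeeded = False
    return succeeded, time.perf_counter() - start


def run_load(urls, concurrency=5, fetcher=fetch):
    """Request all `urls` with at most `concurrency` requests in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda url: timed_call(fetcher, url), urls))
    failures = sum(1 for succeeded, _ in results if not succeeded)
    slowest = max((duration for _, duration in results), default=0.0)
    return {"requests": len(results), "failures": failures, "slowest": slowest}


# Example use (hypothetical URL, test instance only):
#   print(run_load(["https://test-repository.example.org/xmlui/"] * 50, concurrency=10))
```

Raising `concurrency` step by step while watching the failure count and the slowest response gives a rough idea of where a test instance starts to struggle.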

How to contribute solutions back to the community

Codebase fixes can be contributed just like any other code fix. However, there seems to be a need to centralize more information regarding environment-specific optimizations:

  • Tomcat config
  • Postgres config
  • Solr config (mixed, because Solr config does live within the codebase to some extent)
  • Apache HTTPD config (caching?)
  • Operating system config (Linux vs Windows)
  • ...

Call Attendees


24 Comments

  1. Performance problems

    ...

    Improvements / Hacks

    Mod_deflate Apache compression

    Adding `AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css text/javascript application/javascript` to the `/etc/httpd/conf.d/proxy.conf` file enables compression (on Amazon Linux).

    This requires mod_deflate http://httpd.apache.org/docs/current/mod/mod_deflate.html

    Basic: assigning enough RAM to Tomcat

    This is also a reason why you should really run DSpace on a 64-bit operating system these days, as 32-bit operating systems only allow ~4GB to be assigned to a single process.

    How much is enough? Read this

    Run SOLR and/or database on a different machine

    Even though the DSpace webapp is not a distributed architecture, you can use different machines for SOLR and your database.

    Monitoring/Diagnostic tools


  2. Here are some additional notes for the discussion.  I have asked my colleagues if they remember any additional configuration changes that we have made for performance.

    DSpace 6

    How does DSpace 6 performance compare with DSpace 5?  Has the introduction of Hibernate in the DSpace API improved performance?

    DSpace Caching

    • The OAI service caches results.  I have found this problematic when testing.  I have not evaluated the benefit of this feature.
    • In XMLUI, there is an option to cache the collection hierarchy.  Do folks use this feature?

    Other features

    • We disabled the checksum checker process once our repository had more than 100,000 items.  The process was so slow it was never able to complete.
    • In our next release, we plan to increase db.maxconnections.  The default seems artificially low.
    • We have used Fail2ban as a tool to prevent specific IP addresses from flooding the server with high volumes of requests.
    • For audio and video items, we avoid storing the media in DSpace. We store video and audio files in the university's streaming server.  We provide a link to the streaming service in DSpace metadata.  We send these files to a preservation repository.
    1. DB Connection recommendations

      # Maximum number of DB connections in pool
      # (30 by default; you can increase this even further if your DB is configured to have more available)
      db.maxconnections = 70

      # Maximum time to wait before giving up if all connections in pool are busy (milliseconds)
      # (5000 before)
      db.maxwait = 10000

      # Maximum number of idle connections in pool (-1 = unlimited)
      # (-1 before)
      db.maxidle = 20

      Here is the reasoning behind each of these recommendations:

      Increasing Max connections

      In a default configuration, Postgres will allow up to 100 connections. This can be verified by executing "SHOW max_connections;" against your DB.
      If you want or need to keep the Postgres default at 100, we recommend setting maxconnections to 70. This leaves 30 connections as a buffer, so whatever happens, Postgres itself has spare connections and an admin can still log in to the DB.

      The big benefit is that this allows a higher number of DB connections during peak periods.

      Increasing maxwait

      We have seen that 5 seconds is often too low as a cut-off point for a connection request when none of the connections in the pool are available. Therefore we recommend increasing this to 10000 ms so that fewer requests actually hit a timeout.

      Changing maxidle to 20 instead of -1

      The unlimited setting is dangerous because each DSpace JVM process you spin up has its own pool. In the current configuration, with 30 maxconnections and -1 for max idle connections, any DSpace process you start up (including cron jobs) can keep holding on to as many idle connections as maxconnections allows.

      If maxconnections is increased to 70, we would definitely recommend keeping maxidle limited to 20, so that idle connections do not accumulate in a specific DSpace process.
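The arithmetic behind the value of 70 can be written down as a small helper. This is only a sketch of the reasoning above: the function name `pool_size` and the `pools` parameter (for the case where several processes each create their own pool) are assumptions added for illustration.

```python
def pool_size(pg_max_connections=100, reserved=30, pools=1):
    """Connections each pool may use while keeping `reserved` free for Postgres and admins."""
    if pools < 1:
        raise ValueError("pools must be at least 1")
    available = pg_max_connections - reserved
    if available < pools:
        raise ValueError("not enough connections for the requested number of pools")
    # Integer division: every pool gets the same whole number of connections.
    return available // pools


print(pool_size())         # 70: one pool against the Postgres default of 100
print(pool_size(pools=2))  # 35: two pools sharing the same budget
```

If your Postgres `max_connections` is higher than the default, pass it in and the recommended pool size scales accordingly.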

      1. Great hints. I would highly recommend that DCAT pull together some documentation based on various recommendations like these. Performance hints/tips are FAQs, and it'd be nice to have suggestions/tips documented.

        Here are a few places where such docs could be added:

      2. One thing we've tried here, to reduce the total connection consumption, is to supply the connection pool externally via JNDI, by defining it as a global Resource, giving each Context a ResourceLink to it, and naming that link in db.jndi.  In that way, all of the webapps share a single slightly larger pool.  bin/dspace still uses the pool settings found in dspace.cfg (which can be different).

      3. Note: For db.maxconnections, unless you take special care to share a pool across all webapps, the maximum number of connections is per-webapp, and adds up. So, if you're running XMLUI and JSPUI, for example, there's a risk of exhausting postgres' built-in default (100). In the case of running two other webapps that each create a pool based on this value, a safer value would be 35.

        1. To clarify, Chris Wilper, there is a pool defined by default in dspace.cfg:

          db.poolname = dspacepool

          Is this pool not shared by all DSpace web applications?

          1. Alan Orth actually it looks like that's no longer used, as of DSpace 1.7: https://github.com/DSpace/DSpace/commit/ab28ac1c6d06c5f83d7cc7c500371845985dadac#diff-c30b3dabf48ed041d0689c66d588a3f2L1684

            The only way I know of with today's DSpace to get a shared pool across webapps is to use JNDI.

            1. Wow, ok. Perhaps worth amending the default build.properties or whatever its equivalent is in DSpace 6+ to note that this config option is effectively a noop unless you configure one externally with JNDI?

              1. It sounds like we should just remove all mention of db.poolname.  If the pool is supplied out of JNDI, then DSpace just uses the pool that is handed to it and none of the other db.* properties have any effect.

            2. This discussion is fresh in my mind as we've just had some database capacity issues on our DSpace instance this week after the discussion here. I think the default of 30 is fine for small sites, but it might be useful to note somewhere that this is per web app, and to explain the potential impact depending on the system's PostgreSQL max_connections.

              So I'm thinking that a good metric for determining the system's PostgreSQL max_connections would be based on DSpace's db.maxconnections and how many DSpace web applications you're using:

              db.maxconnections * dspace_web_apps + 3 = max_connections

              For example, we're using oai, xmlui, rest, and solr, but it doesn't appear that solr connects to the PostgreSQL database, so assuming I'm using a db.maxconnections of 35 like you suggest above, I would do something like this to determine PostgreSQL's max_connections:

              35 * 3 + 3 = 108

              I think this is a good metric. If a site has more resources, more load, or uses fewer applications, they could adjust the equation accordingly for their environment.
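The formula from this comment, written as a small function for anyone who wants to plug in their own numbers. The function names and the `spare` parameter name are made up for this sketch; the values come from the example above.

```python
def required_max_connections(db_maxconnections, dspace_web_apps, spare=3):
    """PostgreSQL max_connections needed if every web app fills its own pool."""
    return db_maxconnections * dspace_web_apps + spare


def fits(pg_max_connections, db_maxconnections, dspace_web_apps, spare=3):
    """True if the current PostgreSQL limit covers the worst case."""
    needed = required_max_connections(db_maxconnections, dspace_web_apps, spare)
    return needed <= pg_max_connections


print(required_max_connections(35, 3))  # 108, as in the example above
print(fits(100, 35, 3))                 # False: the Postgres default of 100 is too low here
```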

              1. The various webapps may be using different quantities of connections.  So the above formula provides a good rough estimate to start with, but you may be able to trim it, if you have a way to monitor pool usage per-webapp.

                1. Yes, a good place to start. Mostly I'm talking to myself on a public forum so that future travelers can benefit from this knowledge. By the way, I'm not aware of a way to monitor the database connection pool usage per webapp. I have Munin stats which show connections per database user and per database, though (we're steady around 50 connections the past few months, with short spikes from time to time).

  3. A few tips/comments from our experience running DSpace 5.x:

    • We use Munin to monitor Tomcat's JVM heap usage and allocate 512MB over the average usage
    • We follow the PostgreSQL resource consumption guidelines and dedicate 10–25% of system RAM to shared_buffers
    • Enabling the Crawler Session Manager Valve in Tomcat greatly reduced the resource usage caused by search bots, though the default regular expression misses Baidu, so we added it: <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve" crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" />
    • Our development and testing servers all have SSDs and 12GB or more of RAM, but nightly Solr indexing on a repository with 60,000 items takes over two hours and seems linear/single threaded.
    • We run nginx in front of Tomcat for simple + secure TLS termination and serve static files like CSS, JS, and images from the XMLUI root directly from disk with gzip rather than having Tomcat do this
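To check which user agents a crawlerUserAgents expression would catch before deploying it, the pattern can be tried out offline. The sketch below takes the default expression to be the one quoted above without the Baidu part, and assumes that the whole User-Agent header must match the pattern; the sample user agent strings and the names `DEFAULT_PATTERN`, `EXTENDED_PATTERN`, and `is_crawler` are illustrative only.

```python
import re

# Assumed default: the expression quoted above without the Baiduspider alternative.
DEFAULT_PATTERN = r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
EXTENDED_PATTERN = DEFAULT_PATTERN + r"|.*Baiduspider.*"


def is_crawler(user_agent, pattern):
    """True if the whole User-Agent string matches the pattern."""
    return re.fullmatch(pattern, user_agent) is not None


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BAIDU = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
FIREFOX = "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"

for name, agent in [("Googlebot", GOOGLEBOT), ("Baiduspider", BAIDU), ("Firefox", FIREFOX)]:
    print(name, is_crawler(agent, DEFAULT_PATTERN), is_crawler(agent, EXTENDED_PATTERN))
```

Feeding in user agent strings from your own access logs shows quickly whether other heavy crawlers slip past the expression.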
  4. DSpace information

    • DSpace 5.5 running on Ubuntu 14.04 (we are using new VMs now which run Ubuntu 16.04)
    • Re Bram’s recommendation: we currently run Solr and DSpace in the same machine but the database is in a separate one

    Performance problems

    • Max idle connections reached (no new connections available). It seems that we are no longer getting these since applying Atmire’s recommended changes to the DB parameters in DSpace:
      • db.maxconnections = 300
      • db.maxwait = 10000
      • db.maxidle = 50
    • Solr server stops accepting connections (we have to restart DSpace). This is happening quite often now. See below for Atmire’s tips and recommendations, which we will incorporate in our production re-deployment.

    Improvements / Hacks

    Following from Atmire’s recommendations on server configuration we will incorporate the following and test to see if it helps with performance:

    • Increase the Tomcat limit to 12GB to make optimal use of the server memory (production server has 16GB available).
    • Change Tomcat’s configured Garbage Collector strategy (-XX:+UseParallelGC) to "-XX:+UseG1GC".
    • Increase the limit on open files to at least 65536. The current open files limit on the production server is 1024.
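The open files limit can be checked from a script as well as from the shell. The sketch below uses Python's standard `resource` module (available on Unix-like systems only) and reports the limit of the process that runs it, so it would need to be run as the same user, and with the same limits, as Tomcat to be meaningful. The helper name `open_files_ok` is made up for this example.

```python
import resource

RECOMMENDED_NOFILE = 65536  # minimum recommended above


def open_files_ok(soft_limit, minimum=RECOMMENDED_NOFILE):
    """True if the soft limit on open files meets the recommended minimum."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= minimum


# Limits of the current process: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard, "ok:", open_files_ok(soft))
```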
  5. Question: number of loaded classes

    New Relic JVM monitoring tells you how many loaded/unloaded Java classes there are. Comparing a number of different DSpace repositories, I either see a figure around ~22k or around ~36k. I'm not entirely sure what the source of the difference is. Could be more/less webapps deployed. Would be interested to hear from others what webapps are deployed and how many loaded classes you have. 

    Screenshots of loaded classes for different DSpaces

    In general, make sure you only deploy those webapps that you need. Don't need SWORD or REST for example? Don't deploy it.

    Tomcat session length & session persistence

    In web.xml you can set Tomcat's session timeout. We've found that setting this to 0 (no timeout) can negatively affect performance.

    Tomcat session timeout
        <session-config>
            <session-timeout>30</session-timeout>
        </session-config>
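To audit this setting across several deployed webapps, the value can be read from each web.xml with a few lines of script. This is a sketch only: the function names are made up for this example, and it treats a value of 0 or less as "sessions never time out".

```python
import xml.etree.ElementTree as ET


def session_timeout_minutes(web_xml_text):
    """Return the configured session timeout in minutes, or None if not set."""
    root = ET.fromstring(web_xml_text)
    for element in root.iter():
        # Drop a namespace prefix such as "{http://...}session-timeout".
        if element.tag.rsplit("}", 1)[-1] == "session-timeout":
            return int(element.text.strip())
    return None


def never_expires(timeout_minutes):
    """True if the configured value disables the session timeout."""
    return timeout_minutes is not None and timeout_minutes <= 0


sample = """<web-app>
  <session-config>
    <session-timeout>30</session-timeout>
  </session-config>
</web-app>"""

print(session_timeout_minutes(sample))                 # 30
print(never_expires(session_timeout_minutes(sample)))  # False
```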

    Tomcat persists sessions by default by serializing them to disk, so they can survive a restart. We've recently found out that disabling this can positively affect performance.

    http://datum-bits.blogspot.be/2012/05/how-to-disable-session-persistence.html




  6. Our repository admin Suzanne Chase asked me to post the following discussion item:

    When using the built-in item level statistics provided by DSpace, our users have reported large fluctuations in view and download stats across time periods.  One example: in January 2017 an item had 1,100 all-time views, while in April 2017 the same item had only 200 all-time views.   This appears to be a symptom of running the statistics shard process.  We are still investigating the issue.

  7. One of the performance problems mentioned is described by the following ticket: [Jira ticket link unavailable]

  8. Key issue for DSpace 5

    [Jira ticket link unavailable]

  9. As announced during the meeting, I just created [Jira ticket link unavailable]

  10. TODO as followup after the meeting: revise/update the "Performance Tuning DSpace" pages in the official docs:

    Performance Tuning DSpace (DSpace 6)

    Performance Tuning DSpace (DSpace 5)

    Performance Tuning DSpace (DSpace 4)

    JIRA ticket to track this work: [Jira ticket link unavailable]


  11. Tickets for the filter-media related issues that were discussed:

    [Jira ticket link unavailable]

    [Jira ticket link unavailable]


  12. Not sure if this one is still a problem today:

    [Jira ticket link unavailable]

  13. Our JMeter testing plan that I referenced at the end of the call can be found at: https://github.com/Georgetown-University-Libraries/dspace-performance-test

    The plan we provide would need to be customized to your institution's specific URL/handles/bitstreams/etc., but hopefully it provides a useful template for interested folks to get up and going without too much trouble.