Below are the results of performance testing of Fedora-based applications with real-world data.

Plum Ingest

A large book containing 1000 100 MB TIFF images was ingested twice: once with the Fedora 4.5.1 release (based on Modeshape 4) and once with the experimental Modeshape 5 branch (in both cases, Fedora was configured to use the PostgreSQL database object store).  Durations are reported as HH:MM:SS for each batch of 100 images, loaded using Princeton's Hydra head, Plum.

Batch | Duration (Modeshape 4) | Duration (Modeshape 5) | Improvement
------+------------------------+------------------------+------------
  1   | 0:19:19                | 0:13:52                | 28.2%
  2   | 0:27:03                | 0:23:19                | 13.8%
  3   | 0:39:16                | 0:33:41                | 14.2%
  4   | 0:52:13                | 0:43:43                | 16.3%
  5   | 1:06:22                | 0:56:36                | 14.7%
  6   | 1:23:29                | 1:10:46                | 15.2%
  7   | 1:41:26                | 1:26:30                | 14.7%
  8   | 2:02:22                | 1:43:08                | 15.7%
  9   | 3:17:40                | 2:37:31                | 20.3%
 10   | 3:47:48                | 3:10:14                | 16.5%

Retrieving Objects With Many Links to Repository Objects

Objects with a large number of links to repository objects are much slower to retrieve than objects with a large number of literal properties or non-repository URI properties.  For example, an object with 10,000 properties whose objects are literals or non-repository URIs can be retrieved in 200 milliseconds, but an object with 10,000 properties whose objects are repository objects takes 7-36 seconds, depending on the settings, storage backend, etc.

There are also significant differences between LevelDB and PostgreSQL/MySQL backends, with LevelDB being much faster: 7-10 seconds as opposed to 30+ seconds for the object with 10,000 links to repository objects.

Retrieval times, in seconds, for the object with 10,000 links to repository objects:

Version/Branch        | LevelDB | MySQL | PostgreSQL
----------------------+---------+-------+-----------
4.5.0                 | 8       | n/a   | n/a
4.5.1                 | 10      | 43    | 36
master (a58f5a05)     | 7       | 32    | 29
modeshape5 (c177adc8) | n/a     | 89    | 30

See test scripts.

Testing initially focused on:

  • using properties explicitly set on the object, as compared to IndirectContainers
  • debugging the RDF-generation code that produces the IndirectContainer triples
  • running under Tomcat instead of Jetty

However, none of those appear to significantly impact performance.  The problem instead seems to be the process of looking up which node a proxy points to and converting the node reference to a URI.  The process is:

  • List the children of a direct container and load each node.
  • Load the node the proxyFor property points to.
  • Convert the member node to a URI.

Each of these steps is reasonably fast (~1msec).  But as the number of members grows, even 3 msec per member eventually adds up.  For example, a collection with 10,000 members would take 30 seconds.
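
As a rough illustration of this loop (not the actual Fedora code: the "proxyFor" property name and the URI conversion below are simplified stand-ins), the per-member work looks roughly like this:

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: mirrors the three steps above using plain JCR calls.
    // The "proxyFor" property name and the URI conversion are simplified stand-ins
    // for what Fedora actually does.
    public final class MembershipSketch {

        static List<String> memberUris(final Node directContainer) throws RepositoryException {
            final List<String> uris = new ArrayList<>();
            final NodeIterator proxies = directContainer.getNodes();          // step 1: load each proxy node (~1 msec)
            while (proxies.hasNext()) {
                final Node proxy = proxies.nextNode();
                final Node member = proxy.getProperty("proxyFor").getNode();  // step 2: load the member node (~1 msec)
                uris.add(nodeToUri(member));                                  // step 3: convert the node to a URI (~1 msec)
            }
            return uris;  // 10,000 members x ~3 msec per member is roughly 30 seconds
        }

        // Simplified stand-in for mapping an internal JCR path to an external repository URI.
        static String nodeToUri(final Node member) throws RepositoryException {
            return "http://localhost:8080/rest" + member.getPath();
        }
    }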

Some possible options for improving performance include:

  • Caching nodes: this can improve the time to look up the member node and convert it to a URI (see the sketch after this list).
  • Using properties explicitly set on the collection object instead of proxies: this can eliminate the extra node lookup for loading the proxy node.
  • Using Modeshape's internal query functionality: in theory this could be more efficient than iterating over the proxies.  However, it appears that Modeshape uses the database as a document store, and so winds up loading all of the members anyway, with performance very similar to just iterating over all the children.
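
As a minimal sketch of the caching option, a per-request cache keyed by JCR identifier could avoid repeated lookups of the same node (this is only an assumption about how such a cache might look; the actual proof-of-concept branch may be structured differently):

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch: a per-request node cache keyed by JCR identifier.  This only
    // helps when the same node is looked up more than once while building a
    // response; the first lookup of each node still pays the full cost.
    public final class NodeCache {

        private final Session session;
        private final Map<String, Node> nodesById = new HashMap<>();

        public NodeCache(final Session session) {
            this.session = session;
        }

        public Node get(final String identifier) throws RepositoryException {
            Node node = nodesById.get(identifier);
            if (node == null) {
                node = session.getNodeByIdentifier(identifier);  // the expensive lookup
                nodesById.put(identifier, node);
            }
            return node;
        }
    }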

 


44 Comments

  1. Unknown User (acoburn)

    I would be curious to know how FCREPO-1957 could possibly affect performance. The commit referred to (4bf3ecab) has not been merged with master, and even if it had, it wouldn't affect any runtime code (not yet, at least). Perhaps you mean some other commit?

    1. Unknown User (acoburn): you're right, it can't have made any difference in performance.  Now that I look at the PR more closely, I see that it's just interfaces demonstrating how the identifier processing could work.  So any slight difference in performance between that and master is just noise.  I had seen the PR that referenced that branch and thought it was further along replacing IdentifierConverter than it was, and so built it and tested against it, hoping that general cleanup of that code might have improved performance.

        1. That's certainly part of it.  I think there are three steps for converting each IndirectContainer proxy into a membership triple:

          1. Load the proxy node (FedoraResource.getChildren)
          2. Load the member node the proxy points to (ValueConverter.nodeForValue)
          3. Convert the member node into a URI

          Each of these steps is very fast (~1ms).  But when there are many members, 3ms per member adds up quickly.  With 10K members of a collection, for example, it adds up to about 30 seconds.

          Using properties instead of proxies avoids steps #1 and #2, but the overhead of step 3 remains.  Caching nodes can also help somewhat (see my proof-of-concept caching branch).  But caching and using direct links does not get the performance to be on par with using properties to link to non-repository URIs.

          So I think that we need some other angle on this problem to make working with this kind of data perform well: either remodeling the data to avoid a large number of links from one Fedora object to other Fedora objects, or finding a way to process the links as a batch (using query functionality?) instead of one-at-a-time.

          1. Unknown User (acoburn)

            I would argue that any sufficiently large repository that cares about performance must have a caching strategy. I would also argue that as long as Fedora cares about consistency (i.e. referential transparency, LDP membership, LDP containment), such performance concerns will be directly at odds with Fedora's goals as a durable datastore. This is true both for single objects with lots of in-domain references and complex objects (e.g. many HTTP requests).

            Because caching can take many forms (and depends on the needs of downstream applications), I also think one can make a good argument that caching is out of scope for Fedora's core.

            That said, there are many ways to address this, from using a simple caching reverse proxy (e.g. Varnish) to an asynchronously populated document store (e.g. Riak – this is what we are using at Amherst). Then, downstream applications can retrieve arbitrarily large and/or complex objects directly from that cache without needing to touch Fedora. Plus, response times are then measured in milliseconds. In this architecture, Fedora does what it is best at and the high performance cache does what it is best at.

             

            1. I agree that having a cache in front of Fedora is a good approach to improve performance in most situations. My Hydra application uses Solr and an IIIF image server to provide performant read-only views of the data in Fedora, and most users do not interact with Fedora at all.

              However, my curators and metadata analysts need to be able to edit an object and then retrieve an updated view of it in a reasonable time frame.  Any data cached outside of Fedora would be invalidated by that edit, and the users would then have to wait the full time it takes to regenerate the triples.

            2. Consistency and performance are not theoretically opposed, but in practice, choices are going to distinguish them. This is analogous to the way human-readable serializations were pushed out from the core. Transparency and performance have a similar relationship.

          2. This is in part about our commitment to MODE, because the overhead of step 3 is partly due to the fact that a Fedora resource is not retrieved as a URI, but as a whole data structure (the node) with all kinds of attachments and doohickeys and dangly bits. Esmé Cowles, I don't think you are going to like my suggestion, but I think we can just accept that the performance of the community implementation is ultimately limited and leave extension into further realms of scale to other implementations that make other choices for persistence.

            1. I do not like this suggestion either. I believe we can explore caching, re-modeling, and internal performance options.

              Esmé Cowles, is it correct to assume that you want all of the "members" when you perform a GET request on the collection? In other words, would a Prefer option that limits the properties returned be useful?

              1. I don't think retrieving a subset of the membership list would help in our case.  We might be able to use the existing Prefer header to suppress all of the membership triples, for example if we were editing a collection's title and trust the cached membership list to be current.
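
                For illustration, asking the server to omit membership triples with the standard LDP prefer URI might look roughly like the following (the repository URL is a placeholder, and whether this hint is honored depends on the Fedora version and configuration):

                    import java.net.URI;
                    import java.net.http.HttpClient;
                    import java.net.http.HttpRequest;
                    import java.net.http.HttpResponse;

                    // Sketch only: request a collection without its LDP membership triples.
                    // The URL is a placeholder; support for this prefer token depends on the
                    // Fedora version and configuration.
                    public final class PreferOmitMembership {
                        public static void main(final String[] args) throws Exception {
                            final HttpRequest request = HttpRequest.newBuilder()
                                    .uri(URI.create("http://localhost:8080/rest/collections/example"))
                                    .header("Prefer",
                                            "return=representation; omit=\"http://www.w3.org/ns/ldp#PreferMembership\"")
                                    .GET()
                                    .build();
                            final HttpResponse<String> response = HttpClient.newHttpClient()
                                    .send(request, HttpResponse.BodyHandlers.ofString());
                            System.out.println(response.body());
                        }
                    }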

                But to get back to A. Soroka's bigger point: in the long term, I agree that different Fedora implementations will have different strengths and the reference implementation won't necessarily focus on performance.  But in the short-to-medium term, there's only one implementation and so I think we need to address issues like this.

                1. I'm not suggesting not exploring caching or other internal performance choices. Remodeling is not a performance choice. If Fedora requires users to warp and rip apart their notions of intellectual arrangement in the face of shoddy performance, we have definitively failed.

                2. Here's another way to look at it: Channeling my inner Edwin Shin, I ask: what exactly are the performance standards we are trying to reach? Some vague notion of "fast enough" is not a goal. It's a guaranteed failure. We should have in hand some concrete numerical measurement of performance, preferably gathered from a real-world sponsored-by-some-real-site use case, or this whole conversation is a bit academic.

                  1. I completely agree that we need real-world performance goals.  The use case here is:

                    I have a pcdm:Collection with 10,000 member objects.  I want a user to be able to edit the collection object and view an updated representation of the object.  The rest of my application overhead (updating the object in Fedora, indexing the object in Solr, displaying an updated Solr view) is minimal.  So the time to retrieve the updated collection object from Fedora is effectively the time my application takes to respond.  We have committed users, so I estimate < 10s is acceptable and < 5s is good.

                    1. Okay, that's a concrete use case. I have no confidence that the current impl can actually do that, but it's a real target for which to shoot.

                      1. At this point, I'm pretty much in agreement with Unknown User (acoburn). The right move here is to get a caching layer in front of the repository, not to try to make the repository faster. That way (layering) is cheaper, simpler, and more sustainable.

                    2. Unknown User (acoburn)

                      I think it would be fruitful to look at the work Islandora is doing to address this – the front end reads from/writes to the Drupal layer (never directly to or from Fedora). Those writes are asynchronously persisted to Fedora. Conflicts are resolved with Lamport clocks. This certainly adds complexity, but it is a solid (i.e. linearizable) way to handle distributed reads/writes.

                      1. We have talked about creating a REST API, inspired in part by the Islandora sync services, and using that as the primary persistence.  The default implementation would be backed by Fedora, but it would make it much easier to implement alternative persistence options if we didn't have to implement the entire LDP API.

                        1. Any problem in computer science can be solved with another level of indirection.

                          -- Attributed to David Wheeler

                          1. I think it's somewhat important to note that the Hydra project would see this as a nuclear option, however much I support it. It effectively transforms Hydra from "A Repository backed by Fedora" to a "Repository that happens to put stuff in Fedora maybe."

                            If the end of this conversation is that the use case portrayed by Esmé is unfeasible, then that'll be an important point of discussion for Hydra and Fedora implementors in general.

                            1. Certainly it could create a good opportunity for Hydra developers to talk about what they might want out of a Hydra-specific implementation of Fedora. This kind of (creative) tension between the effects of architectural decisions specific to Hydra and the desire to put our scarce development resources to common needs is a great example of why having a common API and multiple implementations is going to be so great!

                3. The good news (or bad, I guess from a certain point of view) is that the MODE/JCR implementation has not really been written with performance in mind, but correctness and (frankly) API workshopping. So if you look over the o.f.k.m.FedoraResourceImpl, there's no MODE query usage (a holdover from JCR-as-API) and a lot of post-processing in the streams (a holdover from not knowing how the repository resources would be modeled against). Since neither of those things are true anymore, there's probably a lot of potential speedup in the various getChildren implementations (Basic/Direct/Indirect as subclasses), and maybe a few in reconsidering some signatures to optimize the only-as-uri use case.

                  1. This is to say that FedoraResource still has a lot of JCR-in-abstract attached to it, and we tried not to use MODE APIs for reasons that only make sense if you want to swap another JCR backend in, which I think seems vanishingly unlikely at this point.

                    1. I was hoping that there would be some performance gains in using the JCR Query API to retrieve the UUIDs of the members instead of iterating over the proxies to get them.  But whether I used a NodeIterator or a RowIterator, Modeshape still retrieves each node from the underlying database, and so the performance is basically the same.  Since the node metadata is stored as a binary blob in PostgreSQL, it's not surprising that Modeshape would have to load each one in order to extract a property from it.
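
                      For reference, the kind of JCR-SQL2 query described above looks roughly like this sketch (illustrative only, not the exact test code; the container path is a placeholder):

                          import javax.jcr.Session;
                          import javax.jcr.query.Query;
                          import javax.jcr.query.QueryManager;
                          import javax.jcr.query.QueryResult;
                          import javax.jcr.query.Row;
                          import javax.jcr.query.RowIterator;

                          // Sketch only: select the identifiers of a container's children with JCR-SQL2.
                          // The container path is a placeholder.  In practice Modeshape still fetches each
                          // node document from the backing store to answer this, so it performs about the
                          // same as iterating over the children directly.
                          public final class MemberIdQuerySketch {
                              static void printMemberIds(final Session session) throws Exception {
                                  final QueryManager qm = session.getWorkspace().getQueryManager();
                                  final Query query = qm.createQuery(
                                          "SELECT member.[jcr:uuid] AS id FROM [nt:base] AS member "
                                          + "WHERE ISCHILDNODE(member, '/collections/example/members')",
                                          Query.JCR_SQL2);
                                  final QueryResult result = query.execute();
                                  for (final RowIterator rows = result.getRows(); rows.hasNext(); ) {
                                      final Row row = rows.nextRow();
                                      System.out.println(row.getValue("id").getString());
                                  }
                              }
                          }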

                      We could probably optimize retrieving the nodes from PostgreSQL by retrieving them in batches, which could reduce the overhead quite a bit.  But the prospect of doing that, and the timeline for getting that into a Modeshape release, make me want to explore other options first.

                      1. Something seems wrong to me here: if all MODE querying is iterative filtering on unmarshalled blobs, there's not a lot of reason for these MODE docs on query planning, etc.

                        1. It seems to me that it's not about finding the nodes. It's about retrieving them. In order to feature a resource X as an object in a triple for another resource Y, we are retrieving the node for that resource X. (When we really only want an identifier for it.)

                      2. This goes to my point that a "Fedora resource is not retrieved as a URI, but as a whole data structure (the node)". Changing that in the MODE implementation would be a massive amount of work. It would also mean beginning to use MODE in a really artificial way.

                        1. I'm not sure that is true: MODE has selectors for a reason, and a query planner for a reason.

                          1. Storing identifiers instead of references is to use MODE as a crummy triplestore. I would much rather spend the time implementing Fedora over a proper triplestore. In fact, I'd very much like to spend some time doing just that.

                      3. I also think that if we're talking about NodeIterator/RowIterator, we're post-processing and already too late. We need to look harder at the querying.

  2. I appreciate all the work (thanks Esme!) and the thinking (thanks all y'all!) that's going on here!  It's been a while, but all of this smells like triplestore performance issues that we had with the UCSD DAMS that ended up with a different approach to storing triples.  We're certainly not LDP compliant though.

    These performance issues are important and a lot of folks are watching to see how Fedora will deal with them.  Thankfully "Fedora" is a community as well as a product and you'll get us there.  We're lucky to have you all!

  3. Esmé Cowles, obviously there are many ways to approach this issue; one that could be constructive immediately is modeling member resources to reference their collection, rather than linking from the collection to its members. Fundamentally, this issue is a variant of the "many children of a single parent" issue.

    Not surprisingly, the following script demonstrates sub-1-second response times when requesting the collection:

    https://gist.github.com/awoods/2f790d6b0e089ef04a352d87dfd7cc3c

    Is remodeling collection membership in this way a possible option from the Hydra perspective?

    1. Reversing the membership relationship would help for collections (and seems like a good way to solve the problem).  However, that won't work for objects with many ordered parts, where the application needs to work with the proxies as a group to order them.  I'm testing objects of various sizes now (200 pages, 300 pages, etc.) and will report back on how well they perform.  The severity of this issue may be much lower if we can link from objects to collections, and objects with 500 or 600 parts perform acceptably.

    2. I have a feeling that is off the table because the properties are container-managed triples.

      1. Is there any reason an object couldn't have a container for managing collections it was a memberOf?  The container could use ore:proxyIn as the ldp:insertedContentRelation?

      2. Benjamin Armintor, can you be more specific regarding "the properties are container-managed triples"? The example we have been looking at deals with pcdm:memberOf.

  4. Hiya - so... what happened with this?  I'm visiting with John at UCSB and he's telling me about similar ingest problems with 10000 objects.  Their workaround so far has been to NOT add the items to a collection on ingest, but then add them later.  Clunky but working...

    1. There are a couple of different approaches being explored:

      • Reversing the membership relationship so objects link to collections.  In the typical case, this scales much better.  I have a branch of CurationConcerns that does this: https://github.com/projecthydra/curation_concerns/tree/member_of, and hope to test whether this really scales to 10K objects in a collection.
      • Investigation of how Fedora handles node references and converting them to URIs, to improve the basic performance.
      • Investigation of caching nodes internally in Fedora, to improve the performance of traversing nodes all the way around.

       

      1. Following up on this: creating 10K objects that link to a single collection with pcdm:memberOf works and performs very well.  Retrieving the collection or each individual object performs very well (20msec).  Indexing the collection membership in the object Solr records was already the approach CurationConcerns was taking, so it continues to perform well.

        1. That is great, Esmé Cowles. Are there any code modifications that are necessary in the Hydra stack to use pcdm:memberOf rather than pcdm:hasMember? Or is this solution one of communicating "best practice"?

          1. There are code changes: I've created branches of HydraPCDM and CurationConcerns called member_of:

            We'll need to decide whether we should keep both hasMember and memberOf collection membership in CurationConcerns, or just replace hasMember with memberOf.  So the CurationConcerns PR isn't ready to merge before some discussion and cleanup.

        2. Just to clarify Esmé Cowles: does the collection retrieval work well with inbound references, or only getting the local triples?

    2. Declan Fleming, I am interested to know what UCSB's situation is exactly. Do you have details? or can you connect John with this conversation?

      1. Hi-

        From my team, the issue we are experiencing is that large collections have so many links between objects that performance slows down during ingest; we know this is a known problem in the Hydra community. As for the current workaround, we're just not adding the items to collections, and DCE is working on a sort of pseudo-collection until the performance issues are dealt with. Mark Bussey can explain in detail.