Naive

If we create the most naive implementation of the IndexBuilder, we might wind up asking for the same RDF chunk many, many times. Faculty belong to the same department, co-authors write the same documents, and so on. This might not be a significant problem. We should try the naive way and measure how much duplication there is, as a percentage of the total load.

Convoluted

When we go through the "Discovery" phase, we presumably get unique URIs. We could keep track of them, so we don't fetch them again when they turn up later, for example as co-authors on each other's documents. Of course, the "Linked Data Expansion" phase can reveal many additional URIs for which we must fetch data, and we would need to keep track of those as well. Ideally, this would work with Hadoop, so if one node fetched RDF for a URI, no other node would fetch RDF for that same URI. This becomes challenging.
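
As a rough sketch of the single-node bookkeeping (the class and method names below are hypothetical, not part of the IndexBuilder), a thread-safe set of already-requested URIs is all it takes; the hard part, as noted, is sharing that knowledge across Hadoop nodes.

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: remember which URIs have already been requested, so the
 * "Discovery" and "Linked Data Expansion" phases don't fetch the same RDF twice.
 * This only helps within a single JVM; it does nothing about other Hadoop nodes.
 */
public class SeenUriTracker {
    private final Set<String> seen =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    /** Returns true if this URI has not been requested before. */
    public boolean markIfNew(String uri) {
        return seen.add(uri);
    }
}

A fetcher would call markIfNew() before each request and skip the fetch when it returns false.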

Caching

The HttpClient can be configured to add a caching layer. See the manual page: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/caching.html
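
A minimal sketch of what that configuration might look like, assuming the 4.x caching module (httpclient-cache) with its CachingHttpClient decorator and default in-memory storage; the request URI is just a placeholder:

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.cache.CacheConfig;
import org.apache.http.impl.client.cache.CachingHttpClient;
import org.apache.http.util.EntityUtils;

public class CachingClientSketch {
    public static void main(String[] args) throws Exception {
        // Default storage is an in-memory map; limit how many entries it keeps.
        CacheConfig cacheConfig = new CacheConfig();
        cacheConfig.setMaxCacheEntries(1000);

        // CachingHttpClient decorates an ordinary HttpClient; responses whose
        // headers permit caching are served from the cache on later requests.
        HttpClient client = new CachingHttpClient(new DefaultHttpClient(), cacheConfig);

        // Placeholder URI. The second execute() should be answered from the
        // cache, provided the server sent cache-friendly headers.
        for (int i = 0; i < 2; i++) {
            HttpResponse response =
                    client.execute(new HttpGet("http://example.com/individual/n1234"));
            EntityUtils.consume(response.getEntity());
        }
    }
}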

...

Again, each Hadoop node would maintain its own cache, so some duplication would undoubtedly persist.

Hadoop-friendly Caching

The HttpClient caching layer allows us to supply our own storage implementation. We only need a mechanism by which an HTTP response can be stored under an arbitrary String key, and can then be retrieved or deleted by that same key: http://hc.apache.org/httpcomponents-client-ga/httpclient-cache/apidocs/org/apache/http/client/cache/HttpCacheStorage.html

If we implemented the HttpCacheStorage to use HDFS (the Hadoop Distributed File System) as its storage area, our cache could be effective across all nodes. Further, it could even persist from one run of the program to the next, although we would expect that many or most of the responses would have expired by then.
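
A sketch of such an implementation, assuming the 4.x caching module's HttpCacheStorage interface and the Hadoop FileSystem API; the cache directory, the MD5-hashed file names, and the use of DefaultHttpCacheEntrySerializer are illustrative choices, not settled decisions:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.math.BigInteger;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.http.client.cache.HttpCacheEntry;
import org.apache.http.client.cache.HttpCacheStorage;
import org.apache.http.client.cache.HttpCacheUpdateCallback;
import org.apache.http.impl.client.cache.DefaultHttpCacheEntrySerializer;

/**
 * Sketch of an HttpCacheStorage that keeps serialized cache entries in HDFS,
 * so every Hadoop node sees the same cache.
 */
public class HdfsCacheStorage implements HttpCacheStorage {
    private final FileSystem fs;
    private final Path cacheDir;
    private final DefaultHttpCacheEntrySerializer serializer =
            new DefaultHttpCacheEntrySerializer();

    public HdfsCacheStorage(Configuration conf, String cacheDirName) throws IOException {
        this.fs = FileSystem.get(conf);
        this.cacheDir = new Path(cacheDirName);
    }

    /** The key is an arbitrary String (usually a URI); hash it into a safe file name. */
    private Path pathFor(String key) throws IOException {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            return new Path(cacheDir, new BigInteger(1, digest).toString(16));
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    public void putEntry(String key, HttpCacheEntry entry) throws IOException {
        OutputStream out = fs.create(pathFor(key), true); // overwrite if present
        try {
            serializer.writeTo(entry, out);
        } finally {
            out.close();
        }
    }

    @Override
    public HttpCacheEntry getEntry(String key) throws IOException {
        Path path = pathFor(key);
        if (!fs.exists(path)) {
            return null;
        }
        InputStream in = fs.open(path);
        try {
            return serializer.readFrom(in);
        } finally {
            in.close();
        }
    }

    @Override
    public void removeEntry(String key) throws IOException {
        fs.delete(pathFor(key), false);
    }

    @Override
    public void updateEntry(String key, HttpCacheUpdateCallback callback) throws IOException {
        // Note: not atomic across nodes; good enough for a sketch.
        putEntry(key, callback.update(getEntry(key)));
    }
}

The storage would then be handed to the CachingHttpClient constructor in place of the default in-memory storage.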

Efficiencies and Penalties

The first thing to determine is whether we are seeing a significant percentage of duplicated requests. Can we reason about this, or do we need to instrument it?
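
If we do need to instrument it, a counter like the following (hypothetical, not part of the IndexBuilder) would be enough to report the duplication percentage at the end of a run:

import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: count every requested URI so we can report what
 * fraction of requests asked for something we had already fetched.
 */
public class DuplicateRequestCounter {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private int total = 0;

    public synchronized void record(String uri) {
        Integer n = counts.get(uri);
        counts.put(uri, (n == null) ? 1 : n + 1);
        total++;
    }

    /** Percentage of requests that were duplicates of an earlier request. */
    public synchronized double duplicatePercentage() {
        int duplicates = total - counts.size();
        return (total == 0) ? 0.0 : 100.0 * duplicates / total;
    }
}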

Memory-based cache:

  • How big should the cache be?
    • We can set limits on the number of responses to save, and on the maximum size of each response (see the sketch after this list).
  • What happens when a response that is too large is retrieved? Is it simply not saved?
  • What happens when we retrieve more responses than the cache can hold? Is eviction FIFO, LIFO, or by some other scheme?
  • Storing things in memory and retrieving them has no significant overhead. However, if we use too much memory, we may impact the application's performance.
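
The limits mentioned in the first bullet correspond to settings on CacheConfig. A minimal sketch, with arbitrary numbers; the setter names are from the 4.1-era caching module and may differ in other versions:

import org.apache.http.impl.client.cache.CacheConfig;

public class CacheLimitsSketch {
    public static CacheConfig memoryCacheConfig() {
        CacheConfig cacheConfig = new CacheConfig();
        cacheConfig.setMaxCacheEntries(5000);          // how many responses to keep
        cacheConfig.setMaxObjectSizeBytes(512 * 1024); // largest response body we will cache
        return cacheConfig;
    }
}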

HDFS-based cache:

  • There are constant sources of overhead that may or may not be significant:
    • Each page we store in the cache will be written to disk, perhaps on a different node, requiring a network transfer.
    • Each time we want to retrieve a page, we must first check whether it is cached, which may also require looking on a different node.

Layered cache:

The caching layer acts as a decorator on HttpClient, which makes it easy to combine a memory-based cache with an HDFS-based cache. When a page is retrieved, it would be stored in both caches, which by itself is no improvement. However, when we check whether a page is in the cache, we check in memory first. If we find the page there, we look no further. If there is significant repetition, this could improve efficiency.
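
One way to stack the two, assuming the decorator constructors of the 4.x caching module and the HdfsCacheStorage sketched earlier (the HDFS path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.cache.CacheConfig;
import org.apache.http.impl.client.cache.CachingHttpClient;

public class LayeredCacheSketch {
    public static HttpClient build() throws Exception {
        // Inner decorator: the shared, HDFS-backed cache.
        CacheConfig hdfsConfig = new CacheConfig();
        HttpClient hdfsCaching = new CachingHttpClient(
                new DefaultHttpClient(),
                new HdfsCacheStorage(new Configuration(), "/tmp/http-cache"), // placeholder path
                hdfsConfig);

        // Outer decorator: a small in-memory cache, consulted first.
        CacheConfig memoryConfig = new CacheConfig();
        memoryConfig.setMaxCacheEntries(1000);
        return new CachingHttpClient(hdfsCaching, memoryConfig);
    }
}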

Is this worth doing?

The memory-based cache is easy to do. Very easy. The HDFS-based cache is more work.

...