Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.
If we create the most naive implementation of the IndexBuilder, we might wind up asking for the same RDF chunk many, many times: faculty belong to the same department, co-authors write the same documents, and so on. This might not be a significant problem. We should try the naive way and see how much duplication there is, as a percentage of the total load.
When we go through the "Discovery" phase, we presumably get unique URIs. We could keep track of them, so we don't fetch them again when, for example, they appear as co-authors on each other's documents. Of course, the "Linked Data Expansion" phase can reveal many additional URIs for which we must fetch data, and we would need to keep track of those as well. Ideally, this would work with Hadoop, so that if one node fetched the RDF for a URI, no other node would fetch that same URI. This becomes challenging.
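Within a single node, the tracking itself is trivial. The sketch below is hypothetical (the class name is ours, not part of any existing code), and it deliberately ignores the hard part, which is sharing the set of fetched URIs across Hadoop nodes.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch: remember which URIs this node has already fetched.
 * Sharing this set across Hadoop nodes is the hard part, and is not
 * addressed here.
 */
public class FetchedUriTracker {
    private final Set<String> fetched = new HashSet<String>();

    /** Returns true if the URI has not been seen before and should be fetched. */
    public synchronized boolean shouldFetch(String uri) {
        return fetched.add(uri);
    }
}
```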
The HttpClient can be configured to add a caching layer. See the manual page: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/caching.html
If we use the standard memory-based cache, we might reduce the duplication in RDF requests. If a single Hadoop node fetched the RDF for a particular URI, it would not fetch it a second time. This would require no logic in our code: our code would make the repeated HTTP request, and the HttpClient would simply find the result already in the cache. There are tuning questions here: how much memory can we devote to caching, and how often will entries be evicted before they are reused? We will also be caching many pages that will never be asked for a second time, but with a memory-based cache there is very little penalty for that.
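As a rough sketch of the configuration, assuming the HttpClient 4.x caching module described in the tutorial above (the cache sizes are placeholder values that would need tuning, and the request URI is invented):

```java
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.cache.CacheConfig;
import org.apache.http.impl.client.cache.CachingHttpClient;

public class CachingClientExample {
    public static void main(String[] args) throws Exception {
        // Tuning knobs: how many entries to hold in memory, and the
        // largest response we are willing to cache. These are guesses.
        CacheConfig config = new CacheConfig();
        config.setMaxCacheEntries(10000);
        config.setMaxObjectSizeBytes(512 * 1024);

        // The caching layer decorates an ordinary HttpClient; calling
        // code is unchanged, and repeated requests are served from memory.
        HttpClient client = new CachingHttpClient(new DefaultHttpClient(), config);

        HttpResponse response =
                client.execute(new HttpGet("http://example.edu/individual/n1234"));
        System.out.println(response.getStatusLine());
    }
}
```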
Again, each Hadoop node would maintain its own cache, so some duplication would undoubtedly persist.
The HttpClient caching layer also allows us to supply our own storage implementation. We only need a mechanism by which an HTTP response can be stored, using an arbitrary String as a key, and can then be retrieved or deleted by that same key: http://hc.apache.org/httpcomponents-client-ga/httpclient-cache/apidocs/org/apache/http/client/cache/HttpCacheStorage.html
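For reference, the interface at that link amounts to four methods:

```java
import java.io.IOException;

import org.apache.http.client.cache.HttpCacheEntry;
import org.apache.http.client.cache.HttpCacheUpdateCallback;
import org.apache.http.client.cache.HttpCacheUpdateException;

public interface HttpCacheStorage {
    void putEntry(String key, HttpCacheEntry entry) throws IOException;
    HttpCacheEntry getEntry(String key) throws IOException;
    void removeEntry(String key) throws IOException;
    void updateEntry(String key, HttpCacheUpdateCallback callback)
            throws IOException, HttpCacheUpdateException;
}
```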
If we implemented the HttpCacheStorage to use HDFS (the Hadoop Distributed File System) as its storage area, our cache could be effective across all nodes. Further, it could even persist from one run of the program to the next, although we would expect that many or most of the responses would have expired by then.
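A minimal sketch of what that might look like, assuming the Hadoop FileSystem API and HttpClient's DefaultHttpCacheEntrySerializer. The class name is ours, keys are digested into file names for simplicity, and updateEntry is not atomic across nodes, which a real implementation would have to address:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.http.client.cache.HttpCacheEntry;
import org.apache.http.client.cache.HttpCacheStorage;
import org.apache.http.client.cache.HttpCacheUpdateCallback;
import org.apache.http.impl.client.cache.DefaultHttpCacheEntrySerializer;

/**
 * Hypothetical sketch: an HttpCacheStorage backed by HDFS, so that all
 * Hadoop nodes share one cache.
 */
public class HdfsCacheStorage implements HttpCacheStorage {
    private final FileSystem fs;
    private final Path cacheDir;
    private final DefaultHttpCacheEntrySerializer serializer =
            new DefaultHttpCacheEntrySerializer();

    public HdfsCacheStorage(Configuration conf, Path cacheDir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.cacheDir = cacheDir;
    }

    /** Digest the key so that arbitrary URIs become legal HDFS file names. */
    private Path pathFor(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(key.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return new Path(cacheDir, hex.toString());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void putEntry(String key, HttpCacheEntry entry) throws IOException {
        OutputStream out = fs.create(pathFor(key), true); // overwrite if present
        try {
            serializer.writeTo(entry, out);
        } finally {
            out.close();
        }
    }

    public HttpCacheEntry getEntry(String key) throws IOException {
        Path path = pathFor(key);
        if (!fs.exists(path)) {
            return null;
        }
        InputStream in = fs.open(path);
        try {
            return serializer.readFrom(in);
        } finally {
            in.close();
        }
    }

    public void removeEntry(String key) throws IOException {
        fs.delete(pathFor(key), false);
    }

    public void updateEntry(String key, HttpCacheUpdateCallback callback)
            throws IOException {
        // Not atomic across nodes; adequate only as a sketch.
        putEntry(key, callback.update(getEntry(key)));
    }
}
```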
The first thing to determine: are we seeing a significant percentage of duplicated requests? Can we reason about this, or do we need to instrument it?
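If we do need to instrument it, a simple counter would answer the question. This is a hypothetical sketch, not existing IndexBuilder code:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical instrumentation: count how often each URI is requested,
 * so we can report duplication as a percentage of the total load.
 */
public class RequestCounter {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private int total = 0;

    public synchronized void record(String uri) {
        Integer n = counts.get(uri);
        counts.put(uri, (n == null) ? 1 : n + 1);
        total++;
    }

    /** Fraction of requests that were repeats of an earlier request. */
    public synchronized double duplicationRate() {
        if (total == 0) {
            return 0.0;
        }
        return (total - counts.size()) / (double) total;
    }
}
```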
The caching layer acts as a decorator on HttpClient, making it easy to combine a memory-based cache with an HDFS-based cache. When a page is retrieved from the network, it would be stored in both caches, which by itself is no improvement. However, when we check to see whether a page is in the cache, we first check in memory; if we find the page there, we look no further. If we have significant repetition, this could improve efficiency.
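Continuing the sketch, the combination is just two applications of the decorator, with the memory layer outermost so that it is consulted first (HdfsCacheStorage is the hypothetical class sketched above):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.cache.CacheConfig;
import org.apache.http.impl.client.cache.CachingHttpClient;

public class LayeredCacheExample {
    public static HttpClient buildClient(Configuration conf, Path cacheDir)
            throws IOException {
        // Inner layer: the shared, HDFS-backed cache.
        HttpClient hdfsCached = new CachingHttpClient(
                new DefaultHttpClient(),
                new HdfsCacheStorage(conf, cacheDir),
                new CacheConfig());

        // Outer layer: this node's memory-based cache, checked first.
        return new CachingHttpClient(hdfsCached, new CacheConfig());
    }
}
```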
The memory-based cache is easy to do. Very easy. The HDFS-based cache is more work.
The big question is, what do we save? If we have very little repetition, we might actually see a loss of performance by introducing a cache, or by using a cache that was improperly tuned.