Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.
If we create the most naive implementation of the IndexBuilder, we might wind up asking for the same RDF chunk many, many times: faculty belong to the same department, co-authors write the same documents, and so on. This might not be a significant problem. We should try the naive way and see how much duplication there is, as a percentage of the total load.
When we go through the "Discovery" phase, we presumably get unique URIs. We could keep track of them, so we don't fetch them again when, for example, they appear as co-authors on each other's documents. Of course, the "Linked Data Expansion" phase can reveal many additional URIs for which we must fetch data, and we would need to keep track of those as well. Ideally, this would work with Hadoop, so that if one node fetched the RDF for a URI, no other node would fetch that same URI. This becomes challenging.
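Within a single node, the tracking itself is trivial. The sketch below is hypothetical (the class name is ours, not part of any existing code), and it deliberately ignores the hard part, which is sharing the set of fetched URIs across Hadoop nodes.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch: remember which URIs this node has already fetched.
 * Sharing this set across Hadoop nodes is the hard part, and is not
 * addressed here.
 */
public class FetchedUriTracker {
    private final Set<String> fetched = new HashSet<String>();

    /** Returns true if the URI has not been seen before and should be fetched. */
    public synchronized boolean shouldFetch(String uri) {
        return fetched.add(uri);
    }
}
```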
The HttpClient can be configured to add a caching layer. See the manual page: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/caching.html
If we use the standard memory-based cache, we might reduce the duplication in RDF requests. If a single Hadoop node fetched the RDF for a particular URI, it would not fetch it a second time. This would require no logic in our code: our code would make the repeated HTTP request, and the HttpClient would simply find the result already in the cache. There are tuning questions here: how much memory can we devote to caching, and how often will entries be evicted before they are reused? We will also be caching many pages that will never be asked for a second time, but with a memory-based cache there is very little penalty for that.
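As a rough sketch of the configuration, assuming the HttpClient 4.x caching module described in the tutorial above (the cache sizes are placeholder values that would need tuning, and the request URI is invented):

```java
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.cache.CacheConfig;
import org.apache.http.impl.client.cache.CachingHttpClient;

public class CachingClientExample {
    public static void main(String[] args) throws Exception {
        // Tuning knobs: how many entries to hold in memory, and the
        // largest response we are willing to cache. These are guesses.
        CacheConfig config = new CacheConfig();
        config.setMaxCacheEntries(10000);
        config.setMaxObjectSizeBytes(512 * 1024);

        // The caching layer decorates an ordinary HttpClient; calling
        // code is unchanged, and repeated requests are served from memory.
        HttpClient client = new CachingHttpClient(new DefaultHttpClient(), config);

        HttpResponse response =
                client.execute(new HttpGet("http://example.edu/individual/n1234"));
        System.out.println(response.getStatusLine());
    }
}
```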
Again, each Hadoop node would maintain its own cache, so some duplication would undoubtedly persist.
The HttpClient caching layer also allows us to supply our own storage implementation. We only need a mechanism by which an HTTP response can be stored, using an arbitrary String as a key, and can then be retrieved or deleted by that same key: http://hc.apache.org/httpcomponents-client-ga/httpclient-cache/apidocs/org/apache/http/client/cache/HttpCacheStorage.html
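For reference, the interface at that link amounts to four methods:

```java
import java.io.IOException;

import org.apache.http.client.cache.HttpCacheEntry;
import org.apache.http.client.cache.HttpCacheUpdateCallback;
import org.apache.http.client.cache.HttpCacheUpdateException;

public interface HttpCacheStorage {
    void putEntry(String key, HttpCacheEntry entry) throws IOException;
    HttpCacheEntry getEntry(String key) throws IOException;
    void removeEntry(String key) throws IOException;
    void updateEntry(String key, HttpCacheUpdateCallback callback)
            throws IOException, HttpCacheUpdateException;
}
```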
If we implemented the HttpCacheStorage to use HDFS (the Hadoop Distributed File System) as its storage area, our cache could be effective across all nodes. Further, it could even persist from one run of the program to the next, although we would expect that many or most of the responses would have expired by then.
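A minimal sketch of what that might look like, assuming the Hadoop FileSystem API and HttpClient's DefaultHttpCacheEntrySerializer. The class name is ours, keys are digested into file names for simplicity, and updateEntry is not atomic across nodes, which a real implementation would have to address:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.http.client.cache.HttpCacheEntry;
import org.apache.http.client.cache.HttpCacheStorage;
import org.apache.http.client.cache.HttpCacheUpdateCallback;
import org.apache.http.impl.client.cache.DefaultHttpCacheEntrySerializer;

/**
 * Hypothetical sketch: an HttpCacheStorage backed by HDFS, so that all
 * Hadoop nodes share one cache.
 */
public class HdfsCacheStorage implements HttpCacheStorage {
    private final FileSystem fs;
    private final Path cacheDir;
    private final DefaultHttpCacheEntrySerializer serializer =
            new DefaultHttpCacheEntrySerializer();

    public HdfsCacheStorage(Configuration conf, Path cacheDir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.cacheDir = cacheDir;
    }

    /** Digest the key so that arbitrary URIs become legal HDFS file names. */
    private Path pathFor(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(key.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return new Path(cacheDir, hex.toString());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void putEntry(String key, HttpCacheEntry entry) throws IOException {
        OutputStream out = fs.create(pathFor(key), true); // overwrite if present
        try {
            serializer.writeTo(entry, out);
        } finally {
            out.close();
        }
    }

    public HttpCacheEntry getEntry(String key) throws IOException {
        Path path = pathFor(key);
        if (!fs.exists(path)) {
            return null;
        }
        InputStream in = fs.open(path);
        try {
            return serializer.readFrom(in);
        } finally {
            in.close();
        }
    }

    public void removeEntry(String key) throws IOException {
        fs.delete(pathFor(key), false);
    }

    public void updateEntry(String key, HttpCacheUpdateCallback callback)
            throws IOException {
        // Not atomic across nodes; adequate only as a sketch.
        putEntry(key, callback.update(getEntry(key)));
    }
}
```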
The first thing to determine: are we seeing a significant percentage of duplicated requests? Can we reason about this, or do we need to instrument it?
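If we do need to instrument it, a simple counter would answer the question. This is a hypothetical sketch, not existing IndexBuilder code:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical instrumentation: count how often each URI is requested,
 * so we can report duplication as a percentage of the total load.
 */
public class RequestCounter {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private int total = 0;

    public synchronized void record(String uri) {
        Integer n = counts.get(uri);
        counts.put(uri, (n == null) ? 1 : n + 1);
        total++;
    }

    /** Fraction of requests that were repeats of an earlier request. */
    public synchronized double duplicationRate() {
        if (total == 0) {
            return 0.0;
        }
        return (total - counts.size()) / (double) total;
    }
}
```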
The caching layer acts as a decorator on HttpClient, making it easy to combine a memory-based cache with an HDFS-based cache. When a page is retrieved from the network, it would be stored in both caches, which by itself is no improvement. However, when we check to see whether a page is in the cache, we first check in memory; if we find the page there, we look no further. If we have significant repetition, this could improve efficiency.
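Continuing the sketch, the combination is just two applications of the decorator, with the memory layer outermost so that it is consulted first (HdfsCacheStorage is the hypothetical class sketched above):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.client.cache.CacheConfig;
import org.apache.http.impl.client.cache.CachingHttpClient;

public class LayeredCacheExample {
    public static HttpClient buildClient(Configuration conf, Path cacheDir)
            throws IOException {
        // Inner layer: the shared, HDFS-backed cache.
        HttpClient hdfsCached = new CachingHttpClient(
                new DefaultHttpClient(),
                new HdfsCacheStorage(conf, cacheDir),
                new CacheConfig());

        // Outer layer: this node's memory-based cache, checked first.
        return new CachingHttpClient(hdfsCached, new CacheConfig());
    }
}
```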
The memory-based cache is easy to do. Very easy. The HDFS-based cache is more work.
The big question is, what do we save? If we have very little repetition, we might actually see a loss of performance by introducing a cache, or by using a cache that was improperly tuned.