
This is a work in progress / discussion starter. It is not - currently, at least - intended to provide a comprehensive view of tuning databases. Instead, it is presented as a set of observations and techniques that implementing sites may wish to consider.

Introduction

Search the web, and you'll find a number of resources comparing the use of SPARQL on various triple stores - e.g. BigOWLIM, Virtuoso, Jena TDB and even Jena SDB - using standard tests such as the Berlin SPARQL Benchmark. These usually show SDB (on MySQL) as being the slowest, and even the Jena project states that SDB is "deprecated", recommending TDB instead (though this may be more a reflection of committer activity and interest than of actual technical capability).

However, this may belie what is happening under the surface. For instance, whilst Virtuoso is often shown as being about twice as fast for querying (and a lot faster at loading) than TDB, it actually has a quad table layout with hashed columns, much like SDB, at its core. So what makes it so much faster than SDB - or, for that matter, what makes TDB faster than SDB?
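As a rough illustration - a simplified sketch, not SDB's or Virtuoso's actual DDL - the hash-based quad layout at the heart of such a store looks something like this in SQL:

    -- Simplified sketch of a hash-based quad layout (illustrative only):
    -- each RDF node is hashed to an 8-byte integer, quads are stored as
    -- four hash columns, and a separate table maps hashes back to the
    -- lexical forms of the nodes.
    CREATE TABLE Nodes (
        hash BIGINT NOT NULL PRIMARY KEY,
        lex  TEXT   NOT NULL
    );

    CREATE TABLE Quads (
        g BIGINT NOT NULL,   -- graph
        s BIGINT NOT NULL,   -- subject
        p BIGINT NOT NULL,   -- predicate
        o BIGINT NOT NULL,   -- object
        PRIMARY KEY (g, s, p, o)
    );

Everything a query touches is then joins over fixed-width integer columns, which is exactly the kind of workload a database engine can be aggressively optimised for.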

One advantage of "dedicated" triple stores (or rather, engines that support quads / triples explicitly) is that they can be optimised by default for that specific layout and the typical queries run against it. SDB, by contrast, relies on external, generic SQL stores, which won't - out of the box - be tuned for hosting little more than one large quad table.
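To make "not tuned out of the box" concrete, these are the kinds of my.cnf settings that ship with modest defaults but matter enormously for a single large quad table. The values below are purely illustrative placeholders, not recommendations:

    [mysqld]
    # The InnoDB buffer pool defaults to a small fraction of available RAM;
    # a quad-table workload benefits from caching as much data as possible
    innodb_buffer_pool_size = 8G
    # Relaxing the flush-per-commit behaviour can speed up bulk loads
    innodb_flush_log_at_trx_commit = 2
    # Larger redo logs reduce checkpoint churn during heavy writes
    innodb_log_file_size = 512M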

Software and Hardware Used

All of the work presented here was conducted (at the time of writing) on a 2015 MacBook Pro: 2.5GHz i7, 16GB RAM, and exceedingly fast 500GB SSD storage.

MySQL 5.6.26 Community Server, as downloaded from http://dev.mysql.com/ was used as the SQL back end.

Dataset

A "realistic" VIVO-ISF triple dataset was loaded, consisting of over 24 million triples (asserted and inferenced combined), 140,000 people, and 155,000 publications.

Out of the Box

Simply using the out-of-the-box configuration of MySQL, performance on this dataset was largely encouraging. With VIVO 1.8.1, the majority of profile pages rendered in under 2 seconds, and even the worst cases - profiles with approximately 1,500 publications - took under 5 seconds. It's likely that much of this can be attributed to the SSD storage being well equipped to support IO-bound tasks.

However, even with reduced overheads and better querying in VIVO 1.8.1, there were still some cases where - on the default settings - pages would take much longer to render. For example:

Large profile Co-Author network: 20 seconds

Map of Science / Temporal Graph cache warming: 2 minutes

Compared to VIVO 1.8, these represent reasonable performances - the same co-author network in 1.8 took 10 minutes to fully render the Flash page, and the Map of Science / Temporal Graph visualization models took too long to build to actually measure.

But even so, what actually is going on under the hood?

In Depth: Co-Author Network Visualization

This is one area that received a couple of benefits in VIVO 1.8.1:

1) RDFService was used directly to execute the SPARQL, rather than running the query against a service-backed Dataset proxy (one query vs. many find operations)

2) The single large SELECT was split into a CONSTRUCT followed by a SELECT over the constructed model.

CONSTRUCT vs SELECT

It has been suggested - for example, in the context of the list view configurations - that using a CONSTRUCT rather than a direct SELECT is an SDB-specific optimisation where OPTIONAL clauses are involved. That may only be partially true. Certainly - and very much so in an out-of-the-box configuration - SDB can struggle with SELECTs that other engines handle more easily.

However, contrary to speculation on the triple stores' support forums, CONSTRUCT statements with large numbers of UNIONs are not particularly bad for the alternative triple stores either. Testing the VIVO 1.8.1 authorship queries directly against Virtuoso's SPARQL endpoint for a large profile gave results of 1271ms for the CONSTRUCT and 1396ms for the SELECT - so even in Virtuoso it is faster to generate the small temporary model than it is to execute the equivalent SELECT directly.

True, there may not be much in it for a triple store that can optimise the SELECT more aggressively, but such a store isn't really any worse off. Logically, this shouldn't be surprising: it should be easier for any triple store to recognise and optimise the use of the same basic graph pattern multiple times in the same query than it is to determine the selectivity of two BGPs and order their execution efficiently.

So, whilst the introduction of a CONSTRUCTed model in the Co-Author network may be particularly helpful for SDB, it's not specifically an SDB optimisation - it should provide the best, and most predictable, performance across the widest range of triple stores.

 

This leaves the expensive part of the Co-Authorship Network Visualisation being the CONSTRUCT query that generates the temporary model (the SELECT on the resulting in-memory model takes only a couple of hundred milliseconds).

The SPARQL query that is being executed takes the form:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX core: <http://vivoweb.org/ontology/core#>
PREFIX local: <http://localhost/>
CONSTRUCT
{
    <http://localhost/individual/author> local:authorLabel ?authorLabel .
    <http://localhost/individual/author> local:authorOf ?document .
    ?document local:publicationDate ?publicationDate .
    ?document local:coAuthor ?coAuthorPerson .
    ?coAuthorPerson rdfs:label ?coAuthorPersonLabel .
}
WHERE
{
    {
        <http://localhost/individual/author> rdf:type foaf:Person ;
                rdfs:label ?authorLabel ;
                core:relatedBy ?authorshipNode .
        ?authorshipNode rdf:type core:Authorship ;
                core:relates ?document .
        ?document rdf:type <http://purl.obolibrary.org/obo/IAO_0000030> ;
                core:relatedBy ?coAuthorshipNode .
        ?coAuthorshipNode rdf:type core:Authorship ;
                core:relates ?coAuthorPerson .
        ?coAuthorPerson rdf:type foaf:Person ;
                rdfs:label ?coAuthorPersonLabel .
    }
    UNION
    {
        <http://localhost/individual/author> rdf:type foaf:Person ;
                rdfs:label ?authorLabel ;
                core:relatedBy ?authorshipNode .
        ?authorshipNode rdf:type core:Authorship ;
                core:relates ?document .
        ?document core:dateTimeValue ?dateTimeValue .
        ?dateTimeValue core:dateTime ?publicationDate .
    }
}
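For comparison, the second phase - the SELECT executed against the small CONSTRUCTed model - can then be as simple as the following (an illustrative sketch of the pattern, not VIVO's exact query):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX local: <http://localhost/>
    SELECT ?authorLabel ?document ?publicationDate ?coAuthorPerson ?coAuthorPersonLabel
    WHERE
    {
        <http://localhost/individual/author> local:authorLabel ?authorLabel ;
                local:authorOf ?document .
        ?document local:coAuthor ?coAuthorPerson .
        ?coAuthorPerson rdfs:label ?coAuthorPersonLabel .
        OPTIONAL { ?document local:publicationDate ?publicationDate }
    }

Because every triple in the temporary model was minted by the CONSTRUCT template, this query needs no UNIONs and no type checks.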

The quirk of "local"

Notably, the above query uses a "local" prefix. In VIVO, most CONSTRUCT statements are written to generate models that match the underlying source ontologies, such that the SELECT queries executed on those models could also just as easily run against the source triple store directly.

If you know that you will always be executing against a CONSTRUCTed temporary model - and when the CONSTRUCT is usually the best case anyway, you might as well commit to doing it - then there is no need to replicate the original model in the reduced set. Instead, you can collapse certain graph patterns into more descriptive "invented" predicates, reducing the model size, reducing the complexity, and increasing the specificity of the SELECT that follows.
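The query above does exactly this with publication dates - the two-triple path through an intermediate date-time value node in the source data is collapsed into a single invented predicate in the constructed model:

    # Source ontology: two triples via an intermediate node
    ?document core:dateTimeValue ?dateTimeValue .
    ?dateTimeValue core:dateTime ?publicationDate .

    # Constructed model: one "invented" triple
    ?document local:publicationDate ?publicationDate .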

 

 
