Page History

...

Code Block

language	text

join_buffer_size = 32M
read_rdrnd_buffer_size = 32M
innodb_buffer_pool_size = 512M1536M
innodb_file_per_table = 1
innodb_file_format = barracuda

...

Whilst this does not affect the initial execution of the SQL query (20 seconds), it does make subsequent executions for the same profile faster - approximately 14 seconds.

It's not known what size buffers should really be used for any given database size, or how well this translates into overall performance, e.g. how much can be cached to the benefit of the entire application. Of course, each individual site would need to look at the resources available on their own systems, and what they can dedicate to a given purpose.

Even Deeper Explanation

So far, we've just scratched the surface of MySQL, and put in a basic tweak to the configuration, but we haven't really looked in depth at what the query is actually doing.

Like other database engines, MySQL provides execution plans for queries, so lets go ahead and ask MySQL to give us the execution plan - by adding EXPLAIN to the start of the query:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	extra
1	PRIMARY	<derived2>	ALL					852
1	PRIMARY	R_1	eq_ref	PRIMARY	PRIMARY	8	S_1.X_4	1
1	PRIMARY	R_2	eq_ref	PRIMARY	PRIMARY	8	S_1.X_3	1
1	PRIMARY	R_3	eq_ref	PRIMARY	PRIMARY	8	S_1.X_2	1
1	PRIMARY	R_4	eq_ref	PRIMARY	PRIMARY	8	S_1.X_6	1
1	PRIMARY	R_5	eq_ref	PRIMARY	PRIMARY	8	S_1.X_1	1
1	PRIMARY	R_6	eq_ref	PRIMARY	PRIMARY	8	S_1.X_5	1
2	DERIVED	Q_1	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	const,const,const	1	Using where; Using index; Using temporary
2	DERIVED	Q_2	ref	SubjPredObj,PredObjSubj	SubjPredObj	16	const,const	1	Using where; Using index
2	DERIVED	Q_3	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	const,const	852	Using where; Using index
2	DERIVED	Q_4	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_3.o,const,const	1	Using where; Using index
2	DERIVED	Q_5	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	Q_3.o,const	1	Using where; Using index
2	DERIVED	Q_6	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_5.o,const,const	1	Using where; Using index
2	DERIVED	Q_7	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	Q_5.o,const	1	Using index
2	DERIVED	Q_8	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_7.o,const,const	1	Using where; Using index; Distinct
2	DERIVED	Q_9	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	Q_7.o,const	1	Using where; Using index; Distinct
2	DERIVED	Q_10	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_9.o,const,const	1	Using where; Using index; Distinct
2	DERIVED	Q_11	ref	SubjPredObj,PredObjSubj	SubjPredObj	16	Q_9.o,const	1	Using index; Distinct

Info

title	Unused Tweaks

Reading the documentation on MySQL, there is in 5.6 a feature called "batched key access", which on the surface should help with large joins. However, when tested on this query, it appeared to make no difference

It may be that innodb_buffer_pool_size is the main reason for the performance boost - having a sufficiently large buffer pool means that InnoDB can generate in-memory adaptive hash indexes. Hash indexes (which are only available in memory) are far more efficient for the types of joins that SDB is doing, which are all based on hash values anyway.

Info

There are a number of different philosophies for sizing the innodb_buffer_pool_size. On a dedicated server, MySQL recommends 70% of the available memory. If you are hosting Tomcat on the same server, then maybe that should be no more that 70% of the system memory minus the max Tomcat heap size.

Actual requirements depends on data shape and usage patterns - for the SDB layout, after accessing a number of different pages in VIVO and checking the free buffers, a good setting seems to be around 1MB for every 16,000 triples - e.g. 1.5GB for a 25million triple store. If you have the resources, increasing the buffer size doesn't hurt performance - in which case allocating at least 100MB per 1million triples would be a good idea.

The InnoDB buffer pool can be pre-loaded at startup, which would help the initial query performance times - see: https://dev.mysql.com/doc/refman/5.7/en/innodb-preload-buffer-pool.html

It's not known what size buffers should really be used for any given database size, or how well this translates into overall performance, e.g. how much can be cached to the benefit of the entire application. Of course, each individual site would need to look at the resources available on their own systems, and what they can dedicate to a given purpose.

Even Deeper Explanation

So far, we've just scratched the surface of MySQL, and put in a basic tweak to the configuration, but we haven't really looked in depth at what the query is actually doing.

Like other database engines, MySQL provides execution plans for queries, so lets go ahead and ask MySQL to give us the execution plan - by adding EXPLAIN to the start of the query.

Doing so, gives the following output:

id	select_type	table	type	possible_keys	key	key_len	ref	rows	extra
1	PRIMARY	<derived2>	ALL					852
1	PRIMARY	R_1	eq_ref	PRIMARY	PRIMARY	8	S_1.X_4	1
1	PRIMARY	R_2	eq_ref	PRIMARY	PRIMARY	8	S_1.X_3	1
1	PRIMARY	R_3	eq_ref	PRIMARY	PRIMARY	8	S_1.X_2	1
1	PRIMARY	R_4	eq_ref	PRIMARY	PRIMARY	8	S_1.X_6	1
1	PRIMARY	R_5	eq_ref	PRIMARY	PRIMARY	8	S_1.X_1	1
1	PRIMARY	R_6	eq_ref	PRIMARY	PRIMARY	8	S_1.X_5	1
2	DERIVED	Q_1	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	const,const,const	1	Using where; Using index; Using temporary
2	DERIVED	Q_2	ref	SubjPredObj,PredObjSubj	SubjPredObj	16	const,const	1	Using where; Using index
2	DERIVED	Q_3	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	const,const	852	Using where; Using index
2	DERIVED	Q_4	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_3.o,const,const	1	Using where; Using index
2	DERIVED	Q_5	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	Q_3.o,const	1	Using where; Using index
2	DERIVED	Q_6	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_5.o,const,const	1	Using where; Using index
2	DERIVED	Q_7	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	Q_5.o,const	1	Using index
2	DERIVED	Q_8	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_7.o,const,const	1	Using where; Using index; Distinct
2	DERIVED	Q_9	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	16	Q_7.o,const	1	Using where; Using index; Distinct
2	DERIVED	Q_10	ref	SubjPredObj,PredObjSubj,ObjSubjPred	SubjPredObj	24	Q_9.o,const,const	1	Using where; Using index; Distinct
2	DERIVED	Q_11	ref	SubjPredObj,PredObjSubj	SubjPredObj	16	Q_9.o,const	1	Using index; Distinct

So, what does this tell us about the query? Well, first of all it looks like there is very good index use, so the most obvious thing you look for - index usage - is already covered.

But look carefully at the 8 row - the entry for table Q_1 - what is that it says at the end? "Using temporary".

Under certain circumstances, MySQL can use a temporary table to answer DISTINCT queries, and when we look back at the original SQL, there is a SELECT DISTINCT on Q_1. Now, as an experiment, if we remove the DISTINCT, we actually get the same results from the query - but it returns in a little over 6 seconds. SIX SECONDS! This query originally took 20...

But we can't just remove the DISTINCT - because that's part of the SQL generated from SDB, and we would need to go in and modify SDB. Hmmm. So what can we do about the temporary table?

It's not clear what is going into the temporary table, and therefore how big it is. But knowing that it is select 4 hashes per row, each hash is 8 bytes, and there are 200,000 rows, that's at least 9M. Possibly up to 67M (11 tables, 4 hashes per table, 200,000 rows).

The amount of memory allocated by default to temporary tables is 16M, after which it creates temporary tables on disk. The temporary table above being written to disk is certainly going to kill performance!

There are two settings in my.cnf that need to be changed to alter the available temporary table size:

Code Block

language	text

max_heap_table_size=256M
tmp_table_size=256M

It's important to note that this is not a recommendation for a temporary table size, we're just testing the hypothesis - we may be able to get away with smaller. Alternatively, your system may need larger.

But for now, we've put in a big number, which should be sufficient for the query above. So how long does the query take?

SIX SECONDS.

From cold. Repeat the query with warmed caches, and it dips under 5 seconds. Even with the DISTINCT present. For a query that took 20 seconds originally.

Info

title	Unused Tweaks

Reading the documentation on MySQL, there is in 5.6 a feature called "batched key access", which on the surface should help with large joins. However, when tested on this query, it appeared to make no difference - there was no change in either execution plan (it would say Using join buffer (Batched Key Access)) or in the execution time.

Even with a large query with OPTIONALs (e.g. the listview-authorInAuthorship select [view the Author in Authorship Query as SQL]), MySQL can't be provoked into using batched key access on a join. It does however start doing batched nested loops. This isn't hideous - under 4 seconds on a 1700+ publication profile - but not quite as fast as Virtuoso (2 seconds).

Despite the SDB layout using hashes for keys, and all joins based on equivalence, the architecture of MySQL is quite heavily leveraged on BTREE indexes, rather than hash indexes. MariaDB (a MySQL fork) adds batched hash joins to those offered by MySQL - although when MariaDB was provoked into using them, observed performance was awful.

MariaDB uses Aria engine for temporary tables by default, in those cases setting the aria_pagecache_buffer_size can make a profound difference. For the authorship query, execution time dropped as low as 2.5 seconds for repeated executions.

Returning to Map of Science

Without an in depth look at the queries being executed by the Map of Science / Temporal Graph, what difference has the new MySQL settings made?

Originally, "out of the box", the cached model took a little over 2 minutes to build. With the new settings in place - and no other changes - it takes just 25 seconds.

Conclusion

Despite the bad press, it is evident that you can make a LOT of difference to the performance of SDB by taking time to look at optimisations that can be made to the SQL engine. That's the trade-off of using a general purpose SQL core instead of an engine that already knows what data structures and queries it is trying to optimise for. But there is a lot of scope for improving performance simply by tuning the core, without affecting the application or the SPARQL queries that it is using.

Space shortcuts

Page tree

Versions Compared

Old Version 3

New Version Current

Key

Even Deeper Explanation

Even Deeper Explanation

Returning to Map of Science

Conclusion