Poor Performance when using Spatial Indexes in MySQL
Asked Answered
U

4

9

I'm trying a little experiment at pushing a data set which is not geo-spatial but fits it quite well and am finding the results somewhat unsettling. The data set is genomic data e.g. the Human Genome where we have a region of DNA where elements like genes occupy specific start and stop coordinates (our X axis). We have multiple regions of DNA (chromosomes) which occupy the Y axis. The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate e.g. LineString(START 1, END 2).

The theory seemed sound so I pushed it into an existing MySQL based genome project and came up with a table structure like:

CREATE TABLE `spatial_feature` (
  `spatial_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `external_id` int(10) unsigned NOT NULL,
  `external_type` int(3) unsigned NOT NULL,
  `location` geometry NOT NULL,
  PRIMARY KEY (`spatial_feature_id`),
  SPATIAL KEY `sf_location_idx` (`location`)
) ENGINE=MyISAM;

external_id represents the identifier of the entity we have encoded into this table & external_type encodes the source of this. Everything looked good and I pushed in some preliminary data (30,000 rows) which seemed to work well. When this increased past the 3 million row mark MySQL refused to use the spatial index and was slower when it was forced to use it (40 seconds vs. 5 seconds using a full table scan). When more data was added the index started to be used but the performance penalty persisted. Forcing the index off brought the query down to 8 seconds. The query I'm using looks like:

select count(*)
from spatial_feature
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);

The data going into this is be very dense along the Y dimensions (think of it like you've recorded the position of every building, telephone box, post box and pigeon on a very long road). I've done tests of how R-Indexes behave with this data in Java as well as others in the field have applied them to flat-file formats with success. However no one has applied them to databases AFAIK which is the goal of this test.

Has anyone out there seen a similar behaviour when adding large quantities of data to a spatial model which is not very disparate along a particular axis? The problem persists if I reverse the coordinate usage. I'm running the following setup if that's a cause

  • MacOS 10.6.6
  • MySQL 5.1.46

Help!

Also bringing in explain plan in

+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
| id | select_type | table           | type | possible_keys   | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | spatial_feature | ALL  | sf_location_idx | NULL | NULL    | NULL | 3636060 |    33.33 | Using where |
+----+-------------+-----------------+------+-----------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

The re-written SQL looks like the following

select count(0) AS `count(*)` from `arabidopsis_thaliana_core_7_60_9`.`spatial_feature` where intersects(geometryfromtext('LineString(7420023 1, 7420023 1)'),`arabidopsis_thaliana_core_7_60_9`.`spatial_feature`.`location`)

Still not highlighting why this query's performance is so poor

After reading the article posted by @Fraser from rickonrails it seems like the problem is all to do with the index not being in memory. If I apply similar techniques to those mentioned in the article (making key buffer very big indeed) and I then force the query to use the index query times plumet. We still see a lag between querying a region & then searching for a subset of the region but it's all pointing to getting the load of the indexes correct.

What's the moral of the story? R-Indexes in MySQL have quite poor performance until they are in memory and then they have excellent performance. Not really a good solution for what I wanted to do wit them but still it provides an interesting angle on MySQL.

Thanks for all the help people.

Urceolate answered 21/1, 2011 at 11:15 Comment(7)
you might get an answer at gis.stackexchange.comBrenda
Cheers for the info will do a post on there as wellUrceolate
Can you post results of this query: EXPLAIN EXTENDED select count(*) from spatial_feature where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location); This would show how MySQL is executing it. That might highlight the bottleneck.Riella
Extending the question to give the results of the explain plan. The plan still does not highlight any real reason why the performance is so poor.Urceolate
Read this fella - rickonrails.wordpress.com/2009/03/30/…Inelegance
So it looks like the issue is with the way the index is structured on disk & that MySQL has been unable to load it all into memory. I'll try a full load into memory & will post back the results.Urceolate
Can also take a look at dev.mysql.com/doc/refman/5.0/en/using-spatial-indexes.htmlBerhley
E
1

From the EXPLAIN planwe see that although the spatial might be used for the query ('possible_keys' column) , it is not used (NULL in 'key' column). I am not sure why it is not selected automatically, but you may explicitly instruct MySql to use the index by specifying it in the query using a 'force index' clause:

select count(*)
from spatial_feature 
force index (sf_location_idx) --  <== this is the 'force index' clause
where MBRIntersects(GeomFromText('LineString(7420023 1, 7420023 1)'), location);
Emprise answered 13/5, 2014 at 9:39 Comment(2)
BTW, my mysql version is 5.5Emprise
This is the answer I was thinking too. I know in older versions of MySQL that index hints didn't always do what I expected - not sure if that fault was in the implementation or my expectations, but it would be interesting to see if this works for the OP.Polymerize
G
0

The goal is to bring back all the items which intersect two X coordinates along a single Y coordinate

Have you considered using an index wit multiple fields? Like:

CREATE INDEX spacial_search ON spatial_feature(y, x)

If you are working with a limited set of y values this is the way to.

Gies answered 7/4, 2011 at 8:19 Comment(0)
I
0

I have a degree in Genetics and I am a programmer, you don't need be using an X and a Y as your nomenclature it'll get far too faffy... you need a start and a stop position (not an "axis") and a chromosome number. You index by the chromosome number first then the position and then you index the position then the chromosome. (Question: Are you dealing in eukaryotes or chromosomes that can have two reading frames?)

EG: (where "x" = position and "y" = chromosome)

CREATE INDEX spatial_index_1 ON spatial_feature(chromosome, position);
CREATE INDEX spatial_index_2 ON spatial_feature(position, chromosome);

Incidentally Chromosomes are very long strings (just like data) you can (to speed things up dump them as such as blobs (i.e. coding genes and junk DNA)

Inappreciable answered 19/6, 2012 at 18:31 Comment(4)
I was attempting to simplify the issue a bit more to try to reach a much large audience in the hope that it is something we bioinformaticians have gotten wrong. It seems the issue isn't in creating a spatial index which allows for this kind of searching (I've applied the same technique in programming memory to great success). More is it possible to use MySQL's spatial extensions. It seems even though they can outperform any kind of indexing for this type of data (binning, linear index or max feature size) the index must be in memory. That's a deal breaker if you've got 2300 DBs on 1 serverUrceolate
MySQL is compiled with certain "defaults" if you are brave and feel capable of going down this rabbit hole there are the forge.mysql guys.. you can compile mysql do obey certain mechanics.. it's hard core but it might be worth it forge.mysql.com/wiki/Top10SQLPerformanceTipsInappreciable
there is a thing called ZFS there is a paper of getting ZFS to run in MySQL forge.mysql.com/w/images/9/9e/MySQL_and_ZFS.pdf it's basically the dogs b********Inappreciable
Thanks for the links. The ZFS+MySQL looks really good (not surprising considering it's Sun originally). We're also investigating lots of other things just to see if the overall performance of MySQL can be improved but if a new indexing strategy can be applied that's faster than the current strategy then that's a game changer in my book :)Urceolate
C
0

Are you sure a relational database is the way to go? If I were you I'd look at pushing your datasets to Solr or Elastic Search (probably storing the master datasets elsewhere). These engines are built for indexing, you will notice the difference in response times.

Corundum answered 11/11, 2014 at 18:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.