Why is the same SQLite query being 30 times slower when fetching only twice as many results?
Asked Answered
C

3

32

I have been working on speeding up a query I'm using for about a week now and asked several questions about it here ( How can I speed up fetching the results after running an sqlite query?, Is it normal that sqlite.fetchall() is so slow?, How to use min() and max() in an efficient way?).

With the very useful help from the answers given there I managed to get the time down to the sqlite query taking 100.95 seconds and fetchall taking: 1485.43. This was still not enough, so after trying out some different indexes I managed to get the query time down to 0.08 seconds for one sample and the fetchall time down to 54.97 seconds. So I thought I finally managed to speed things up enough.

Then the query runs for the next sample, taking 0.58 seconds, and the fetchall taking 3952.80 seconds. For the third sample the query took 1.01 seconds and took 1970.67 seconds to fetchall.

The first sample fetched 12951 rows, the second sample 24972 rows and the third 6470 rows. I'm very curious why the first sample was so much faster to fetch the rows, when it had only about half the amount to fetch as the second example.


Code (spectrumFeature_inputValues is (1,), (2,) and (3,), from the 3 samples used.):

self.cursor.execute('begin')
self.cursor.execute("EXPLAIN QUERY PLAN "+
                    "SELECT precursor_id, feature_table_id "+
                    "FROM `MSMS_precursor` "+
                    "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
                    "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
                    "AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
                    "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'EXPLAIN QUERY PLAN: '
print self.cursor.fetchall()
import time
time0 = time.time()
self.cursor.execute("SELECT precursor_id, feature_table_id "+
                    "FROM `MSMS_precursor` "+
                    "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
                    "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time BETWEEN feature.rtMin AND feature.rtMax "+
                    "AND MSMS_precursor.ion_mz BETWEEN feature.mzMin AND feature.mzMax "+
                    "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)
print 'query took:',time.time()-time0,'seconds'
time0 = time.time()
precursorFeatureIds = self.cursor.fetchall()
print 'it fetched:',len(precursorFeatureIds),'rows'
print 'fetchall took',time.time()-time0,'seconds'
time0 = time.time()
for precursorAndFeatureID in precursorFeatureIds:
    feature_has_MSMS_precursor_inputValues = (precursorAndFeatureID[0], precursorAndFeatureID[1])
    self.cursor.execute("INSERT INTO `feature_has_MSMS_precursor` VALUES(?,?)", feature_has_MSMS_precursor_inputValues)
print 'inserting took',time.time()-time0,'seconds'
self.connection.commit()

and the results:

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0754859447479 seconds
it fetched: 12951 rows
fetchall took 54.2855291367 seconds
inserting took 0.602859973907 seconds
It took 54.9704811573 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.579694032669 seconds
it fetched: 24972 rows
fetchall took 3950.08093309 seconds
inserting took 2.11575508118 seconds
 It took 3952.80745602 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 1.01185703278 seconds
it fetched: 6470 rows
fetchall took 1970.622962 seconds
inserting took 0.673867940903 seconds
It took 1972.31343699 seconds

SQLite create statements:

-- -----------------------------------------------------
-- Table `feature`
-- -----------------------------------------------------
CREATE  TABLE IF NOT EXISTS `feature` (
  `feature_table_id` INT PRIMARY KEY NOT NULL ,
  `feature_id` VARCHAR(40) NOT NULL ,
  `intensity` DOUBLE NOT NULL ,
  `overallquality` DOUBLE NOT NULL ,
  `charge` INT NOT NULL ,
  `content` VARCHAR(45) NOT NULL ,
  `intensity_cutoff` DOUBLE NOT NULL,
  `mzMin` DOUBLE NULL ,
  `mzMax` DOUBLE NULL ,
  `rtMin` DOUBLE NULL ,
  `rtMax` DOUBLE NULL ,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_feature_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_mzMin_feature` ON `feature` (`mzMin` ASC); 
  CREATE INDEX `fk_mzMax_feature` ON `feature` (`mzMax` ASC); 
  CREATE INDEX `fk_rtMin_feature` ON `feature` (`rtMin` ASC); 
  CREATE INDEX `fk_rtMax_feature` ON `feature` (`rtMax` ASC);

DROP TABLE IF EXISTS `spectrum`;
-- -----------------------------------------------------
-- Table `spectrum`
-- -----------------------------------------------------
CREATE  TABLE IF NOT EXISTS `spectrum` (
  `spectrum_id` INT PRIMARY KEY NOT NULL ,
  `spectrum_index` INT NOT NULL ,
  `ms_level` INT NOT NULL ,
  `base_peak_mz` DOUBLE NOT NULL ,
  `base_peak_intensity` DOUBLE NOT NULL ,
  `total_ion_current` DOUBLE NOT NULL ,
  `lowest_observes_mz` DOUBLE NOT NULL ,
  `highest_observed_mz` DOUBLE NOT NULL ,
  `scan_start_time` DOUBLE NOT NULL ,
  `ion_injection_time` DOUBLE,
  `binary_data_mz` BLOB NOT NULL,
  `binary_data_rt` BLOB NOT NULL,
  `msrun_msrun_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_msrun1`
    FOREIGN KEY (`msrun_msrun_id` )
    REFERENCES `msrun` (`msrun_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

CREATE INDEX `fk_spectrum_spectrum_id_1` ON  `spectrum` (`spectrum_id` ASC);
CREATE INDEX `fk_spectrum_scahn_start_time_1` ON  `spectrum` (`scan_start_time` ASC);

DROP TABLE IF EXISTS `feature_has_MSMS_precursor`;
-- -----------------------------------------------------
-- Table `spectrum_has_feature`
-- -----------------------------------------------------
CREATE  TABLE IF NOT EXISTS `feature_has_MSMS_precursor` (
  `MSMS_precursor_precursor_id` INT NOT NULL ,
  `feature_feature_table_id` INT NOT NULL ,
  CONSTRAINT `fk_spectrum_has_feature_spectrum1`
    FOREIGN KEY (`MSMS_precursor_precursor_id` )
    REFERENCES `MSMS_precursor` (`precursor_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION,
  CONSTRAINT `fk_spectrum_has_feature_feature1`
    FOREIGN KEY (`feature_feature_table_id` )
    REFERENCES `feature` (`feature_table_id` )
    ON DELETE NO ACTION
    ON UPDATE NO ACTION);

  CREATE INDEX `fk_feature_has_MSMS_precursor_feature1` ON `feature_has_MSMS_precursor` (`feature_feature_table_id` ASC);
  CREATE INDEX `fk_feature_has_MSMS_precursor_precursor1` ON `feature_has_MSMS_precursor` (`MSMS_precursor_precursor_id` ASC);

As you can see I have made indexes out of the mz and rt values in both spectrum and feature, because I figured that most time is spent comparing those numbers together.

So why is the first sample so much faster than the second and third? And how does the query time relate to the fetchall time? Most importantly, is there a way I can speed this up?


Update 1:

After talking to a collegaue it's probably because comparing a point to a 2d dimension (the rtMin, rtMax, mzMin, mzMax) will take n^2 time. This roughly corresponds to the second fetchall taking a bit more than 60^2 seconds (aproximate time the first fetchall took) and it retrieved a little less than twice the amount of rows. This doesn't answer any of my questions though.


Update 2:

I tried using R*tree as advised in the comments. I made a new table:

CREATE VIRTUAL TABLE convexhull_edges USING rtree(
   feature_feature_table_id,             
   rtMin, rtMax,      
   mzMin, mzMax,       
); 

and change my query to:

self.cursor.execute("SELECT precursor_id, feature_table_id "+
                    "FROM `MSMS_precursor` "+
                    "INNER JOIN `spectrum` ON spectrum.spectrum_id = MSMS_precursor.spectrum_spectrum_id "+
                    "INNER JOIN `feature` ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "INNER JOIN `convexhull_edges` ON convexhull_edges.feature_feature_table_id = feature.feature_table_id "
                    "WHERE spectrum.scan_start_time BETWEEN convexhull_edges.rtMin AND convexhull_edges.rtMax "+
                    "AND MSMS_precursor.ion_mz BETWEEN convexhull_edges.mzMin AND convexhull_edges.mzMax "+
                    "AND feature.msrun_msrun_id = ?", spectrumFeature_InputValues)

This gave the following results:

EXPLAIN QUERY PLAN: 
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0572800636292 seconds
it fetched: 13140 rows
fetchall took 34.4445540905 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.819370031357 seconds
it fetched: 25402 rows
fetchall took 3625.72873998 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 3, u'SCAN TABLE convexhull_edges VIRTUAL TABLE INDEX 2: (~0 rows)'), (0, 1, 2, u'SEARCH TABLE feature USING INDEX sqlite_autoindex_feature_1 (feature_table_id=?) (~1 rows)'), (0, 2, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 3, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.878498077393 seconds
it fetched: 6761 rows
fetchall took 1419.34246588 seconds
inserting took 0.340960025787 seconds
It took 1420.56637716 seconds

So a bit faster than my previous way, but still not fast enough. Next I'm going to try web_bod's solution.


Update 3

Using web_bod's solution I got the following times:

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.0521960258484 seconds
it fetched: 13052 rows
fetchall took 90.5810132027 seconds

EXPLAIN QUERY PLAN: 
[(0, 0, 2, u'SCAN TABLE feature (~100000 rows)'), (0, 1, 1, u'SEARCH TABLE spectrum  USING INDEX fk_spectrum_scahn_start_time_1 (scan_start_time>? AND scan_start_time<?) (~3125 rows)'), (0, 2, 0, u'SEARCH TABLE MSMS_precursor USING INDEX fk_MSMS_precursor_spectrum_spectrum_id_1 (spectrum_spectrum_id=?) (~5 rows)')]
query took: 0.278959989548 seconds
it fetched: 25195 rows
fetchall took 4310.6012361 seconds

The third one sadly didn't finish because of a reboot. So this is a bit faster than my first solution, but slower than using R*Tree


Update 4

Working on a different query which was going incredibly slow I saw that it was going into an uninterrupted sleep (see this question). So I checked top while running this query and it's switching between R and D state, lowering the CPU usage from 100 to 50%. This could be why it's running so slow with all the solutions provided.


Update 5

I migrated to MySQL but I'm getting the same results.

Cohort answered 10/5, 2012 at 10:15 Comment(9)
It's quite possible that these numbers are dependent upon the environmental factors of your machine, ie; available memory, hard drive speed, size of the .sqlite database, etc. Posting these values may help here.Dilapidated
I'll post that tomorrow. However, these timings that seem strange to me are on the same machine.Cohort
Not sure if this helps you, but have you looked into using an R*Tree index? They're designed for efficient range queriesPegg
That looks very promising, I'll try it out somewhere in the next days. Thanks!Cohort
Moving the strict equality constraints to the INNER JOIN's ON clause might help. I had a similar problem with PostgreSQL, and after a lot of time mucking around with it, it wasn't resolvable without a complete schema redesign. The multiple inequalities in your WHERE clause, running against a non-uniform set of data essentially make this a non-deterministic problem. For us the only solution was implementing a statistics-driven "guessing" logic to reduce the size of data hitting the inequality heap scan (inequalities will usually go through a full heap scan, by the way, index or not).Dante
Would your data conceivably fit in memory? Because then you could convert the tables to lists of tuples and use list comprehensions for the joins.Smallwood
is there some reason to think that fetchall runs the query? you give separate times for query and fetching. in all cases query is fast enough. so why are people answering with improved queries? if fetchall is just data retrieval (for data already selected) then surely it is a question of disk ordering, memory caches, etc etc? (but i do not know how sqlite works, so perhaps there is some reason for people focusing on queries - can someone clarify?)Plainspoken
No it's not and it's part of my question why there is such difference in fetchall. If I would select 1000000 rows from a random table with SELECT * FROM random_table the fetchall would be a lot faster than selecting the 12000 rows now. Sadly I didn't keep complete track of everythin I tried, but there's been instances where the fetchall was a lot faster than the query. Anyway, the fact that the query only takes about a second while it has to compare a lot of rows indicates that the speed of the fetchall is definetly dependent on the query.Cohort
@Pegg can you put the R*Tree index in an answer? So far yours has been the fastest (although not fast enough).Cohort
W
8

The execution time geometrically proportional to the number of rows in each table rather than arithmetically e.g.

3 tables with 10 rows each => 1,000 comparision

3 tables with 10, 10 and 40 rows => 4,000 comparisons

3 tables with 20 rows each => 8,000 comparisons

You could probably re-factor the query to avoid some of the joins/cursors - when do you need an answer?

Could you do something like this:

SELECT precursor_id, feature_table_id 
FROM MSMS_precursor
INNER JOIN 

    (
        SELECT mzMin, mzMax, rtMin, rtMax, spectrum_id, feature_table_id, msrun_msrun_id

        FROM spectrum
        INNER JOIN 

           (select feature_table_id, mzMin, mzMax, rtMin, rtMax, msrun_msrun_id
            from feature
            where feature.msrun_msrun_id = 'value'
           ) subquery 

        ON subquery.msrun_msrun_id = spectrum.msrun_msrun_id
        WHERE 
            spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax 
    ) subquery

    ON subquery.spectrum_id = MSMS_precursor.spectrum_spectrum_id 

WHERE 
    MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax 

Using a subquery enables you to reduce the number of comparisons between the tables - you can quickly filter out the unwanted features, then the un-related spectra before searching for suitable precursors.

I don't use SQLLite - but the principle should still apply.

UPDATED : fixed bug in SQL

Notes:

You don't have to worry about the ANDs, you'll only get:

  • features where feature.msrun_msrun_id = 'value'
  • spectra for those features and where spectrum.scan_start_time BETWEEN subquery.rtMin AND subquery.rtMax
  • precursors for those spectrs and where MSMS_precursor.ion_mz BETWEEN subquery.mzMin AND subquery.mzMax

UPDATE 18/May:

It's the indexing!!! you have indexes on the search fields, but not on the fields participating in the joins - foreign key indices really boost performance:

CREATE INDEX `fk_msrun_msrun_id_feature` ON `feature` (`msrun_msrun_id` ASC); 
CREATE INDEX `fk_spectrum_spectrum_id_feature` ON `feature` (`msrun_msrun_id` ASC); 
CREATE INDEX `fk_spectrum_spectrum_id_MSMS_precursor` ON `MSMS_precursor` (`spectrum_spectrum_id` ASC); 
Whip answered 16/5, 2012 at 10:49 Comment(5)
I'm having difficulty getting this to work. The mzMin, mzMax, rtMin and rtMax are all in the feature table, so I don't see how I can subquery on rtMin/rtMax or mzMin/mzMax. Since there is an AND between the two, if one fails it wouldn't check the other anyway, right?Cohort
I've fixed the select statements - added notesWhip
Did you try indexing the tables around the foreign keys?Whip
I can't try it till monday sadly.Cohort
Server was down so I couldn't test it, but the extra indexes did not help (they actually made it a little bit slower)Cohort
P
3

I suggest you try using an R*Tree index, They're designed for efficient range queries.


I haven't actually used R*Tree much, just read the documentation, but I think you might be using it incorrectly. You may want to try changing your query to use

WHERE convexhull_edges.rtMin <= spectrum.scan_start_time AND convexhull_edges.rtMax >= spectrum.scan_start_time AND
convexhull_edges.mzMin <= MSMS_precursor.ion_mz AND convexhull_edges.mzMax >= MSMS_precursor.ion_mz

which should be equivalent to your current query, but I think should be faster (You should be picking a range out of the R*Tree, rather than comparing the point to the range)

Pegg answered 18/5, 2012 at 21:0 Comment(0)
C
1

Consider using covering indices on the tables involved in your query.

You are indeed fetching a limited amount of columns in your select statement and the corresponding inner join and where clauses. By using a covering index with the columns well ordered in it, you should get a very fast query, that is you will remove the scan table in favor of an search table using a covering index.

Try to use those indices on your tables:

CREATE INDEX `fk_covering_feature` ON `feature` (`msrun_msrun_id`,`mzMin`,`mzMax`,`rtMin`,`rtMax`,`feature_table_id`);
CREATE INDEX `fk_covering_spectrum` ON  `spectrum` (`msrun_msrun_id`,`scan_start_time`,`spectrum_id`);
CREATE INDEX `fk_covering_MSMS_precursor` ON  `MSMS_precursor` (`spectrum_spectrum_id`,`ion_mz`,`precursor_id`);

When going for speed, you should also hint the query planner to understand msrun_msrun_id is a constant to check for both feature and spectrum tables. Add the constant test in your query by putting this additional test at the end of the query (and pass spectrumFeature_InputValues twice):

"AND spectrum.msrun_msrun_id = ?"
Cyclostyle answered 19/5, 2012 at 8:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.