BugZilla, best (fastest) way to get bug status (etc) on a given date

Asked 28/7, 2013 at 5:46 Answered 12/9, 2013 at 21:33

I'm working on an application that takes data from various sources and generates reports. Currently I'm changing it to do reports based on a the data from a given date in history, previously it only showed data from the state of things today.

One of my data sources is Bugzilla, so I need to get the Bugzilla data for a given date in history. I have a read-only connection to the Bugzilla database but no easy way to do anything else to the server (like install plug-ins, or put procedures in the database). Also the connection between the report server and the Bugzilla server is slow, so I'd like to do the calculations on the server rather than fetch the data and work things out on the reports server.

I actually have this working at a mostly acceptable speed, but I'm not sure if I'm doing it the best or 'right' way, I'm concerned that the speed might cease to be acceptable as we add more issues to the database.

So, my solution is below -- how would you do it.

For a bit of background, Bugzilla stores the current state of all bugs in a table (called 'bugs') and a history of the changes to each field in a table ('bugs_activity') that looks something like this:

fieldid   INTEGER,   -- References the fielddefs table
bug_when  TIMESTAMP, -- Time the change happend
added     TEXT,      -- New text for the field
removed   TEXT,      -- Old text for the field

The Bugzilla database is MySQL. I think the right way to do it is either with a stored procedure or a temporary table, but I don't have either option available to me. I know there are also reporting tools for Bugzilla but I don't have access to install them, also the reports I'm generating tie up data from other sources (and have specific formatting).

There's a local PostgreSQL database on the reports server so I could just periodically mirror all the data over to there, but I really don't want to do that as it seems a bit wasteful to store identical data in two places.

My solution is to build a table in a subselect that looks like the normal bugs table (for the data I'm interested in for a given report) and then use this select as the source to the normal select that works the same as query for the reports based on today's data.

SELECT bug_status, bug_id, op_sys, resolution, rep_platform   
  FROM (SELECT bug_id, 
        IFNULL((SELECT removed FROM bugs_activity a, fielddefs f  
                 WHERE a.fieldid = f.id   
                   AND bug_id = b.bug_id AND f.name = 'bug_status' 
                   AND bug_when >= '2012-01-01 00:00:00'  
                 ORDER BY bug_when DESC LIMIT 1), bug_status) AS bug_status,
    -- Repeat IFNULL clause for op_sys, resolution and rep_platform
        FROM bugs b 
        WHERE b.creation_ts <= '2012-01-01 00:00:00' ) bug_subselect
        -- Some other filters to reduce the bugs (specific product, ect)
      )
    -- More filters based on the new values that have been derived
     ;

Then I use that as an input to a select that counts the different statuses, etc.

This query turns out to be way too slow, I'm assuming because it's getting the entire results for the inner selects so it can order then and give me the top one.

I did try doing it by LEFT JOINing the bugs_activity table onto the bugs table several times and then doing the IFNULL queries on the results, that was fast but a little complex to maintain in the generation code so adapted it to this:

SELECT bug_status, bug_id, op_sys, resolution, rep_platform   
  FROM (SELECT bug_id,
    IFNULL((SELECT removed FROM bugs_activity a, fielddefs f 
             WHERE a.fieldid = f.id AND bug_id = b.bug_id AND f.name = 'bug_status'
               AND bug_when = (
                     SELECT MIN(bug_when) FROM bugs_activity a, fielddefs f 
                      WHERE a.fieldid = f.id 
                            AND bug_id = b.bug_id 
                            AND f.name = 'bug_status'
                        AND bug_when >= '2012-01-01 00:00:00' 
                          LIMIT 1 
                        )
             LIMIT 1), bug_status) AS bug_status,
        -- Repeat IFNULL clause for op_sys, resolution and rep_platform
        FROM bugs b 
        WHERE b.creation_ts <= '2012-01-01 00:00:00' ) bug_subselect
        -- Some other filters to reduce the bugs (specific product, ect)
      )
    -- More filters based on the new values that have been derived
      ;

You need both LIMIT 1's in there (I think) as some fields have managed to have two changes on the same timestamp (either a database glitch, maybe from an upgrade, or two users editing the same bug -- I'm not sure, I just know that it's in there and I need to deal with it).

This runs in around 3 seconds with no filters to reduce the bug list (which is the worst case and will almost never happen), and it runs faster with filters. The LEFT JOIN version runs in roughly the same speed (slightly slower) so I went with the one above. It's OK for the moment, but I can see it getting slow in the future -- I'll add a loading message in the GUI and there's already a message saying these reports can take longer to generate, I'm just wondering if I'm missing some trick to make it faster.

Transpolar answered 28/7, 2013 at 5:46 Comment(8)

" I'm just wondering if I'm missing some trick to make it faster" - yep. They are called indexes... – Agonized 28/7, 2013 at 5:47

The bugs_activity table doesn't have an index, or did you mean something else. – Transpolar 28/7, 2013 at 5:48

I meant exactly what I said - indexes speed up queries. – Agonized 28/7, 2013 at 5:50

I cannot change the database, so I can't create indexes on the tables if they're not already there -- is there an index I'm missing that I should be using. Or are you saying that I'd be better off trying to get access to add the index. – Transpolar 28/7, 2013 at 6:3

The best way is to get someone to create an index. The MySQL database should be accessible to you in order for you to do this so you know who has access. Ask them. – Ringsmuth 7/9, 2013 at 12:55

@Ringsmuth Yeah, getting an change approved and implemented on the production server is going to be non-trivial for various reasons but you're right, I should kick off the process. I guess what I'm asking is 'have I got as far as I can without the index?'. – Transpolar 8/9, 2013 at 12:43

you say you tried left joining but the code generation was hard to maintain? is that due to a perception that you would have to put where clauses in multiple places? – Spiro 10/9, 2013 at 9:42

@gordatron It's because the query is generated from PHP (from various places because some parts use just a bit of the query) and it ended up with code being duplicated or spread around. Since my final solution was just as fast I went with the way that may more logic sense in the query generator. – Transpolar 10/9, 2013 at 9:56

If I am getting you correctly you could try this..

SET @tdate = '2012-01-01 00:00:00';

SELECT  
  b.bug_id
  ,CASE 
    WHEN s.removed IS NULL THEN b.bug_status
    ELSE s.removed
  END AS statusAtDate
  ,CASE 
    WHEN o.removed IS NULL THEN b.op_sys
    ELSE o.removed
  END AS apSysAtDate
FROM
  bugs AS b 
  LEFT OUTER JOIN (
    SELECT 
      a.bug_id
      ,a.bug_when
      ,a.removed
      ,a.bug_when
      ,@row_num := IF(@last=a.bug_id,@row_num+1,1) AS rnk
      ,@last:=a.bug_id
    FROM 
      bug_activity AS a
      INNER JOIN fielddefs AS f
        ON a.fieldid = f.id
          AND f.name = 'bug_status'
    WHERE
        a.bug_when <= @tdate
    ORDER BY 
      a.bug_id
      ,a.bug_when
    ) AS s
      ON b.bug_id = s.bug_id
      AND s.rnk=1
  LEFT OUTER JOIN (
    SELECT 
      a.bug_id
      ,a.bug_when
      ,a.removed
      ,a.bug_when
      ,@row_num := IF(@last=a.bug_id,@row_num+1,1) AS rnk
      ,@last:=a.bug_id
    FROM 
      bug_activity AS a
      INNER JOIN fielddefs AS f
        ON a.fieldid = f.id
          AND f.name = 'op_sys'
    WHERE
        a.bug_when <= @tdate
    ORDER BY 
      a.bug_id
      ,a.bug_when
    ) AS o
      ON b.bug_id = o.bug_id
      AND o.rnk=1

--repeat for resolution and rep_platform

sorry I don't have a db here to verify the code so sorry if there is typos or similar..

I don't know if that's how you were doing the left outer join before but does that help/work if you use a session variable for re-use?

not sure if this will be of any assistance at all seeing as you said that your left outer join method was running at the same speed anyway.. maybe mysql query optimiser can figure a better way of doing it without this :/

I am no optimisation expert by the way (far from it).. just saying what i would try other than the good suggestion to get some indexes on the go.

EDIT:

Another thing you could try.. I think this should work..

SELECT
  bug_id
  ,bug_status
  ,op_sys
  ,max(old_status)
  ,max(old_opSys)
(
SELECT  
  *
FROM
  bugs AS b 
  LEFT OUTER JOIN (
    SELECT 
      a.bug_id
      ,a.bug_when
      ,if(f.name = 'bug_status',a.removed,NULL) AS old_status
      ,if(f.name = 'op_sys',a.removed,NULL) AS old_opSys
      ,a.bug_when
      ,@row_num := IF(@last=a.bug_id AND@lastField=f.name ,@row_num+1,1) AS rnk
      ,@last:=a.bug_id
      ,@lastField:=f.name
    FROM 
      bug_activity AS a
      INNER JOIN fielddefs AS f
        ON a.fieldid = f.id

    WHERE
        a.bug_when <= '2012-01-01 00:00:00'
        AND f.name in( 'bug_status','op_sys')
    ORDER BY 
      a.bug_id
      ,f.name
      ,a.bug_when
    ) AS s
      ON b.bug_id = s.bug_id
      AND s.rnk=1
) AS T
  GROUP BY
    bug_id
    ,bug_status
    ,op_sys

I have left out the case or if statement from the outer select here.. I was thinking no matter which solution it might be worth testing how it runs doing the final checks in code rather than the DB? even if that works you might not opt for it but it might be worth checking.

as in something like:

<%= row->old_status ?: row->bug_status %>

(sorry if my PHP is off.. not really used it much)

seems like it should work? http://sqlfiddle.com/#!2/eff8c/1

Spiro answered 10/9, 2013 at 10:24 Comment(5)

you might be able to do something funky where you join to the activity table once for all fields and change the rnk column every time the bug_id or fieldname changes. then wrap the whole lot in a group by bug_id and max on every other column.. don't knwo if it would be any faster and cant remember if you can max text fields like that in mysql. – Spiro 10/9, 2013 at 10:38

Thanks for taking the time to answer. It's an interesting approach, and I'm going to have to fiddle with it a bit. But currently it's about six times slower than my approach. Interestingly, if @tdate is NULL (because I forgot to set it), it's much much faster, I guess the server is doing some clever optimization (or because no results are returned internally). I'll do some test and profiling, it's given me a few more avenues to explore at least. :) – Transpolar 11/9, 2013 at 4:33

Such a big speed difference! hmm you might try swapping the case statements out for the ifs that you have.. I only used case as they are my default option (due to their increases flexibility and readability) but maybe they are significantly slower than if in this case? also the row_number part might be quite slow, maybe would be worth doing only once. – Spiro 11/9, 2013 at 8:8

added a different way of doing it.. not sure it will be any better and it might not work! also if you are getting indexes set up it might be worth saving copies of a few different working version to test again after the index has been created – Spiro 11/9, 2013 at 8:34

Thanks, it works (with some tweaking) but it's still slow. I've done some investigation and it's definitely the JOIN that's slow, the inner query is fast enough but returns 26,000+ rows and there's 10,000 ish in the bug table. The connection between the servers isn't great so doing work in the app isn't ideal, but I think I can maybe take a hybrid of the ideas and make something the works faster. I'll have to spend some time thinking about it -- but thanks, what you've posted gives me many more options to play with. – Transpolar 11/9, 2013 at 10:24

I suggest using the Bugzilla REST API interface rather than accessing the Bugzilla DB directly. Here is an example API to retrieve bugs created on a particular date.

https://api-dev.bugzilla.mozilla.org/test/1.3/bug?creation_date=2008-03-31

References:
https://wiki.mozilla.org/Bugzilla:REST_API
https://wiki.mozilla.org/Bugzilla:REST_API:Objects
https://wiki.mozilla.org/Bugzilla:REST_API:Search

Blocker answered 12/9, 2013 at 21:33 Comment(2)

I did look at the REST API but I read that it was quite buggy (although that might just be on our older version, we're updating soon). Also it seems quite difficult to ask 'What was the status of bug 1234 on 2012-01-01' it still looks like I would have to download and parse the change sets. I guess there's probably a complex boolean query that would do it. I'll give it a test when I get a chance, but it looks like it'll be transferring too much information over the slow link to be faster than the SQL query (since I do COUNTs on the subtable and just return raw totals). – Transpolar 13/9, 2013 at 0:27

Where did you read that the REST API was quite buggy? As its maintainer, I can assure you that the current versions are highly stable, and used for hundreds of thousands of queries a week by the Mozilla project. It is soon to be replaced by a native REST API which will have the same interface (with a compatibility shim) but be much faster and have fewer limitations on the data returned. – Yeargain 25/2, 2014 at 10:1

Recommended topics

Hot tags