I'm working on an application that takes data from various sources and generates reports. Currently I'm changing it to do reports based on a the data from a given date in history, previously it only showed data from the state of things today.
One of my data sources is Bugzilla, so I need to get the Bugzilla data for a given date in history. I have a read-only connection to the Bugzilla database but no easy way to do anything else to the server (like install plug-ins, or put procedures in the database). Also the connection between the report server and the Bugzilla server is slow, so I'd like to do the calculations on the server rather than fetch the data and work things out on the reports server.
I actually have this working at a mostly acceptable speed, but I'm not sure if I'm doing it the best or 'right' way, I'm concerned that the speed might cease to be acceptable as we add more issues to the database.
So, my solution is below -- how would you do it.
For a bit of background, Bugzilla stores the current state of all bugs in a table (called 'bugs') and a history of the changes to each field in a table ('bugs_activity') that looks something like this:
fieldid INTEGER, -- References the fielddefs table
bug_when TIMESTAMP, -- Time the change happend
added TEXT, -- New text for the field
removed TEXT, -- Old text for the field
The Bugzilla database is MySQL. I think the right way to do it is either with a stored procedure or a temporary table, but I don't have either option available to me. I know there are also reporting tools for Bugzilla but I don't have access to install them, also the reports I'm generating tie up data from other sources (and have specific formatting).
There's a local PostgreSQL database on the reports server so I could just periodically mirror all the data over to there, but I really don't want to do that as it seems a bit wasteful to store identical data in two places.
My solution is to build a table in a subselect that looks like the normal bugs table (for the data I'm interested in for a given report) and then use this select as the source to the normal select that works the same as query for the reports based on today's data.
SELECT bug_status, bug_id, op_sys, resolution, rep_platform
FROM (SELECT bug_id,
IFNULL((SELECT removed FROM bugs_activity a, fielddefs f
WHERE a.fieldid = f.id
AND bug_id = b.bug_id AND f.name = 'bug_status'
AND bug_when >= '2012-01-01 00:00:00'
ORDER BY bug_when DESC LIMIT 1), bug_status) AS bug_status,
-- Repeat IFNULL clause for op_sys, resolution and rep_platform
FROM bugs b
WHERE b.creation_ts <= '2012-01-01 00:00:00' ) bug_subselect
-- Some other filters to reduce the bugs (specific product, ect)
)
-- More filters based on the new values that have been derived
;
Then I use that as an input to a select that counts the different statuses, etc.
This query turns out to be way too slow, I'm assuming because it's getting the entire results for the inner selects so it can order then and give me the top one.
I did try doing it by LEFT JOINing the bugs_activity table onto the bugs table several times and then doing the IFNULL queries on the results, that was fast but a little complex to maintain in the generation code so adapted it to this:
SELECT bug_status, bug_id, op_sys, resolution, rep_platform
FROM (SELECT bug_id,
IFNULL((SELECT removed FROM bugs_activity a, fielddefs f
WHERE a.fieldid = f.id AND bug_id = b.bug_id AND f.name = 'bug_status'
AND bug_when = (
SELECT MIN(bug_when) FROM bugs_activity a, fielddefs f
WHERE a.fieldid = f.id
AND bug_id = b.bug_id
AND f.name = 'bug_status'
AND bug_when >= '2012-01-01 00:00:00'
LIMIT 1
)
LIMIT 1), bug_status) AS bug_status,
-- Repeat IFNULL clause for op_sys, resolution and rep_platform
FROM bugs b
WHERE b.creation_ts <= '2012-01-01 00:00:00' ) bug_subselect
-- Some other filters to reduce the bugs (specific product, ect)
)
-- More filters based on the new values that have been derived
;
You need both LIMIT 1's in there (I think) as some fields have managed to have two changes on the same timestamp (either a database glitch, maybe from an upgrade, or two users editing the same bug -- I'm not sure, I just know that it's in there and I need to deal with it).
This runs in around 3 seconds with no filters to reduce the bug list (which is the worst case and will almost never happen), and it runs faster with filters. The LEFT JOIN version runs in roughly the same speed (slightly slower) so I went with the one above. It's OK for the moment, but I can see it getting slow in the future -- I'll add a loading message in the GUI and there's already a message saying these reports can take longer to generate, I'm just wondering if I'm missing some trick to make it faster.