Mysql create freqency distribution

Asked 24/2, 2013 at 19:18 Answered 21/6, 2019 at 3:37

Solved mysql count distribution frequency

I have a simple table BIRDCOUNT below, showing how many birds were counted on any given day:

+----------+
| NUMBIRDS |
+----------+
| 123      |
| 573      |
| 3        |
| 234      |
+----------+

I would like to create a frequency distribution graph, showing how many times a number of birds were counted. So I need MySQL to create something like:

+------------+-------------+
| BIRD_COUNT | TIMES_SEEN  |
+------------+-------------+
| 0-99       | 17          |
| 100-299    | 23          |
| 200-399    | 12          |
| 300-499    | 122         |
| 400-599    | 3           |
+------------+-------------+

If the bird count ranges were fixed this would be easy. However, I never know the min/max of how many birds were seen. So I need a select statement that:

Creates an output similar to above, always creating 10 ranges of counts.
(more advanced) Creates output similar to above, always creating N ranges of counts.

I don't know if #2 is possible in a single select but can anyone solve #1?

Corsage answered 24/2, 2013 at 19:18 Comment(5)

So far I found the min & max in one select, and then divided into N ranges, constructed a SELECT statement in PHP creating the ranges programatically, then running the second select. That doesn't really help anyone come up with an answer, but since you asked. – Corsage 24/2, 2013 at 19:28

can you give same sample data and what result you want? – Delija 24/2, 2013 at 19:29

and if the bird_count = 200 , to which row it will be ? – Delija 24/2, 2013 at 19:31

@Michelle Do you mean that on 17 days between 0 & 99 birds were counted and on another 23 days between 100 and 199 birds were counted etc? – Leto 24/2, 2013 at 20:7

Yes - I've updated the question to reflect non-overlapping ranges – Corsage 25/2, 2013 at 2:29

SELECT
    FLOOR( birds.bird_count / stat.diff ) * stat.diff as range_start, 
    (FLOOR( birds.bird_count / stat.diff ) +1) * stat.diff -1 as range_end, 
    count( birds.bird_count ) as times_seen
FROM birds_table birds, 
    (SELECT 
        ROUND((MAX( bird_count ) - MIN( bird_count ))/10) AS diff
    FROM birds_table
    ) AS stat
GROUP BY FLOOR( birds.bird_count / stat.diff )

Here You have answer for both of Your questions ;] with difference that start and end of range are in separate columns instead of concatenated but if You need it in one column I guess You can do it from here. To change number of ranges just edit number 10 You can find in sub-query.

Hornmad answered 24/2, 2013 at 21:52 Comment(5)

This looks promising (not tested yet). I assume that GROUP will force the count to total for each range (like a WHERE >=range_start and <=range_end)... – Corsage 25/2, 2013 at 2:42

What happens if there are no birds counted in any single range? I assume that it will not create a row of output (i.e. no row with a bird count of 0). – Corsage 25/2, 2013 at 16:26

It will not. Thinking about it You could try with outer join, i will edit answer with example in few minutes. It will be difficult to use only sql if You require dynamic number of ranges. – Hornmad 25/2, 2013 at 19:15

After few attempts i didn't find a solution for this issue, i think that is because bird_count column is used to calculate ranges and group by, so if there are no rows for a range it can't be calculated. You are left with checking for empty ranges in code or use fixed ranges. – Hornmad 25/2, 2013 at 20:39

I'll do a second select then - this seems to work well. Thanks – Corsage 25/2, 2013 at 23:52

When creating something like this, GROUP BY, is your friend. The basic idea is to put each value into a bucket, and then count the number of elements in each bucket. To create a bucket, you define a function that takes the value and compute a unique value for the bucket.

Something like this:

SELECT
  @low := TRUNCATE(bird_count/100, 0) * 100 as Low,
  TRUNCATE(@low + 99, 0) as High,
  COUNT(*) AS Count
FROM birds_seen
GROUP BY Low;

In this case, you define a function that take the bird count, and compute the lower range of the bucket. You then group all the values on the lower range, which will place, for example, 123 and 145 into the bucket labelled "100", and 234 and 246 into the bucket labelled "200".

Now, each value is placed in a bucket, and you can group the values by the bucket label, and count the number of elements in each bucket.

Lithea answered 24/2, 2013 at 20:43 Comment(2)

You don't know the highest & lowest number of birds, so your solution will created a large number of ranges (each of size 100). – Corsage 25/2, 2013 at 2:36

Not really, it will just create buckets for values that are actually in the table. – Lithea 26/2, 2013 at 10:10

I'm guess at your actual SQL query:

SELECT dateColumn, COUNT(*) AS NUMBIRDS
FROM birdTable
GROUP BY dateColumn

If so, all you have to do is "bin" your counts:

SELECT CONCAT_WS('-', 
   FLOOR( NUMBIRDS/100 )*100,
   ((FLOOR( NUMBIRDS/100 )+1)*100) - 1
) AS BIRD_COUNT
,COUNT(*) AS TIMES_SEEN
FROM (
    SELECT dateColumn, COUNT(*) AS NUMBIRDS
    FROM birdTable
    GROUP BY dateColumn
) AS birdCounts
GROUP BY BIRD_COUNT

Granted, if one of the ranges is missing, you'll won't get a matching row -- but you can easily solve that with a LEFT JOIN if that's an issue.

Calandracalandria answered 24/2, 2013 at 20:41 Comment(0)

Building upon @gustek answer and Wikipedia's Histogram page, here are a couple of solutions using Scott's Rule and Rice Rule to dynamically set the bin width h using the formula for number of bins, $k = \ceil{ (max - min) / h }$

# Histogram generator using Scott's rule, width(h) = (max - min) / k
SELECT any_value(FLOOR(r2.value / stat.width) * stat.width) as range_start,
       count(r2.value)                                      as times_seen,
FROM RESULT r2,
 (
     select 3.49 * stddev(r.value) / (power(count(*), 1 / 3)) as width
     from RESULT r
 ) as stat
GROUP BY FLOOR(r2.value / stat.width);

# Histogram using Rice rule k = ceil(2*n^1/3), width(h) = (max - min) / k
SELECT any_value(FLOOR(r2.value / stat.width) * stat.width) as range_start,
       count(r2.value)                                      as times_seen,
FROM RESULT r2,
 (
     select (max(r.value) - min(r.value)) / ceil(2 * power(count(*), 1 / 3)) as width
     from RESULT r
 ) as stat
GROUP BY FLOOR(r2.value / stat.width);

The any_value() function is used to work around new MySQL ONLY_FULL_GROUP_BY issue.

Lucais answered 21/6, 2019 at 3:37 Comment(0)

Recommended topics

Hot tags