Aggregating hourly data into daily aggregates

I have hourly weather data in the following format:

Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
...
...
12/31/2000 23:00,25

What I need is a daily aggregate of max, min, and average, like this:

Date,MaxDBT,MinDBT,AveDBT
01/01/2000,36,23,28
01/02/2000,34,22,29
01/03/2000,32,25,30
...
...
12/31/2000,35,9,20

How can I do this in R?

Henna answered 4/3, 2011 at 23:30 Comment(2)
Next time, provide the data tables in a readable format, e.g. using dput() or giving runnable code that provides the data, as I showed in my answer. – Curvilinear
Having read the proposed answers, I wonder whether it can be done using base R only. – Wooldridge

1) This can be done compactly using zoo:

L <- "Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
12/31/2000 23:00,25"

library(zoo)
stat <- function(x) c(min = min(x), max = max(x), mean = mean(x))
z <- read.zoo(text = L, header = TRUE, sep = ",", format = "%m/%d/%Y", aggregate = stat)

This gives:

> z
           min max     mean
2000-01-01  30  33 31.33333
2000-12-31  25  25 25.00000
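
If you then need an ordinary data frame in the question's layout, zoo's fortify.zoo() converts the series. A small sketch (the renaming is just illustrative, to match the question's column names):

DF.daily <- fortify.zoo(z)   # columns: Index, min, max, mean
names(DF.daily) <- c("Date", "MinDBT", "MaxDBT", "AveDBT")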

2) Here is a solution that uses only core R:

DF <- read.csv(text = L)
DF$Date <- as.Date(DF$Date, "%m/%d/%Y")
ag <- aggregate(DBT ~ Date, DF, stat) # same stat as in zoo solution 

The last line gives:

> ag
        Date  DBT.min  DBT.max DBT.mean
1 2000-01-01 30.00000 33.00000 31.33333
2 2000-12-31 25.00000 25.00000 25.00000
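
Note that ag$DBT is a matrix column (see the comments below). If you prefer three ordinary columns instead, a hedged sketch:

ag.flat <- do.call(data.frame, ag)   # splits DBT into DBT.min, DBT.max, DBT.mean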

EDIT: (1) Since this answer first appeared, the text= argument to read.zoo has been added to the zoo package. (2) Minor improvements.

Kadi answered 5/3, 2011 at 1:15 Comment(3)
@Grothendieck: +1 I never realized you could just drop the hourly information by not specifying it in the as.Date(). – Curvilinear
I'm looking at this from the phone, but I'd bet that the aggregate solution gives a list column which you can't really use for anything unless you wrap the whole thing into do.call(cbind.data.frame, aggregate... – Peculiarity
It produces a data.frame whose first column is ag$Date and whose second column is the matrix ag$DBT, whose columns are "min", "max" and "mean". It readily converts to a zoo series using read.zoo(ag). – Kadi

Using strptime() and trunc() from base R, plus ddply() from the plyr package:

#Make the data
ZZ <- textConnection("Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
12/31/2000 23:00,25")
dataframe <- read.csv(ZZ, header = TRUE)
close(ZZ)

# Do the calculations
dataframe$Date <- strptime(dataframe$Date,format="%m/%d/%Y %H:%M")
dataframe$day <- trunc(dataframe$Date,"day")

require(plyr)

ddply(dataframe,.(day),
      summarize,
      aveDBT=mean(DBT),
      maxDBT=max(DBT),
      minDBT=min(DBT)
)

gives

         day   aveDBT maxDBT minDBT
1 2000-01-01 31.33333     33     30
2 2000-12-31 25.00000     25     25

To clarify :

strptime converts the character strings to date-times according to the given format. To see how you can specify the format, see ?strptime. trunc then truncates these date-times to the specified unit, which is day in this case.

ddply will evaluate the function summarize within the dataframe after splitting it up according to day. Everything after summarize is passed as arguments to summarize.
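
For instance, the parsing and truncation step on a single timestamp (a quick sketch; the printed time zone depends on your locale):

x <- strptime("01/01/2000 13:45", format = "%m/%d/%Y %H:%M")
trunc(x, "day")   # 2000-01-01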

Curvilinear answered 4/3, 2011 at 23:44 Comment(2)
Do you really need the wrapper around summarize? – Eldon
@Sacha: indeed, no need for that, thanks for the correction. It's getting late here... – Curvilinear

There is also a nice package called hydroTSM. It works on zoo objects and can aggregate to other time resolutions.

The function for your case is subdaily2daily. You can choose whether the aggregation should be based on min, max, or mean (see the sketch below).
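
A minimal sketch, assuming the data have been read into an hourly zoo series (the series construction below is illustrative, built from the question's sample rows):

library(zoo)
library(hydroTSM)

# Illustrative hourly series from the question's sample values
tt <- as.POSIXct(c("2000-01-01 01:00", "2000-01-01 02:00",
                   "2000-01-01 03:00", "2000-12-31 23:00"))
z.hourly <- zoo(c(30, 31, 33, 25), order.by = tt)

# One call per statistic; FUN selects the aggregation
d.max  <- subdaily2daily(z.hourly, FUN = max,  na.rm = TRUE)
d.min  <- subdaily2daily(z.hourly, FUN = min,  na.rm = TRUE)
d.mean <- subdaily2daily(z.hourly, FUN = mean, na.rm = TRUE)

merge(MaxDBT = d.max, MinDBT = d.min, AveDBT = d.mean)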

Picaresque answered 15/4, 2014 at 8:42 Comment(0)

A couple of options:

1. Timetk

If you have a data frame (or tibble) then the summarize_by_time() function from timetk can be used:

library(tidyverse)
library(timetk)

# Collect Data
text <- "Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
12/31/2000 23:00,25"

df <- read_csv(text, col_types = cols(Date = col_datetime("%m/%d/%Y %H:%M")))
df
#> # A tibble: 4 x 2
#>   Date                  DBT
#>   <dttm>              <dbl>
#> 1 2000-01-01 01:00:00    30
#> 2 2000-01-01 02:00:00    31
#> 3 2000-01-01 03:00:00    33
#> 4 2000-12-31 23:00:00    25

# Summarize
df %>%
  summarise_by_time(
    .date_var = Date, 
    .by       = "day",
    min       = min(DBT),
    max       = max(DBT),
    mean      = mean(DBT)
  )
#> # A tibble: 2 x 4
#>   Date                  min   max  mean
#>   <dttm>              <dbl> <dbl> <dbl>
#> 1 2000-01-01 00:00:00    30    33  31.3
#> 2 2000-12-31 00:00:00    25    25  25

Created on 2021-05-21 by the reprex package (v2.0.0)

2. Tidyquant

You can use the tidyquant package for this. The process involves using the tq_transmute function to return a data frame that is modified by the xts aggregation function apply.daily. We'll apply a custom stat_fun, which returns the min, max and mean. However, you can apply any vector function you'd like, such as quantile (see the sketch after the output below).


library(tidyquant)

df
#> # A tibble: 4 x 2
#>                  Date   DBT
#>                <dttm> <dbl>
#> 1 2000-01-01 01:00:00    30
#> 2 2000-01-01 02:00:00    31
#> 3 2000-01-01 03:00:00    33
#> 4 2000-12-31 23:00:00    25

stat_fun <- function(x) c(min = min(x), max = max(x), mean = mean(x))

df %>%
    tq_transmute(select     = DBT,
                 mutate_fun = apply.daily,
                 FUN        = stat_fun)
#> # A tibble: 2 x 4
#>                 Date   min   max     mean
#>                <dttm> <dbl> <dbl>    <dbl>
#> 1 2000-01-01 03:00:00    30    33 31.33333
#> 2 2000-12-31 23:00:00    25    25 25.00000
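
As noted above, any vector-returning function can be swapped in for stat_fun. A hedged sketch with quantile (extra arguments such as probs should be passed through to it):

df %>%
    tq_transmute(select     = DBT,
                 mutate_fun = apply.daily,
                 FUN        = quantile,
                 probs      = c(0.25, 0.75))
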
Neurovascular answered 26/6, 2017 at 21:3 Comment(2)
Can you please let me know what needs to be done if I need to aggregate at the hourly level? – Bullate
For hourly, you can use timetk::summarize_by_time(.by = "hour"). – Neurovascular
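
Following up on that comment, a minimal sketch of hourly aggregation with timetk, assuming the same df as above:

df %>%
  summarise_by_time(
    .date_var = Date,
    .by       = "hour",
    mean      = mean(DBT)
  )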

Given that you have times in POSIXct format (or can convert them with as.POSIXct()), all you need is cut() and aggregate().

Try this:

split_hour = cut(as.POSIXct(temp$time), breaks = "60 mins") # bin the timestamps into 60-minute intervals
temp$hour = split_hour # add the interval as a grouping variable
ag = aggregate(. ~ hour, temp, mean) # mean of every column within each interval

In this case, temp looks like this:

1  0.6 0.6 0.0 0.350 0.382 0.000 2020-04-13 18:30:42
2  0.0 0.5 0.5 0.000 0.304 0.292 2020-04-13 19:56:02
3  0.0 0.2 0.2 0.000 0.107 0.113 2020-04-13 20:09:10
4  0.6 0.0 0.6 0.356 0.000 0.376 2020-04-13 20:11:57
5  0.0 0.3 0.2 0.000 0.156 0.148 2020-04-13 20:12:07
6  0.0 0.4 0.4 0.000 0.218 0.210 2020-04-13 22:02:49
7  0.2 0.2 0.0 0.112 0.113 0.000 2020-04-13 22:31:43
8  0.3 0.0 0.3 0.155 0.000 0.168 2020-04-14 03:19:03
9  0.4 0.0 0.4 0.219 0.000 0.258 2020-04-14 03:55:58
10 0.2 0.0 0.0 0.118 0.000 0.000 2020-04-14 04:25:25
11 0.3 0.3 0.0 0.153 0.160 0.000 2020-04-14 05:38:20
12 0.0 0.7 0.8 0.000 0.436 0.493 2020-04-14 05:40:02
13 0.0 0.0 0.2 0.000 0.000 0.101 2020-04-14 05:40:44
14 0.3 0.0 0.3 0.195 0.000 0.198 2020-04-14 06:09:26
15 0.2 0.2 0.0 0.130 0.128 0.000 2020-04-14 06:17:15
16 0.2 0.0 0.0 0.144 0.000 0.000 2020-04-14 06:19:36
17 0.3 0.0 0.4 0.177 0.000 0.220 2020-04-14 06:23:43
18 0.2 0.0 0.0 0.110 0.000 0.000 2020-04-14 06:25:19
19 0.0 0.0 0.0 1.199 1.035 0.251 2020-04-14 07:05:24
20 0.2 0.2 0.0 0.125 0.107 0.000 2020-04-14 07:21:46

ag looks like this:

1  2020-04-13 18:30:00 0.60000000 0.6000000 0.0000000 0.3500000 0.38200000 0.00000000
2  2020-04-13 19:30:00 0.15000000 0.2500000 0.3750000 0.0890000 0.14175000 0.23225000
3  2020-04-13 21:30:00 0.00000000 0.4000000 0.4000000 0.0000000 0.21800000 0.21000000
4  2020-04-13 22:30:00 0.20000000 0.2000000 0.0000000 0.1120000 0.11300000 0.00000000
5  2020-04-14 02:30:00 0.30000000 0.0000000 0.3000000 0.1550000 0.00000000 0.16800000
6  2020-04-14 03:30:00 0.30000000 0.0000000 0.2000000 0.1685000 0.00000000 0.12900000
7  2020-04-14 05:30:00 0.18750000 0.1500000 0.2125000 0.1136250 0.09050000 0.12650000
8  2020-04-14 06:30:00 0.10000000 0.1000000 0.0000000 0.6620000 0.57100000 0.12550000
9  2020-04-14 07:30:00 0.00000000 0.3000000 0.2000000 0.0000000 0.16200000 0.11800000
10 2020-04-14 19:30:00 0.20000000 0.3000000 0.0000000 0.1460000 0.19000000 0.00000000
11 2020-04-14 20:30:00 0.06666667 0.2000000 0.2666667 0.0380000 0.11766667 0.17366667
12 2020-04-14 22:30:00 0.20000000 0.3000000 0.0000000 0.1353333 0.18533333 0.00000000
13 2020-04-14 23:30:00 0.00000000 0.5000000 0.5000000 0.0000000 0.28000000 0.32100000
14 2020-04-15 01:30:00 0.25000000 0.2000000 0.4500000 0.1355000 0.11450000 0.26100000
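
The same idea applied to the question's daily aggregation, as a sketch built from the question's sample rows:

DF <- read.csv(text = "Date,DBT
01/01/2000 01:00,30
01/01/2000 02:00,31
01/01/2000 03:00,33
12/31/2000 23:00,25")
DF$day <- cut(as.POSIXct(DF$Date, format = "%m/%d/%Y %H:%M"), breaks = "1 day")
aggregate(DBT ~ day, DF, function(x) c(max = max(x), min = min(x), mean = mean(x)))
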
Statesman answered 28/4, 2020 at 9:27 Comment(0)
