Financial Data - R data.table - group by condiction

A

3

6

Given the following data.table with financial data:

userId  systemBankId    accountId   valueDate   quantity    description
871     0065            6422        2013-02-28  -52400      AMORTIZACION PRESTAMO       
871     0065            6422        2013-03-28  -52400  AMORTIZACION PRESTAMO   
871     0065            6422        2013-04-01  -3000000    AMORTIZACION PRESTAMO
871     0065            6422        2013-04-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-05-31  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-06-28  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-07-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-08-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-09-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-10-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-11-29  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2013-12-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-01-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-02-28  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-03-31  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-04-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-05-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-06-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-07-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-08-29  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-09-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-10-30  -52349  AMORTIZACION PRESTAMO   
871     0065            6422        2014-11-28  -52349  AMORTIZACION PRESTAMO

I want to group by userId, systemBankId, accountId, and quantity:

dt[userId==871L,.N,by=.(userId,systemBankId,accountId,quantity)]

The result is the following:

   userId systemBankId accountId quantity  N
   871         0065      6422   -52400     3
   871         0065      6422 -3000000     1
   871         0065      6422   -52349    20

But, the first and third are the same transaction: mortgage payment and second is a loan.

I want to group by in the following manner:

userId systemBankId accountId quantity N
   871         0065      6422   -XXXXX 23
   871         0065      6422 -3000000  1

So you can see that in 24 months this user has 23 mortgage transactions and 1 loan transaction payment.

The question is: Is there an easy manner to do it? (i.e):

dt[userId==871L,.N,by=.(userId,systemBankId,accountId,(quantity %between% c(-quantity*0.20,quantity*0,20 ))]

For payments between range [-20%,20%] are considered equals.

Thank you in advanced.

Best regards.

To get the data frame of the data above:

structure(list(userId = c(871L, 871L, 871L, 871L, 871L, 871L, 
871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 871L, 
871L, 871L, 871L, 871L, 871L, 871L), systemBankId = c(65L, 65L, 
65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L, 
65L, 65L, 65L, 65L, 65L, 65L, 65L, 65L), accountId = c(6422L, 
6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 
6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 6422L, 
6422L, 6422L, 6422L, 6422L), valueDate = structure(c(2L, 4L, 
1L, 10L, 23L, 5L, 14L, 16L, 17L, 19L, 8L, 21L, 9L, 3L, 22L, 11L, 
12L, 13L, 15L, 7L, 18L, 20L, 6L), .Label = c("01/04/2013", "28/02/2013", 
"28/02/2014", "28/03/2013", "28/06/2013", "28/11/2014", "29/08/2014", 
"29/11/2013", "30/01/2014", "30/04/2013", "30/04/2014", "30/05/2014", 
"30/06/2014", "30/07/2013", "30/07/2014", "30/08/2013", "30/09/2013", 
"30/09/2014", "30/10/2013", "30/10/2014", "30/12/2013", "31/03/2014", 
"31/05/2013"), class = "factor"), quantity = c(-52400L, -52400L, 
-3000000L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, 
-52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L, 
-52349L, -52349L, -52349L, -52349L, -52349L, -52349L, -52349L
), description = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "AMORTIZACION PRESTAMO", class = "factor")), .Names = c("userId", 
"systemBankId", "accountId", "valueDate", "quantity", "description"
), class = "data.frame", row.names = c(NA, -23L))

U P D A T E

The final step is to mark in the original dataset the transactions that are mortgage payment and that are loans payments.

Based on the answers I do:

a) criterio: in a 24 months time period,, if there are twenty or more recurrent transactions by userId, systemBankId, accountId, quantity (-20%,20%) they are mortgages payments:

tmp <- dt[userId==871L,.N,by=.(userId,systemBankId,accountId,round(quantity * 5, -floor(log10(abs(quantity))))/5)][N>20,list(userId,systemBankId,accountId,round,N)]

userId systemBankId accountId  round  N
871         0065      6422    -52000 23

I know there are 23 mortgage transactions.

b) I need to identify this 23 transactions:

tmp2 <- dt[userId==871L,list(userId,systemBankId,accountId,round=round(quantity * 5, -floor(log10(abs(quantity))))/5)]

merge(tmp,tmp2,by=c('userId','systemBankId','accountId','round'))

   userId systemBankId accountId  round  N
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
   871         0065      6422 -52000 23
userId systemBankId accountId  round  N

Ok I've identified the 23 transactions but if I have a transacction with quantity equal to -52000 this will be marked like it was a mortgage payment too.

My question is: based on the criterio of recurrent payment How I identify the correct transactions.

Thx in advanced.

Arakawa answered 10/3, 2015 at 15:15 Comment(3)

In your example, you only have 2 -52400 quantity, not 3. – Borstal 10/3, 2015 at 15:32

Something like

DT[userId == 871L, .N, by = .(userId, systemBankId, accountId, quantity = round(quantity * 5, -floor(log10(abs(quantity))))/5)]

? – Helio 10/3, 2015 at 15:49

@Helio you ought to post that as an answer – Synergist 10/3, 2015 at 16:0

E

4

Perhaps smth along these lines:

dt[, round.qty := quantity[1] * round(quantity/quantity[1]), by = .(userId, systemBankId, accountId)]

dt[, .N, by = .(userId, systemBankId, accountId, round.qty)]
#   userId systemBankId accountId round.qty  N
#1:    871           65      6422    -52400 22
#2:    871           65      6422  -2986800  1

Emancipator answered 10/3, 2015 at 16:6 Comment(0)

O

4

Here's a quick hack with dplyr:

library(dplyr)
setDF(dt) %>% mutate(quantity =  round(quantity/10000, 0)) %>%
  group_by(userId, systemBankId, accountId, quantity) %>% tally()

Which gives:

#Source: local data frame [2 x 5]
#Groups: userId, systemBankId, accountId
#
#  userId systemBankId accountId quantity  n
#1    871           65      6422     -300  1
#2    871           65      6422       -5 22

Edit

As mentionned by David in the comments, this answer is an oversimplification. A more consistent approach would be something like Roland's suggestion:

library(dplyr)
setDF(dt) %>% 
  mutate(quantity = round(quantity * 5, -floor(log10(abs(quantity))))/5) %>%
  group_by(userId, systemBankId, accountId, quantity) %>% tally()

Or using data.table:

dt[userId == 871L, .N, by = .(userId, systemBankId, accountId, quantity = round(quantity * 5, -floor(log10(abs(quantity))))/5)]

Overturn answered 10/3, 2015 at 15:44 Comment(3)

Not sure why couldn't just slightly modify OPs codes to dt[userId==871L,.N,by=.(userId,systemBankId,accountId, quantity = round(quantity/10000, 0))], why dplyr? – Rickrickard 10/3, 2015 at 15:48

What about the fact that [-20%,20%] are considered equals – Epigene 10/3, 2015 at 15:54

This answer is also conceptually wrong because you are assuming that the morgatge payments are in 10 thousands. Try dt[, quantity := quantity*100] and then run your code again and see what happens. – Rickrickard 10/3, 2015 at 15:55

E

4

Perhaps smth along these lines:

dt[, round.qty := quantity[1] * round(quantity/quantity[1]), by = .(userId, systemBankId, accountId)]

dt[, .N, by = .(userId, systemBankId, accountId, round.qty)]
#   userId systemBankId accountId round.qty  N
#1:    871           65      6422    -52400 22
#2:    871           65      6422  -2986800  1

Emancipator answered 10/3, 2015 at 16:6 Comment(0)

T

0

Here is a clever function that is relevant created by @dnlbrky here: Use a value from the previous row in an R data.table calculation

#Create a function to return previous rows 
   rowShift <- function(x, shiftLen = 1L) {
     r <- (1L + shiftLen):(length(x) + shiftLen)
     r[r<1] <- NA
     return(x[r])
   }

This will let you determine your 20% ranges:

dt$prev_qty_low  <-rowShift(dt$quantity,-1) * .8
dt$prev_qty_high <-rowShift(dt$quantity,-1) * 1.2

Terryl answered 10/3, 2015 at 16:4 Comment(0)

Recommended topics

Hot tags