Convert FIX message format ("Tag=Value") into CSV

I have a CSV/log file of 35=S messages (Quote messages, in "Tag=Value" format) and I need to extract the rates into a proper CSV file for data mining. This is not strictly FIX related; it's more of an R question about how to clean a dataset.

The raw messages look something like this:

190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,

I first need to get to an intermediate data set that looks like this, where the same tags are aligned.

190=1.1204 ,191=-0.000029,193=20141008,537=0,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,            ,537=0,631=7.2034485 ,            ,10=140
190=1.26237,191=0        ,            ,537=1,              ,            ,10=068

which in turn will need to be converted to this:

190    ,191      ,193     ,537,631       ,642     ,10
1.1204 ,-0.000029,20141008,0  ,1.12029575,0.000145,56
7.20425,0.000141 ,        ,0  ,7.2034485 ,        ,140
1.26237,0        ,        ,1  ,          ,        ,068

I'm in the midst of developing a bash script with awk, but I wonder if I can do this in R. At present, my greatest challenge is arriving at the intermediate table. From the intermediate to the final table, I thought of using R with the tidyr package, specifically its separate function. If anybody can suggest a better approach, I'd greatly appreciate it!
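
For illustration, here is roughly the separate step I had in mind (an untested sketch; cells is a hypothetical character vector holding the individual "Tag=Value" strings, which is exactly the part I haven't solved yet):

library(tidyr)

# hypothetical input: the individual "Tag=Value" cells as a character vector
cells <- c("190=1.1204", "191=-0.000029", "537=0")
separate(data.frame(cells), col = cells, into = c("tag", "value"), sep = "=")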

Newmarket answered 28/10, 2014 at 13:37 Comment(2)
Can you clarify how extensive these records are? How many columns and rows of CSV data?Benitez
10,000 rows, and the number of columns might be over 20, depending on the type of deal. The above is just a snippet to get the point acrossNewmarket

Another possibility. Start with the same scan as @Andrie, but also use the arguments strip.white and na.strings:

x <- scan(text = "190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,",
           sep = ",",
           what = "character", 
           strip.white = TRUE,
           na.strings = "")

# remove NA
x <- x[!is.na(x)]

Then use colsplit and dcast from the reshape2 package:

library(reshape2)

# split 'x' into two columns
d1 <- colsplit(string = x, pattern = "=", names = c("x", "y")) 

# create an id variable, needed in dcast
d1$id <- ave(d1$x, d1$x, FUN = seq_along)   

# reshape from long to wide
d2 <- dcast(data = d1, id ~ x, value.var = "y")

#   id  10     190       191      193 537      631      642
# 1  1  56 1.12040 -0.000029 20141008   0 1.120296 0.000145
# 2  2 140 7.20425  0.000141       NA   0 7.203449       NA
# 3  3  68 1.26237  0.000000       NA   1       NA       NA

Because you mentioned tidyr:

library(tidyr)
d1 <- separate(data = data.frame(x), col = x, into = c("x", "y"), sep = "=")
d1$id <- ave(d1$x, d1$x, FUN = seq_along)
spread(data = d1, key = x, value = y)
#   id  10     190       191      193 537        631      642
# 1  1  56  1.1204 -0.000029 20141008   0 1.12029575 0.000145
# 2  2 140 7.20425  0.000141     <NA>   0  7.2034485     <NA>
# 3  3 068 1.26237         0     <NA>   1       <NA>     <NA>

This retains the values as character. If you want numeric, you can set convert = TRUE in spread.
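
For example, with the same d1 as above:

spread(data = d1, key = x, value = y, convert = TRUE)

convert = TRUE runs type.convert() on the new columns, so the tag columns come back numeric.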

Jeffreys answered 28/10, 2014 at 14:51 Comment(4)
Wow, that looks interesting, will test and revert! I knew R guys had seen this question before, thanks Henrik!Newmarket
@Jeffreys Could I ask a favour? Would you explain how ave(d1$x, d1$x, FUN = seq_along) creates the grouping variable? I would like to know how this line works.Boris
@jazzurro, From ?ave: "Subsets of x[] are averaged, where each subset consist of those observations with the same factor levels.", with the default FUN, which is mean. In this case it could be described as "apply a function FUN, here FUN = seq_along, on subsets of d1$x, where each subset consists of those observations with the same factor levels, as defined by the grouping variable d1$x". ave indexes the result according to the order of the original vector.Jeffreys
@Jeffreys Thank you very much for the explanation. I shuffled d1 and checked the behaviour of the ave line. Now I see your point. Since the order of id sequence was perfect (seven 1s, five 2s, and four 3s) I could not see what was going on; I thought the ave line was assigning something like a group variable. Silly of me. Thank you for taking your time.Boris
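
A tiny standalone illustration (made-up tag vector, not taken from the data above) of how ave(d1$x, d1$x, FUN = seq_along) builds the per-tag row counter:

tags <- c("190", "191", "190", "537", "190")
# seq_along is applied within each group of identical values,
# and the results are placed back in the original positions
ave(tags, tags, FUN = seq_along)
# [1] "1" "1" "2" "1" "3"

Every element receives its occurrence number within its own tag group, which is exactly the row id needed before reshaping to wide format.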

EDITED. Full solution using base R functions only:

dat <- scan(sep=",", what="character", text="190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,")

dat <- gsub(" ", "", dat)
dat <- dat[dat != ""]

x <- as.data.frame(
  matrix(
    unlist(
      sapply(dat, strsplit, split = "=", USE.NAMES=FALSE)
    ),
    ncol=2, byrow=TRUE
  )
)

z <- unstack(x, V2 ~ V1)

The resulting object is a named list that is close to what you wanted. You will have to do some additional work to convert this into a matrix, if required.

$`10`
[1] "56"  "140" "068"

$`190`
[1] "1.1204"  "7.20425" "1.26237"

$`191`
[1] "-0.000029" "0.000141"  "0"   

....
etc.     

From here, you simply have to pad the list with the appropriate number of NA values:

maxLength <- max(sapply(z, length))
sapply(z, function(x)c(as.numeric(x), rep(NA, maxLength - length(x))))

Gives:

      10     190       191      193 537      631      642
[1,]  56 1.12040 -0.000029 20141008   0 1.120296 0.000145
[2,] 140 7.20425  0.000141       NA   0 7.203449       NA
[3,]  68 1.26237  0.000000       NA   1       NA       NA
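
If the final CSV from the question is the goal, a possible follow-up sketch (the column order is taken from the question; the file name is just an example):

res <- sapply(z, function(x) c(as.numeric(x), rep(NA, maxLength - length(x))))
# reorder the columns to the tag order shown in the question, then write the CSV
res <- res[, c("190", "191", "193", "537", "631", "642", "10")]
write.csv(res, "rates.csv", row.names = FALSE, na = "")
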
Margarettamargarette answered 28/10, 2014 at 13:57 Comment(6)
Thanks. I'm not sure this works yet; I'd expect to see 3 values for each tag, say, $642 [1] "0.000145", "", ""Newmarket
Correct, that's why I say you'll have to do some additional work to get it into a matrix, mainly by adding NAs as appropriate.Margarettamargarette
Sadly, that crashes the system. There are 10,000 rows and about 20 fields, and R crashes. I might have to go the bash/awk route.Newmarket
@fabiog might as well use quickfix if you're open to using other software.Nuclease
@Nuclease yeah all that log is created by QF/J, but I don't have access to the app itself, only to the logs. I wish I had, that would be problem solved!Newmarket
@Andrie, Nice base answer (+1)! I was just wondering if you need sapply? identical(sapply(dat, strsplit, split = "=", USE.NAMES=FALSE), strsplit(dat, split = "="))Jeffreys