How to exchange Msgpack files between Python and R?

Consider this simple example:

import pandas as pd

mydata = pd.DataFrame({'mytime': [pd.to_datetime('2018-01-01 10:00:00.513'),
                                pd.to_datetime('2018-01-03 10:00:00.513')],
                      'myvariable': [1,2],
                      'mystring': ['hello', 'world']})
mydata
Out[7]: 
  mystring                  mytime  myvariable
0    hello 2018-01-01 10:00:00.513           1
1    world 2018-01-03 10:00:00.513           2

I know I can write that dataframe to msgpack using Pandas:

mydata.to_msgpack('C://Users/john/Documents/mypack')

The problem is: how can I read that msgpack file in R?

Using RcppMsgPack returns some puzzling output that is not a data frame or tibble:

library(tidyverse)
library(RcppMsgPack)

df <- msgpack_read('C://Users/john/Documents/mypack', simplify = TRUE)
> df
$axes
$axes[[1]]
$axes[[1]]$typ
[1] "index"

$axes[[1]]$name
NULL

$axes[[1]]$klass
[1] "Index"

$axes[[1]]$compress
NULL

$axes[[1]]$data
[1] "mystring"   "mytime"     "myvariable"

$axes[[1]]$dtype
[1] "object"


$axes[[2]]
$axes[[2]]$typ
[1] "range_index"

$axes[[2]]$name
NULL

$axes[[2]]$klass
[1] "RangeIndex"

$axes[[2]]$start
[1] 0

$axes[[2]]$step
[1] 1

$axes[[2]]$stop
[1] 2



$typ
[1] "block_manager"

$blocks
$blocks[[1]]
$blocks[[1]]$shape
[1] 1 2

$blocks[[1]]$klass
[1] "IntBlock"

$blocks[[1]]$compress
NULL

$blocks[[1]]$values
 [1] 01 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00
attr(,"EXT")
[1] 0

$blocks[[1]]$locs
$blocks[[1]]$locs$typ
[1] "ndarray"

$blocks[[1]]$locs$dtype
[1] "int64"

$blocks[[1]]$locs$compress
NULL

$blocks[[1]]$locs$ndim
[1] 1

$blocks[[1]]$locs$data
[1] 02 00 00 00 00 00 00 00
attr(,"EXT")
[1] 0

$blocks[[1]]$locs$shape
[1] 1


$blocks[[1]]$dtype
[1] "int64"


$blocks[[2]]
$blocks[[2]]$shape
[1] 1 2

$blocks[[2]]$klass
[1] "DatetimeBlock"

$blocks[[2]]$compress
NULL

$blocks[[2]]$values
 [1] 40 02 0e 64 4d a7 05 15 40 02 ac 86 76 44 06 15
attr(,"EXT")
[1] 0

$blocks[[2]]$locs
$blocks[[2]]$locs$typ
[1] "ndarray"

$blocks[[2]]$locs$dtype
[1] "int64"

$blocks[[2]]$locs$compress
NULL

$blocks[[2]]$locs$ndim
[1] 1

$blocks[[2]]$locs$data
[1] 01 00 00 00 00 00 00 00
attr(,"EXT")
[1] 0

$blocks[[2]]$locs$shape
[1] 1


$blocks[[2]]$dtype
[1] "datetime64[ns]"


$blocks[[3]]
$blocks[[3]]$shape
[1] 1 2

$blocks[[3]]$klass
[1] "ObjectBlock"

$blocks[[3]]$compress
NULL

$blocks[[3]]$values
[1] "hello" "world"

$blocks[[3]]$locs
$blocks[[3]]$locs$typ
[1] "ndarray"

$blocks[[3]]$locs$dtype
[1] "int64"

$blocks[[3]]$locs$compress
NULL

$blocks[[3]]$locs$ndim
[1] 1

$blocks[[3]]$locs$data
[1] 00 00 00 00 00 00 00 00
attr(,"EXT")
[1] 0

$blocks[[3]]$locs$shape
[1] 1


$blocks[[3]]$dtype
[1] "object"



$klass
[1] "DataFrame"

What should I do?

Of course, going back from R to Python would also be nice. Thanks!

Vernal answered 8/4, 2019 at 17:4 Comment(5)
It's super long. Let me see if I can do it. - Evangel
@Parfait done, my man. - Evangel
Yes, these look like Pythonic elements: dtype, ndarray... Curious, how does the same R data look when written with msgpack? And can it be read in pandas? - Bituminous
@Bituminous it's an interesting point. I don't know. But maybe we can get started with that side of the equation first :) - Evangel
Apparently, the msgpack representation of a pandas DataFrame is very low-level, so it cannot be translated as-is into a suitable R object. Either you write some code that converts the RcppMsgPack output to an R data.frame, or you change the process that produces the msgpack file. The latter is of course far better: it is very bad practice to produce output that can only be read with one specific language. - Swordbill
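
One way to read the second option above (changing the process that produces the file) is to have the Python side write a format that both languages parse natively. The following is only a minimal sketch, assuming plain CSV with timestamps stored as text is acceptable in place of msgpack. It uses just the standard library, and `to_csv_text` is a hypothetical helper name, not a pandas function. Note also that `to_msgpack` was deprecated in pandas 0.25 and removed in 1.0, so the original call only works on older pandas versions.

```python
import csv
import io
from datetime import datetime

# The same two rows as the DataFrame in the question.
rows = [
    {"mytime": datetime(2018, 1, 1, 10, 0, 0, 513000), "myvariable": 1, "mystring": "hello"},
    {"mytime": datetime(2018, 1, 3, 10, 0, 0, 513000), "myvariable": 2, "mystring": "world"},
]


def to_csv_text(records):
    """Serialize records to CSV text, keeping millisecond precision in mytime."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["mytime", "myvariable", "mystring"],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        out = dict(record)
        # Write the timestamp as text so no language-specific binary layout is involved.
        out["mytime"] = record["mytime"].isoformat(sep=" ", timespec="milliseconds")
        writer.writerow(out)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```

The resulting text can be written to disk and read on the R side with read.csv; the mytime column would still need to be parsed into a date-time type there.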

How about you use library(reticulate) in R:

library(reticulate)
pyData = py_run_string("import pandas as pd
mydata = pd.DataFrame({'mytime': [pd.to_datetime('2018-01-01 10:00:00.513'),
                                pd.to_datetime('2018-01-03 10:00:00.513')],
                      'myvariable': [1,2],
                      'mystring': ['hello', 'world']})")

It would yield the desired output:

pyData$mydata
    mystring              mytime myvariable
1    hello 2018-01-01 10:00:00          1
2    world 2018-01-03 10:00:00          2

You could also save all the Python code in a Python file, e.g. mydata.py, and run it with py_run_file("mydata.py").

An overview of reticulate can be found here: https://github.com/rstudio/reticulate.

Most interesting for you is probably the description of the type conversions:

[Image: reticulate type conversion table] Source: https://github.com/rstudio/reticulate#type-conversions.

Add-on question - From R to Python:

The type conversion also holds for "sending" data from R to Python, see here: https://rstudio.github.io/reticulate/articles/calling_python.html#sourcing-scripts.

py = py_run_string("def add(x, y):
  return x + y")

py$add(5, 10)
15
Actuate answered 10/4, 2019 at 20:41 Comment(6)
Interesting, but I need a pure R solution. - Evangel
This is a pure R solution. It uses one R package. Yes, it interfaces with another language, but so do other R packages that use Rcpp (C++) or rJava! - Bituminous
No, unfortunately reticulate does not work with my network setup (I can't get that package to work correctly). So I am looking for something that leverages the msgpack packages in R. - Evangel
Also, that solution does not address the msgpack Python/R compatibility issue at all. - Evangel
Tough! This bypasses the I/O needs of reading/writing to and from disk. - Bituminous
That's the point of my question, bro! Computer A generates the msgpack in Python and computer B reads it in R. - Evangel
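
For anyone attempting the conversion on computer B, it may help to see what the raw bytes in the R output from the question encode. They appear to be little-endian int64 buffers: the IntBlock holds the integers directly, the DatetimeBlock holds nanoseconds since the Unix epoch (datetime64[ns]), and each block's locs gives the 0-based position of its column in $axes[[1]]$data. The Python sketch below checks that reading against the bytes printed in the question. The helper names are hypothetical, and the little-endian layout is an assumption that happens to match this output. The same steps would then have to be ported to R.

```python
import struct
from datetime import datetime, timedelta

# Column names from $axes[[1]]$data in the R output.
columns = ["mystring", "mytime", "myvariable"]

# Raw bytes copied from the R output ($values and $locs$data of blocks 1 and 2).
int_values = bytes.fromhex("01000000000000000200000000000000")
int_locs = bytes.fromhex("0200000000000000")
dt_values = bytes.fromhex("40020e644da705154002ac8676440615")
dt_locs = bytes.fromhex("0100000000000000")


def decode_int64(raw):
    """Unpack a buffer as little-endian signed 64-bit integers (assumed layout)."""
    return list(struct.unpack("<%dq" % (len(raw) // 8), raw))


def decode_datetime64_ns(raw):
    """Treat each int64 as nanoseconds since 1970-01-01 00:00:00."""
    epoch = datetime(1970, 1, 1)
    # Python datetimes stop at microseconds, so finer digits are truncated.
    return [epoch + timedelta(microseconds=ns // 1000) for ns in decode_int64(raw)]


frame = {
    # locs says which column each block belongs to.
    columns[decode_int64(int_locs)[0]]: decode_int64(int_values),
    columns[decode_int64(dt_locs)[0]]: decode_datetime64_ns(dt_values),
    # The ObjectBlock already arrives as plain strings; its locs value is 0.
    columns[0]: ["hello", "world"],
}

for name in columns:
    print(name, frame[name])
```

With these inputs the integer block decodes to 1 and 2 and the datetime block to the two timestamps from the question, which suggests the RcppMsgPack output contains everything needed to rebuild the data frame by hand.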
