How to use PyCall in Julia to convert Python output to Julia DataFrame
I would like to retrieve some data from quandl and analyse them in Julia. There is, unfortunately, no official API available for this (yet). I am aware of this solution, but it is still quite limited in functionality and doesn't follow the same syntax as the original Python API.

I thought it would be a smart thing to use PyCall to retrieve the data using the official Python API from within Julia. This does yield an output, but I'm not sure how I can convert it to a format that I would be able to use within Julia (ideally a DataFrame).

I have tried the following.

using PyCall, DataFrames
@pyimport quandl

data = quandl.get("WIKI/AAPL", returns = "pandas");

Julia converts this output to a Dict{Any,Any}. When using returns = "numpy" instead of returns = "pandas", I end up with a PyObject rec.array.

How can I get data to be a Julia DataFrame as quandl.jl would return it? Note that quandl.jl is not an option for me because it doesn't support automatic retrieval of multiple assets and lacks several other features, so it's essential that I can use the Python API.

Thank you for any suggestions!

Eagleeyed answered 6/3, 2017 at 1:23 Comment(0)
You're running into a difference in Python/Pandas versions. I happen to have two configurations easily available to me; Pandas 0.18.0 in Python 2 and Pandas 0.19.1 in Python 3. The answer @niczky12 provided works well in the first configuration, but I'm seeing your Dict{Any,Any} behavior in the second configuration. Basically, something changes between those two configurations such that PyCall detects a mapping-like interface for Pandas objects and then exposes that interface as a dictionary through an automatic conversion. There are two options here:

  1. Work with the dictionary interface:

    data = quandl.get("WIKI/AAPL", returns = "pandas")
    cols = keys(data)
    df = DataFrame(Any[collect(values(data[c])) for c in cols], map(Symbol, cols))
    
  2. Explicitly disable the auto-conversion and use the PyCall interface to extract the columns as niczky12 demonstrated in the other answer. Note that data[:Open] will do auto-conversion to a mapped dictionary and data["Open"] will just return a PyObject.

    data = pycall(quandl.get, PyObject, "WIKI/AAPL", returns = "pandas")
    cols = data[:columns]
    df = DataFrame(Any[Array(data[c]) for c in cols], map(Symbol, cols))
    

In both cases, though, note that the all-important date index isn't included in the resulting data frame. You almost certainly want to add that as a column:

df[:Date] = collect(data[:index])
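Depending on the PyCall version, the collected index may come back as Python timestamp objects rather than Julia dates. A hedged sketch of one way to force the conversion, assuming the index elements are datetime-like Python objects that expose strftime:

```julia
# Hypothetical sketch: turn pandas Timestamps into Julia Date objects.
# t[:strftime](...) calls the Python object's strftime method via PyCall;
# Date(...) parses the resulting ISO-formatted string.
df[:Date] = [Date(t[:strftime]("%Y-%m-%d")) for t in data[:index]]
```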
Diazomethane answered 23/3, 2017 at 22:53 Comment(2)
There's actually a third option — disable auto-conversion and then wrap the PyObject with the Pandas.jl library. That doesn't get you into the JuliaStats ecosystem, but it simplifies using Pandas analysis functions from within Julia.Diazomethane
Good spot! I had a feeling this issue was due to Python2 vs Python3. Also, I agree the data index should be included.Elvinelvina

Here's one option:

First, extract the column names from your data object:

julia> colnames = map(Symbol, data[:columns]);
12-element Array{Symbol,1}:
 :Open                
 :High                
 :Low                 
 :Close               
 :Volume              
 Symbol("Ex-Dividend")
 Symbol("Split Ratio")
 Symbol("Adj. Open")  
 Symbol("Adj. High")  
 Symbol("Adj. Low")   
 Symbol("Adj. Close") 
 Symbol("Adj. Volume")

Then pour all your columns into a DataFrame:

julia> y = DataFrame(Any[Array(data[c]) for c in colnames], colnames)

6×12 DataFrames.DataFrame
│ Row │ Open  │ High  │ Low   │ Close │ Volume   │ Ex-Dividend │ Split Ratio │
├─────┼───────┼───────┼───────┼───────┼──────────┼─────────────┼─────────────┤
│ 1   │ 28.75 │ 28.87 │ 28.75 │ 28.75 │ 2.0939e6 │ 0.0         │ 1.0         │
│ 2   │ 27.38 │ 27.38 │ 27.25 │ 27.25 │ 785200.0 │ 0.0         │ 1.0         │
│ 3   │ 25.37 │ 25.37 │ 25.25 │ 25.25 │ 472000.0 │ 0.0         │ 1.0         │
│ 4   │ 25.87 │ 26.0  │ 25.87 │ 25.87 │ 385900.0 │ 0.0         │ 1.0         │
│ 5   │ 26.63 │ 26.75 │ 26.63 │ 26.63 │ 327900.0 │ 0.0         │ 1.0         │
│ 6   │ 28.25 │ 28.38 │ 28.25 │ 28.25 │ 217100.0 │ 0.0         │ 1.0         │

│ Row │ Adj. Open │ Adj. High │ Adj. Low │ Adj. Close │ Adj. Volume │
├─────┼───────────┼───────────┼──────────┼────────────┼─────────────┤
│ 1   │ 0.428364  │ 0.430152  │ 0.428364 │ 0.428364   │ 1.17258e8   │
│ 2   │ 0.407952  │ 0.407952  │ 0.406015 │ 0.406015   │ 4.39712e7   │
│ 3   │ 0.378004  │ 0.378004  │ 0.376216 │ 0.376216   │ 2.6432e7    │
│ 4   │ 0.385453  │ 0.38739   │ 0.385453 │ 0.385453   │ 2.16104e7   │
│ 5   │ 0.396777  │ 0.398565  │ 0.396777 │ 0.396777   │ 1.83624e7   │
│ 6   │ 0.420914  │ 0.422851  │ 0.420914 │ 0.420914   │ 1.21576e7   │

Thanks to @Matt B. for the suggestions to simplify the code.

The problem with the above is that the column types are Any inside the DataFrame. To make it a bit more efficient, here are a few functions that get the job done:

# first, guess the Julia equivalent of type of the object
function guess_type(x::PyCall.PyObject)
  string_dtype = x[:dtype][:name]
  julia_string = string(uppercase(string_dtype[1]), string_dtype[2:end])

  return eval(parse(julia_string))
end
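To see what guess_type does with a typical numeric column, the string manipulation can be traced by hand (the dtype name "float64" here is just an illustrative value, standing in for what x[:dtype][:name] might return):

```julia
string_dtype = "float64"          # example dtype name from a pandas column
julia_string = string(uppercase(string_dtype[1]), string_dtype[2:end])
# julia_string == "Float64"
eval(parse(julia_string))         # evaluates to the Julia type Float64
```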

# convert an individual column, falling back to an Any array if the guess was wrong
function convert_column(x)
  return try
    Array{guess_type(x)}(x)
  catch
    Array(x)
  end
end

# put everything together into a single function
function convert_pandas(df)
  colnames = map(Symbol, df[:columns])
  y = DataFrame(Any[convert_column(df[c]) for c in colnames], colnames)

  return y
end

The above, when applied to your data gives the same column names as before, but with correct Float64 column types:

y = convert_pandas(data);
showcols(y)
9147×12 DataFrames.DataFrame
│ Col # │ Name        │ Eltype  │ Missing │
├───────┼─────────────┼─────────┼─────────┤
│ 1     │ Open        │ Float64 │ 0       │
│ 2     │ High        │ Float64 │ 0       │
│ 3     │ Low         │ Float64 │ 0       │
│ 4     │ Close       │ Float64 │ 0       │
│ 5     │ Volume      │ Float64 │ 0       │
│ 6     │ Ex-Dividend │ Float64 │ 0       │
│ 7     │ Split Ratio │ Float64 │ 0       │
│ 8     │ Adj. Open   │ Float64 │ 0       │
│ 9     │ Adj. High   │ Float64 │ 0       │
│ 10    │ Adj. Low    │ Float64 │ 0       │
│ 11    │ Adj. Close  │ Float64 │ 0       │
│ 12    │ Adj. Volume │ Float64 │ 0       │
Elvinelvina answered 6/3, 2017 at 8:57 Comment(5)
It seems you cannot index into a Dict{Any,Any} object using Symbols. I tried using Strings instead; I think this may have changed in a recent version but should work once I figure out the type conversion. Array(data[colname]) returns MethodError: Cannot convert an object of type Dict{Any,Any} to an object of type Array{T,N}. I'm on version 0.5.0.Eagleeyed
I did convert the column names to symbols, using colnames = map(x -> Symbol(String(x)), data[:columns][:values]) Have you done that too? It works fine on my machine, and I'm using 0.5 as well.Elvinelvina
What versions of PyCall and DataFrames are you using? This should work just fine. It can be slightly simpler and add column names with: cols = map(Symbol, data[:columns]); DataFrame(Any[Array(data[c]) for c in cols], cols)Diazomethane
@Eagleeyed I edited the answer using the above suggestions and added functions to make the conversion simpler.Elvinelvina
map(Symbol, data[:columns]) returns ERROR: KeyError: key :columns not found. I tried map(Symbol, keys(data)) which returns a Symbol array, which I cannot use to index into the Dict. I tried colnames = keys(data) to index into the Dict using strings rather than symbols, but this again gives me the above MethodError. I'm very confused by why this would work on your machine and not mine. Did you load any extra packages?Eagleeyed

There is an API. Just use Quandl.jl: https://github.com/milktrader/Quandl.jl

using Quandl
data = quandlget("WIKI/AAPL")

This has the added advantage of getting the data in a useful Julia format (a TimeArray), which has appropriate methods defined for working with such data.
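If a DataFrame is still preferred over a TimeArray, the conversion is straightforward. This is a hedged sketch assuming the TimeSeries.jl field names of that era (values, colnames, timestamp):

```julia
using Quandl, DataFrames

data = quandlget("WIKI/AAPL")  # returns a TimeArray

# Hypothetical sketch: rebuild the TimeArray as a DataFrame column by column
df = DataFrame(Any[data.values[:, i] for i in 1:length(data.colnames)],
               map(Symbol, data.colnames))
df[:Date] = data.timestamp     # keep the date index as its own column
```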

Kinder answered 27/3, 2017 at 15:8 Comment(2)
Thank you for your answer. I am aware of this option (see my question), but this is an unofficial API and its functionality is very limited. For example, I have not managed to selectively retrieve multiple series and have them returned in a date-matched DataFrame object. The official Python API supports this, so I would like to build a wrapper for it instead of using the unofficial API.Eagleeyed
Oh, I didn't realize that was what you linked to. OK then, I hope you can make use of the competent advice in the other answers.Kinder
