Livy Server: return a dataframe as JSON?

I am executing a statement in Livy Server using HTTP POST call to localhost:8998/sessions/0/statements, with the following body

{
  "code": "spark.sql(\"select * from test_table limit 10\")"
}

I would like an answer in the following format

(...)
"data": {
  "application/json": "[
    {"id": "123", "init_date": 1481649345, ...},
    {"id": "133", "init_date": 1481649333, ...},
    {"id": "155", "init_date": 1481642153, ...},
  ]"
}
(...)

but what I'm getting is

(...)
"data": {
  "text/plain": "res0: org.apache.spark.sql.DataFrame = [id: string, init_date: timestamp ... 64 more fields]"
}
(...)

This is the toString() representation of the DataFrame's schema, not its rows.

Is there some way to return a dataframe as JSON using the Livy Server?

EDIT

Found a JIRA issue that addresses the problem: https://issues.cloudera.org/browse/LIVY-72

Judging by the comments, it appears that Livy does not and will not support this feature?

Doubly answered 13/12, 2016 at 17:23 Comment(0)

I don't have a lot of experience with Livy, but as far as I know this endpoint behaves like an interactive shell: the output is the string the shell would print. With that in mind, I can think of a way to emulate the result you want, though it may not be the best way to do it:

{
  "code": "println(spark.sql(\"select * from test_table limit 10\").toJSON.collect.mkString(\"[\", \",\", \"]\"))"
}

Then, you will have a JSON wrapped in a string, so your client could parse it.
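
On the client side, a minimal parsing sketch of that workaround, assuming Livy's usual statement response shape (the row data and the `parse_statement_output` helper name are made up for illustration):

```python
import json

def parse_statement_output(statement):
    """Parse the JSON string that the println trick places in the
    statement's text/plain output.

    `statement` is the dict returned by GET /sessions/{id}/statements/{n};
    the output -> data -> text/plain nesting follows Livy's REST responses.
    """
    text = statement["output"]["data"]["text/plain"]
    return json.loads(text)

# A statement payload shaped like Livy's response (row data made up):
sample = {
    "id": 0,
    "state": "available",
    "output": {
        "status": "ok",
        "data": {"text/plain": '[{"id": "123"}, {"id": "133"}]'},
    },
}
rows = parse_statement_output(sample)
print(rows[0]["id"])  # → 123
```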

Perpetuity answered 14/12, 2016 at 8:27 Comment(1)
That was it! According to the JIRA issue, Livy wasn't actually meant to do what I wanted, but your solution works perfectly, thanks!Doubly

I recommend using the built-in (albeit sparsely documented) magics %json and %table:

%json

import json
import textwrap
import requests

# host and headers are assumed to be defined elsewhere, e.g.
# host = "http://localhost:8998", headers = {'Content-Type': 'application/json'}
session_url = host + "/sessions/1"
statements_url = session_url + '/statements'
data = {
        'code': textwrap.dedent("""\
        val d = spark.sql("SELECT COUNT(DISTINCT food_item) FROM food_item_tbl")
        val e = d.collect
        %json e
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())

%table

session_url = host + "/sessions/21"
statements_url = session_url + '/statements'
data = {
        'code': textwrap.dedent("""\
        val x = List((1, "a", 0.12), (3, "b", 0.63))
        %table x
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())
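
Note that Livy statements run asynchronously: the POST returns before the result exists, and the client GETs the statement until its state becomes available. A rough polling sketch, where statement_url and headers are assumed to be set up as in the POSTs above, and the helper names are mine, not Livy's:

```python
import time
import requests

def statement_is_ready(statement):
    # Livy reports the statement lifecycle via the 'state' field;
    # 'available' means the output can be read.
    return statement.get("state") == "available"

def wait_for_statement(statement_url, headers, interval=1.0, timeout=60.0):
    """Poll GET /sessions/{id}/statements/{n} until the result is available."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        statement = requests.get(statement_url, headers=headers).json()
        if statement_is_ready(statement):
            return statement
        time.sleep(interval)
    raise TimeoutError("Livy statement did not finish in time")
```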

Related: Apache Livy: query Spark SQL via REST: possible?

Instauration answered 11/10, 2017 at 18:49 Comment(6)
Do these statements need to be part of the calling application, or can they be saved server-side and retrieved later? Which is the preferred way?Lavish
These magics (%json and %table) are only useful when called from the application. Caching the data frame(s) that you ultimately use to derive the final results in your Livy session would likely be very wiseInstauration
So if I run a Spark job via Livy and cache() a DataFrame, I can pass that DataFrame to the magics (%json and %table) to return it to the client?Lavish
Yes @UtkarshSaraf that is correct. I believe (based on my own code example above) that you need to collect the results to a list first rather than calling the magics directly on a Spark DataFrame.Instauration
I have been able to execute the code above, but I only get the JSON data when I issue a GET. Is it possible to link the statement's output directly to the batch and get the results without the extra GET call?Lavish
You should probably mention that this is Python code not native Spark. Do you have any reference for the magics?Gimpel

I think in general your best bet is to write your output to a database of some kind. If you write to a randomly named table, your client code can read it back after the job finishes.
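
A small sketch of the staging-table idea: generate a collision-safe table name on the client, have the Livy statement write to it, then read it back. The helper name and prefix are arbitrary; the Spark write itself is only shown in comments since it runs inside the Livy session:

```python
import uuid

def staging_table_name(prefix="livy_result"):
    # uuid4 makes the throwaway table name effectively collision-proof.
    return "{}_{}".format(prefix, uuid.uuid4().hex)

table = staging_table_name()

# The Livy statement would then write to it, e.g. (Scala, not run here):
#   spark.sql("select * from test_table limit 10").write.saveAsTable("<table>")
# and the client reads the table back over its own Spark/JDBC connection
# once the statement reports state "available".
print(table)
```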

Judas answered 15/3, 2017 at 22:9 Comment(1)
I think it depends on the size of the dataframe. I've run into issues where even though I tried to send the data back as JSON, Livy ran out of memory receiving it. (Spark job master/executors were fine on memory, Livy ran out. Probably adjustable.)Deaconry
