Export data from Amazon Redshift as JSON

We are migrating from Redshift to Spark. I have a table in Redshift that I need to export to S3; from there it will be fed to Apache Spark (EMR).

As far as I can tell, there is only one way to export data from Redshift: the UNLOAD command. UNLOAD cannot export typed data; it produces CSV, which is effectively a table of strings. Depending on the formatting options (quoting, delimiter, etc.), Spark doesn't seem to parse it reliably. So I am looking for a way to unload the data and make sure Spark reads it with the proper types.

Is there any way to unload the data as JSON or some other typed format that Spark can recognize?

Mosera answered 25/10, 2016 at 10:27

In the end I built the JSON manually with string concatenation:

-- UNLOAD AS JSON
UNLOAD ('SELECT CHR(123)||
\'"receiver_idfa":"\'||nvl(receiver_idfa,\'\')||\'",\'||
\'"brand":"\'||nvl(brand,\'\')||\'",\'||
\'"total":\'||nvl(total,0)||\',\'||
\'"screen_dpi":\'||nvl(screen_dpi,0)||\',\'||
\'"city":"\'||nvl(city,\'\')||\'",\'||
\'"wifi":\'||nvl(convert(integer,wifi),0)||\',\'||
\'"duration":\'||nvl(duration,0)||\',\'||
\'"carrier":"\'||nvl(carrier,\'\')||\'",\'||
\'"screen_width":\'||nvl(screen_width,0)||\',\'||
\'"time":\'||nvl("time",0)||\',\'||
\'"ts":"\'||nvl(ts,\'1970-01-01 00:00:00\')||\'",\'||
\'"month":\'||nvl(month,0)||\',\'||
\'"year":\'||nvl(year,0)||\',\'||
\'"day":\'||nvl(day,0)||\',\'||
\'"hour":\'||nvl(hour,0)||\',\'||
\'"minute":\'||nvl(minute,0)||
chr(125) from event_logs')
TO 's3://BUCKET/PREFIX/KEY'
WITH CREDENTIALS AS 'CREDENTIALS...' 
GZIP
DELIMITER AS '\t'
;

Here,

  1. nvl is used to replace NULLs with default values
  2. convert casts booleans to integers
  3. || is the string concatenation operator in Redshift
  4. chr(123) and chr(125) generate the { and } characters, which would otherwise terminate the quoted query string

This operation is not as fast as a plain CSV unload; it takes roughly 2-3x longer. But since we only needed to do it once, that was fine. I unloaded around 1.6 billion records this way and imported all of them into Spark successfully.
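
For reference, here is a minimal sketch of how such an import can look in PySpark. The S3 path is a placeholder, and the field types are my assumptions mirroring the columns in the UNLOAD query above, so adjust both to your table.

# Hypothetical sketch: read the gzipped JSON lines produced by the UNLOAD
# above into Spark with an explicit schema. The S3 path and the field types
# are assumptions mirroring the columns in the query.
from pyspark.sql import SparkSession
from pyspark.sql.types import (IntegerType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.appName("redshift-json-import").getOrCreate()

schema = StructType([
    StructField("receiver_idfa", StringType()),
    StructField("brand", StringType()),
    StructField("total", LongType()),
    StructField("screen_dpi", IntegerType()),
    StructField("city", StringType()),
    StructField("wifi", IntegerType()),
    StructField("duration", LongType()),
    StructField("carrier", StringType()),
    StructField("screen_width", IntegerType()),
    StructField("time", LongType()),
    StructField("ts", StringType()),
    StructField("month", IntegerType()),
    StructField("year", IntegerType()),
    StructField("day", IntegerType()),
    StructField("hour", IntegerType()),
    StructField("minute", IntegerType()),
])

# Spark decompresses .gz input transparently; each line is one JSON object.
df = spark.read.schema(schema).json("s3://BUCKET/PREFIX/")
df.printSchema()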

Note: Parsing JSON in Spark is not the most efficient approach; other formats, such as Parquet and SequenceFile, are faster to read. So for Spark this might not be the right long-term path, but for unloading as JSON you can use this solution. A one-time conversion to Parquet after the import is sketched below.
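
Continuing the hedged sketch above, that one-time conversion is a single line (the output path is a placeholder):

# Hypothetical continuation of the read sketch above: persist the DataFrame
# as Parquet so later Spark jobs avoid re-parsing JSON.
df.write.mode("overwrite").parquet("s3://BUCKET/PREFIX-parquet/")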

Mosera answered 16/11, 2016 at 12:15
This is awesome and helped me out a lot! Thank you! – Newcastle

Check out the spark-redshift library, which is designed to allow Apache Spark to do bulk reads from Redshift using UNLOAD; it automatically manages the escaping and schema handling.

You can either run Spark queries directly against the data loaded from Redshift or you can save the Redshift data into a typed format like Parquet and then query that data.
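
For illustration, a hedged sketch of such a bulk read from PySpark follows; the JDBC URL, credentials, staging bucket, and table name are placeholders, and the option names follow the library's README (credentials setup for S3 is omitted, see the README):

# Hypothetical sketch: bulk-read a Redshift table via the spark-redshift
# connector, then persist it as Parquet. All connection details, the
# tempdir staging bucket, and paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read").getOrCreate()

df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://HOST:5439/dev?user=USER&password=PASS")
      .option("dbtable", "event_logs")
      .option("tempdir", "s3n://BUCKET/tmp/")  # staging area used by UNLOAD
      .load())

# Save into a typed format such as Parquet for later queries.
df.write.parquet("s3://BUCKET/event_logs_parquet/")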

Full disclosure: I'm the primary maintainer of that library.

Nato answered 15/11, 2016 at 1:02
We used this first. spark-redshift uses S3 for transferring data. I was wondering if there is any way to use this library to just unload all the data for good? I'll process it with Spark later on, but it'll stay in S3. – Mosera

Since Redshift is an RDBMS based on PostgreSQL, there might not be an easy way to extract to JSON directly. After you run the UNLOAD command, which spits out CSV, you could convert that CSV to JSON format; see the tools below, or the sketch after them.

Here is a GitHub project: https://github.com/darwin/csv2json

If you would like a command-line tool: https://www.npmjs.com/package/csvtojson
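
If you'd rather script the conversion yourself, here is a minimal Python sketch. The file paths, delimiter, and column names are placeholders; since UNLOAD output has no header row, the field names must be supplied explicitly:

# Hypothetical sketch: convert an unloaded CSV file to JSON lines.
# File paths, the delimiter, and the column names are placeholders.
import csv
import json

FIELDS = ["receiver_idfa", "brand", "total"]  # match your table's columns

with open("unloaded.csv", newline="") as src, open("out.json", "w") as dst:
    for row in csv.DictReader(src, fieldnames=FIELDS, delimiter="|"):
        # Note: every value stays a string; type coercion still has to
        # happen downstream (e.g., via an explicit Spark schema).
        dst.write(json.dumps(row) + "\n")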

Bathyscaphe answered 15/11, 2016 at 0:48

Starting in February 2022, Redshift natively supports unloading data in JSON format via the UNLOAD command's FORMAT AS JSON option. Refer to the AWS blog post for further details.
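
For example, a hedged sketch of issuing such a native JSON unload from Python; the connection details, IAM role ARN, and S3 path are placeholders, and FORMAT AS JSON is the option documented for UNLOAD:

# Hypothetical sketch: run a native JSON UNLOAD via psycopg2.
# Connection details, the IAM role ARN, and the S3 path are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="PASSWORD",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        UNLOAD ('SELECT * FROM event_logs')
        TO 's3://BUCKET/PREFIX/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS JSON
        GZIP;
    """)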

Vaticide answered 20/6, 2022 at 18:47
This is a link-only answer. Please elaborate more in the answer itself. – Mosera
