In the end, I built the JSON manually with string concatenation:
-- UNLOAD AS JSON
UNLOAD ('SELECT chr(123)||
\'"receiver_idfa":"\'||nvl(receiver_idfa,\'\')||\'",\'||
\'"brand":"\'||nvl(brand,\'\')||\'",\'||
\'"total":\'||nvl(total,0)||\',\'||
\'"screen_dpi":\'||nvl(screen_dpi,0)||\',\'||
\'"city":"\'||nvl(city,\'\')||\'",\'||
\'"wifi":\'||nvl(convert(integer,wifi),0)||\',\'||
\'"duration":\'||nvl(duration,0)||\',\'||
\'"carrier":"\'||nvl(carrier,\'\')||\'",\'||
\'"screen_width":\'||nvl(screen_width,0)||\',\'||
\'"time":\'||nvl("time",0)||\',\'||
\'"ts":"\'||nvl(ts,\'1970-01-01 00:00:00\')||\'",\'||
\'"month":\'||nvl(month,0)||\',\'||
\'"year":\'||nvl(year,0)||\',\'||
\'"day":\'||nvl(day,0)||\',\'||
\'"hour":\'||nvl(hour,0)||\',\'||
\'"minute":\'||nvl(minute,0)||
chr(125) from event_logs')
TO 's3://BUCKET/PREFIX/KEY'
WITH CREDENTIALS AS 'CREDENTIALS...'
GZIP
DELIMITER AS '\t'
;
Here,
- nvl is used to replace nulls
- convert is used to cast booleans to integers
- || is the concatenation operator in Redshift
- chr is used to generate the { and } characters
This operation is not as fast as simply unloading as CSV; it takes about 2-3x longer. But since we only need to do it once, that's fine. I unloaded around 1,600 million records this way and imported all of them into Spark successfully.
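For reference, reading the unloaded files back in Spark is straightforward, since each output line is a standalone JSON object and Spark decompresses the gzipped parts transparently. A minimal PySpark sketch, assuming the files sit under the same S3 location used in the UNLOAD above (bucket, prefix, and application name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-json-import").getOrCreate()

# The UNLOAD produces gzipped files with one JSON object per line;
# spark.read.json decompresses them and infers the schema.
# Adjust the scheme (s3/s3n/s3a) to whatever your Hadoop S3 connector expects.
events = spark.read.json("s3://BUCKET/PREFIX/*")

events.printSchema()
print(events.count())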
Note: Parsing JSON is not the most efficient way to feed Spark. Other formats, such as Parquet or sequence files, are faster to read. So this might not be the right path for Spark specifically, but if you need to unload Redshift data as JSON, you can use this solution.
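If the data will be queried from Spark repeatedly, it can be worth paying the JSON parsing cost only once and persisting the result as Parquet. A rough sketch of that one-time conversion, using the same placeholder paths as above (the output location is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Parse the unloaded JSON files once...
events = spark.read.json("s3://BUCKET/PREFIX/*")

# ...then write them out as Parquet so later jobs read a columnar
# format instead of re-parsing JSON on every run.
events.write.mode("overwrite").parquet("s3://BUCKET/parquet/event_logs/")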