Loading Redshift from S3 (with partitions)

My S3 data (s3://MY_s3/stage/my_s3_file) is partitioned on load_dt:

S3 Structure:

-MY_S3
 --stage
  ---my_s3_file
    ----load_dt=2016-02-01
         ----000024
    ----load_dt=2016-02-02
         ----000036

The actual data files sit under the load_dt partition directories and are named like 000024 and 000036.

The COPY command works fine if I leave load_dt out of the Redshift table, but when I add load_dt to the Redshift table, the COPY fails with a data error because the input layout and the target layout do not match (the target has the extra load_dt column).

The Hive table that produces these S3 files lists the partition column load_dt at the end of its schema.

How can I make this COPY command work with load_dt present in the target Redshift table?

I even tried changing the S3 location to s3://MY_s3/stage/my_s3_file/load_dt, but no luck.

Dolliedolloff answered 15/4, 2016 at 21:28 Comment(0)

I think I found the answer for my case.

I was not able to load the Hive partition column because Hive stores that partition value in its metadata (and in the directory name), which is why the column was not present in the S3 files.

Now I am adding a new column, Load_Dt_New, to the S3 output through Hive, so the files contain the column that my Redshift COPY command needs.
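
For reference, a minimal HiveQL sketch of this workaround (table and column names below are made up, not taken from the original job): write into a non-partitioned table whose last column simply repeats the partition value, so the value ends up inside the data files.

-- Hypothetical names; the point is that load_dt_new is an ordinary column,
-- so its value is written into the S3 files that Redshift will COPY.
CREATE TABLE my_s3_file_flat (
    col_a        STRING,
    col_b        STRING,
    load_dt_new  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://MY_s3/stage/my_s3_file_flat/';

INSERT OVERWRITE TABLE my_s3_file_flat
SELECT col_a, col_b, load_dt   -- the partition value becomes ordinary data
FROM my_s3_file;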

Dolliedolloff answered 16/4, 2016 at 2:45 Comment(0)

When Hive (running under Apache Hadoop) creates a partitioned EXTERNAL TABLE, it separates files by directory. For example, all rows where load_dt=2016-02-01 are stored in the directory called load_dt=2016-02-01.

The files within the directory do not store a value for the partition column (load_dt). Instead, the value of the partition column is stored as part of the directory name. As such, renaming the directory will actually change the value in that column for all rows within the directory.

Yes, it's a bit strange but that's how Hive stores its data!
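
As a rough illustration (hypothetical DDL, not the asker's actual table):

CREATE EXTERNAL TABLE my_s3_file (
    col_a STRING,
    col_b STRING
)
PARTITIONED BY (load_dt STRING)
LOCATION 's3://MY_s3/stage/my_s3_file/';

-- Rows for load_dt=2016-02-01 land in .../my_s3_file/load_dt=2016-02-01/000024,
-- and the files in that directory contain only col_a and col_b.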

Amazon Redshift can import CSV files (including compressed CSV files) from Amazon S3. It can even import files from multiple sub-directories because it only looks at the path prefix of files to load. However, it does not understand the way that Hive stores partitioned data so it will not load the partition column from the directory name.

Some options:

  • Add another column in the output data that contains the same value (as you have now done)
  • Run a command after the COPY that sets the column value based on some calculation (eg from a date field) – see the sketch after this list
  • Load one directory at a time (complex to achieve and not as efficient)
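
A hedged sketch of the second option, assuming the loaded data has some other timestamp column (event_ts below is an invented name) from which the date can be derived:

-- Back-fill the partition column after the COPY; table and column names are placeholders.
UPDATE my_redshift_table
SET load_dt = TRUNC(event_ts)
WHERE load_dt IS NULL;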
Selfhelp answered 16/4, 2016 at 23:39 Comment(0)

"copy table from 's3://MY_s3/stage/my_s3_file/load_dt' " should work if you can change the file name to "load_dt_20160201" rather than "load_dt=2016-02-01". COPY command will take the last part of S3 path as prefix.

Thaine answered 15/4, 2016 at 22:38 Comment(1)
I am able to load the table without load_dt using the S3 location s3://MY_s3/stage/my_s3_file/, but I am not able to add the last column, which is the partition column load_dt. Using s3://MY_s3/stage/my_s3_file/load_dt=2016-02-01 loads only that partition's data, but still without any load_dt value. – Dolliedolloff

The Redshift COPY command raises errors when the table schema and the Parquet columns do not match. So when you use a range (daily) partition, you may be able to use the script below, which loads each daily partition into its own temporary table.

# Connection settings picked up by psql (placeholder values).
export PGUSER=sample
export PGPASSWORD=sample
export PGHOST=xxxx.xxxx.redshift.amazonaws.com
export PGPORT=5439
export PGDATABASE=xxxx

# First day of the range (epoch seconds, GNU date) and the S3 prefix that ends
# with the partition column name, e.g. .../range_column_name=2018-01-01
start_date=`date -d '2018-01-01' +%s`
base_s3='s3://bucket_and_table_root_path/range_column_name='

# Load each daily partition into its own temporary table.
for day in {0..364}
do
  base_ymd=`date -d @$((${start_date} + 3600 * 24 * ${day})) +'%Y-%m-%d'`
  base_ymd_lower=`echo ${base_ymd} | sed 's/-/_/g'`   # 2018-01-01 -> 2018_01_01 for the table name
  echo ${base_s3}${base_ymd}
  echo 'start-time' `date`
  psql <<EOF
DROP TABLE IF EXISTS loading_table_${base_ymd_lower}_temp;
CREATE TABLE loading_table_${base_ymd_lower}_temp (
    col_a VARCHAR(xxxx) ENCODE LZO,
    ...
)
DISTSTYLE EVEN
;
COPY loading_table_${base_ymd_lower}_temp
FROM '${base_s3}${base_ymd}'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx'
FORMAT AS PARQUET
;
EOF
done

Next, you can combine the daily tables with a CTAS using the UNION (ALL) keyword in Redshift, as sketched below.
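
A rough sketch of that final step, following the script's naming convention (the date literal added per branch is an assumption, used to recover the partition value that the Parquet files themselves do not contain):

-- Combine the per-day temporary tables into one table.
CREATE TABLE loading_table AS
SELECT *, DATE '2018-01-01' AS load_dt FROM loading_table_2018_01_01_temp
UNION ALL
SELECT *, DATE '2018-01-02' AS load_dt FROM loading_table_2018_01_02_temp
-- ... one SELECT per daily table ...
;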

Outlast answered 29/1, 2020 at 6:14 Comment(0)

You can use the Redshift federated query feature to achieve this.

  1. Create an external schema in Redshift pointing to Hive.

    Example:

    create external schema "redshift_external_schema"
    from hive metastore
    database 'hive_database'
    uri 'hive.host.name.com' port 9083
    iam_role 'arn:aws:iam::2188*********:role/redshift-role-name';
    
  2. Use CTAS statement

    create table redshift_schema.table_name as
    select * from redshift_external_schema.table_name;


Note: this will not work if your Hive tables have complex data types such as array, map, or struct. In that case, you need to run the CREATE TABLE DDL in Redshift yourself and use the COPY command (a sketch follows).
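
A minimal sketch of that fallback (the column definitions, S3 path, and IAM role below are placeholders):

-- Define the target layout explicitly in Redshift, then load with COPY.
CREATE TABLE redshift_schema.table_name (
    col_a   VARCHAR(256),
    load_dt DATE
);

COPY redshift_schema.table_name
FROM 's3://MY_s3/stage/my_s3_file/'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/redshift-role-name'
FORMAT AS PARQUET;   -- or CSV/delimited, matching how the Hive table stores its data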

Calendar answered 4/1, 2022 at 22:20 Comment(0)
