String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: bf (error 3)

Team, I am using Redshift (version 8.0.2). While loading data using the COPY command, I get the error: "String contains invalid or unsupported UTF8 codepoints, Bad UTF8 hex sequence: bf (error 3)". It seems COPY is trying to load the UTF-8 byte 0xbf into a VARCHAR field. The Amazon Redshift documentation defines error code 3 as follows:

Error code 3:
The UTF-8 single-byte character is out of range. The starting byte must not be 254, 255,
or any character between 128 and 191 (inclusive).
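
In other words, a byte like 0xbf can only appear as a continuation byte, never at the start of a character. A minimal Python sketch for locating such bytes in a file (the file name is a placeholder, not from the original post):

# find_bad_utf8.py -- report the offset and hex value of bytes that fail UTF-8 decoding.
with open("input.txt", "rb") as f:   # placeholder file name
    data = f.read()

pos = 0
while pos < len(data):
    try:
        data[pos:].decode("utf-8")
        break  # everything from pos onward decodes cleanly
    except UnicodeDecodeError as e:
        bad = pos + e.start
        print(f"offset {bad}: invalid byte 0x{data[bad]:02x}")
        pos = bad + 1  # skip the offending byte and keep scanning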

Amazon recommends, as a solution, replacing the character with a valid UTF-8 code sequence or removing the character.

Could you please help me with how to replace the character with a valid UTF-8 code sequence?

When I checked the database properties in pgAdmin, it shows the encoding as UTF-8.

Please guide me on how to replace the character in the input delimited file.

Thanks...

Fries answered 1/3, 2014 at 3:49 Comment(1)
Show us the code you're using that gives you this, along with a minimal repeatable example, and I'll try to help. – Forrester

I've run into this issue in Redshift while loading TPC-DS datasets for experiments.

Here is the documentation and forum chatter I found via AWS: https://forums.aws.amazon.com/ann.jspa?annID=2090

And here are the explicit commands you can use to resolve data conversion errors: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-acceptinvchars

You can explicitly replace the invalid UTF-8 characters, or disregard them altogether during the COPY phase, by specifying ACCEPTINVCHARS.

Try this:

copy table from 's3://my-bucket/my-path'
credentials 'aws_iam_role=<your role arn>'
ACCEPTINVCHARS
delimiter '|' region 'us-region-1';

Warnings:
Load into table 'table' completed, 500000 record(s) loaded successfully.
Load into table 'table' completed, 4510 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.

0 rows affected
COPY executed successfully

Execution time: 33.51s
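
To see exactly which rows and bytes were replaced, you can query the stl_replacements system table mentioned in the warning. A sketch using psycopg2; the connection details are placeholders, not values from the original answer:

# inspect_replacements.py -- list the replacements ACCEPTINVCHARS made during the load.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dev", user="awsuser", password="...",     # placeholders
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT tbl, filename, line_number, colname, raw_line
        FROM stl_replacements
        ORDER BY starttime DESC
        LIMIT 20;
    """)
    for row in cur.fetchall():
        print(row)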
Roush answered 24/2, 2017 at 21:0 Comment(2)
I was unable to use ACCEPTINVCHARS because my source data was in Parquet. Creating a Spectrum table and performing an INSERT INTO ... SELECT worked for me. – Swinford
I used ACCEPTINVCHARS, but special characters such as "ä" appeared as a question mark "?"! – Pajamas
P
4

It sounds like the encoding of your file might not be UTF-8. You might try this technique that we use sometimes:

cat myfile.tsv | iconv -c -f ISO-8859-1 -t utf8 > myfile_utf8.tsv
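
If iconv isn't available, the same conversion can be done in Python. A minimal sketch with the same placeholder file names, assuming the source really is ISO-8859-1:

# reencode.py -- ISO-8859-1 -> UTF-8 (every byte is valid ISO-8859-1, so nothing is lost).
with open("myfile.tsv", "r", encoding="iso-8859-1") as src:
    text = src.read()
with open("myfile_utf8.tsv", "w", encoding="utf-8") as dst:
    dst.write(text)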
Polycrates answered 1/3, 2014 at 18:14 Comment(1)
This helped – I can also recommend chardet to attempt to sniff out your source encoding, if that's not known. – Judicious
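
Following up on that comment, a minimal chardet sketch for guessing a file's encoding (the file name is a placeholder):

# sniff_encoding.py -- guess the source encoding with chardet.
import chardet

with open("myfile.tsv", "rb") as f:          # placeholder file name
    guess = chardet.detect(f.read(100_000))  # a sample of the file is usually enough
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}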

Many people loading CSVs into databases get their files from someone using Excel, or have access to Excel themselves. If so, this problem is quickly solved by:

  1. Saving the file out of Excel using Save As and selecting the CSV UTF-8 (Comma delimited) (*.csv) format, and requesting/training those giving you the files to use this export format. Note that many people by default export using the CSV (Comma delimited) (*.csv) format, and there is a difference.
  2. Loading the CSV into Excel and then immediately saving it back out in the UTF-8 CSV format.

Of course this won't work for files Excel can't handle, i.e. those larger than about 1 million rows. In that case I would use the iconv suggestion from the earlier answer.

Kingsbury answered 29/8, 2019 at 22:31 Comment(1)
Thanks; saving to CSV (Comma delimited) (*.csv) from Excel solved the issue. – Pajamas

I noticed that an Athena external table is able to parse data that the Redshift COPY command cannot. You can use the alternative approach below when encountering "String contains invalid or unsupported UTF8 codepoints Bad UTF8 hex sequence: 8b (error 3)".

Follow the steps below if you want to load the data into Redshift database db2 and table table2.

  • Have a Glue crawler IAM role ready that has access to S3.

  • Run the crawler (a boto3 sketch for this step follows the list).

  • Validate the table and database in Athena created by the Glue crawler, say external database db1_ext and table table1_ext.

  • Log in to Redshift and create the link to the Glue Catalog by creating a Redshift external schema (db1_schema) using the command below.

    CREATE EXTERNAL SCHEMA db1_schema FROM DATA CATALOG DATABASE 'db1_ext' IAM_ROLE 'arn:aws:iam:::role/my-redshift-cluster-role';

  • Load from the external table:

    INSERT INTO db2.table2 (SELECT * FROM db1_schema.table1_ext);
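
For the crawler step, here is a minimal boto3 sketch that starts a Glue crawler and waits for it to finish; the crawler name is a placeholder, not from the original answer:

# run_crawler.py -- start a Glue crawler and poll until the run completes.
import time
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="my-s3-crawler")  # placeholder crawler name

while True:
    state = glue.get_crawler(Name="my-s3-crawler")["Crawler"]["State"]
    if state == "READY":  # the crawler returns to READY once the run is done
        break
    time.sleep(15)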

Scandium answered 17/8, 2021 at 0:50 Comment(0)
