AWS Datapipeline - issue with accented characters
I am new to AWS Data Pipeline. I created a pipeline that successfully pulls all the content from RDS into an S3 bucket, and I can see my .csv file there. However, I store Spanish names in my table, and in the CSV I see "Garc�a" instead of "García".

Rucksack answered 12/1, 2017 at 21:19 Comment(0)

It looks like the wrong codepage is being used. Reference the correct codepage and you should be fine. This topic might help: Text files uploaded to S3 are encoded strangely?

Scrannel answered 19/1, 2017 at 15:16 Comment(3)
The S3 files are dynamically generated. I need a solution that fixes it in the data pipeline.Rucksack
Is the pipeline transfer changing your data? That would make no sense, but I could be wrong. Your data is being misinterpreted by either your export or your import process through use of a codepage that is wrong for your needs.Scrannel
My data is fine. My APIs are working and serving the website correctly. I am trying to get a downloadable CSV, which is what is giving me trouble. S3 is not the issue either.Rucksack

AWS DataPipeline is implemented in Java, and uses JDBC (Java Database Connectivity) drivers (specifically, MySQL Connector/J for MySQL in your case) to connect to the database. According to the Using Character Sets and Unicode section of the documentation, the character set used by the connector is automatically determined based on the character_set_server system variable on the RDS/MySQL server, which is set to latin1 by default.
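As a quick sanity check, the server-side character-set configuration can be inspected from any MySQL client. The output shown in the comment is illustrative of a default RDS/MySQL setup (where character_set_server is latin1), not a guaranteed result for your instance:

```sql
-- Inspect the character-set configuration the server reports.
SHOW VARIABLES LIKE 'character%';

-- Illustrative output on a default RDS/MySQL instance (your values may differ):
--   character_set_client     | utf8
--   character_set_connection | utf8
--   character_set_server     | latin1
```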

If this setting is not correct for your application (run SHOW VARIABLES LIKE 'character%'; in a MySQL client to confirm), you have two options to correct this:

  1. Set character_set_server to utf8 on your RDS/MySQL server. To make this change permanently from the RDS console, see Modifying Parameters in a DB Parameter Group for instructions.
  2. Pass additional JDBC properties in your DataPipeline configuration to override the character set used by the JDBC connection. For this approach, add the following JDBC properties to your RdsDatabase or JdbcDatabase object (see properties reference):

    "jdbcProperties": "useUnicode=true,characterEncoding=UTF-8"

Singularize answered 22/1, 2017 at 2:31 Comment(5)
I'm not 100% confident about the syntax for passing multiple properties to jdbcProperties: the documentation only says "Pairs of the form A=B that will be set as properties on jdbc connections for this database". It might instead be useUnicode=true&characterEncoding=UTF-8 or something else entirely. Let me know which form works if you try this option.Singularize
You are right. It gave me an error - The connection property 'allowMultiQueries' only accepts values of the form: 'true', 'false', 'yes' or 'no'. The value 'true,useUnicode=true,characterEncoding=UTF-8' is not in this set.Rucksack
The & form is not correct either: The connection property 'allowMultiQueries' only accepts values of the form: 'true', 'false', 'yes' or 'no'. The value 'true&useUnicode=true&characterEncoding=UTF-8' is not in this set.Rucksack
OK, here are two more syntax ideas to try: 1. Multiple jdbcProperties keys, one for each property you wish to set: "jdbcProperties": "useUnicode=true", "jdbcProperties": "characterEncoding=UTF-8"; 2. Pass an array to jdbcProperties: "jdbcProperties": ["useUnicode=true", "characterEncoding=UTF-8"]. Let me know if either works.Singularize
Added as per your instructions. No errors. Thank you. But... the result is the same: "Gonz�lez" :(Rucksack

This question looks similar to Text files uploaded to S3 are encoded strangely?. If that is the case, see my answer there.

Meathead answered 21/1, 2017 at 19:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.