Incorrect string value error for unconventional characters
Asked Answered
R

2

6

So I'm using a wrapper to fetch user data from instagram. I want to select the display names for users, and store them in a MYSQL database. I'm having issues inserting some of the display names, dealing with, specifically, an incorrect string value error:

Now, I've dealt with this issue before with accent marks, letters with umlauts, etc. The solution would be to change the collation to utf8_general_ci under the utf8 charset.

So as you can see, some of the display names I'm pulling have very unique characters that I'm not sure mySQL can recognize at all, i.e.:

แ›˜๐•ฐ๐–†๐–—๐–™๐– ๐•พ๐–•๐–Ž๐–—๐–Ž๐–™๐–š๐–˜๐‚‚ยฎ

So I receive:

Error Code: 1366. Incorrect string value: '\xF0\x9D\x99\x87\xF0\x9D...' for column 'dummy' at row 1

Here's my sql code

CREATE TABLE test_table(
    id INT AUTO_INCREMENT,
    dummy VARCHAR(255),
    PRIMARY KEY(id)
);
    
INSERT INTO test_table (dummy)
VALUES ('แ›˜๐•ฐ๐–†๐–—๐–™๐– ๐•พ๐–•๐–Ž๐–—๐–Ž๐–™๐–š๐–˜๐‚‚ยฎ');

Any thoughts on a proper charset + collation pair that can handle characters like this? Not sure where to look for a solution, so I come here to see if anyone dealt with this.

P.S., I've tried utf8mb4 charset with utf8mb4_unicode_ci and utf8mb4_bin collations as well.

Reynoso answered 10/5, 2022 at 2:19 Comment(6)
What does this report: select character_set_name from information_schema.columns where table_name='test_table' and column_name='dummy'; โ€“ Dinghy
I predict it's still utf8mb3, because you probably only used ALTER TABLE to change the default character set. That doesn't change existing columns. You need to change them with ALTER TABLE test_table CONVERT TO CHARSET utf8mb4 to change existing columns. After I did that, I was able to insert your example string to the table in my test. โ€“ Dinghy
1. utf8 is the returned result โ€“ Reynoso
2. Actually, I just manually changed the schema in the MYSQL workbench UI. I see, so using the ALTER TABLE and running will change existing columns. Let me try. โ€“ Reynoso
This question has been answered. How do I close the question when an answer was given in the comments? โ€“ Reynoso
@Reynoso The best approach is to post an answer to your own question. Be sure to also accept your answer, and credit the poster that helped you find a solution. It is perfectly OK to answer your own question, and doing that is much more helpful to the community than having an unanswered question with a solution buried in the comments. โ€“ Masoretic
D
3

The characters you show require the column use the utf8mb4 encoding. Currently it seems your column is defined with the utf8mb3 encoding.

The way MySQL uses the name "utf8" is complicated, as described in https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html:

Note

Historically, MySQL has used utf8 as an alias for utf8mb3; beginning with MySQL 8.0.28, utf8mb3 is used exclusively in the output of SHOW statements and in Information Schema tables when this character set is meant.

At some point in the future utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

You should also be aware that the utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead.

You may have tried to change your table in the following way:

ALTER TABLE test_table CHARSET=utf8mb4;

But that only changes the default character set, to be used if you add new columns to the table subsequently. It does not change any of the current columns. To do that:

ALTER TABLE test_table MODIFY COLUMN dummy VARCHAR(255) CHARACTER SET utf8mb4;

Or to convert all string or TEXT columns in a table in one statement:

ALTER TABLE test_table CONVERT TO CHARACTER SET utf8mb4;
Dinghy answered 10/5, 2022 at 15:10 Comment(1)
Also, the connection needs to declare utf8mb4. โ€“ Tub
T
0

That would be ๐™‡ - L MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL L

It requires the utf8mb4 Character set to even represent it. "F0" is the clue; it is the first of 4 bytes in a 4-byte UTF-8 character. It cannot be represented in MySQL's "utf8". Collation is (mostly) irrelevant.

Most, not all, of the characters in แ›˜๐•ฐ๐–†๐–—๐–™๐– ๐•พ๐–•๐–Ž๐–—๐–Ž๐–™๐–š๐–˜๐‚‚ยฎ also need utf8mb4. They are "MATHEMATICAL BOLD FRAKTUR" letters.

(Meanwhile, Bill gives you more of an answer.)

Tub answered 13/5, 2022 at 3:43 Comment(0)

© 2022 - 2024 โ€” McMap. All rights reserved.