Getting question marks when inserting Hebrew characters into a MySQL table
Asked Answered
M

2

5

I'm using Netbeans building a web application using Java, JSP that handle a database with Hebrew fields.

The DDL is as follows:

String cityTable = "CREATE TABLE IF NOT EXISTS hebrew_test.table ("
                            +"id int(11) NOT NULL AUTO_INCREMENT,"
                            +"en varchar(30) NOT NULL,"
                            +"he varchar(30) COLLATE utf8_bin NOT NULL,"
                            +"PRIMARY KEY (id)"
                            +") ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1;";
String insert = "INSERT INTO hebrew_test.table (en, he) VALUES ('A','a')";
String insert2 = "INSERT INTO hebrew_test.table (en, he) VALUES ('B','ב')";
String insert3 = "INSERT INTO hebrew_test.table (en, he) VALUES ('C','אבג')";


executeSQLCommand(cityTable);
executeSQLCommand(insert);
executeSQLCommand(insert2);
executeSQLCommand(insert3);

The output tabel I get:

1   A   a
2   B   ?
3   C   ???

Instead of:

1   A   a
2   B   ב
3   C   אבג

I tried Hebrew appears as question marks in Netbeans, but that isn't the same problem. I get the question marks in the table.

Also I defined the table to be in UTF8_bin as you can see in the above code.

Mawson answered 1/1, 2013 at 18:15 Comment(2)
What database are you using? (Also, it's 'plumber').Anjaanjali
im using SQL database, the code for builing the db: _statement.executeUpdate("CREATE DATABASE IF NOT EXISTS prodb");Mawson
E
7

You need to tell the JDBC driver to use UTF-8 encoding while decoding the characters representing the SQL query to bytes. You can do that by adding useUnicode=yes and characterEncoding=UTF-8 query parameters to the JDBC connection URL.

jdbc:mysql://localhost:3306/db_name?useUnicode=yes&characterEncoding=UTF-8

It will otherwise use the operating system platform default charset. The MySQL JDBC driver is itself well aware about the encoding used in both the client side (where the JDBC code runs) and the server side (where the DB table is). Any character which is not covered by the charset used by the DB table will be replaced by a question mark.

See also:

Eulaliaeulaliah answered 2/1, 2013 at 2:37 Comment(1)
You are the best, my friend!!! I've been searching the solution for a long time. Changed almost all the character sets and collations :D Thanks a lot!Iow
K
2

You're including your values directly into the SQL. That's always a bad idea. Use a PreparedStatement, parameterized SQL, and set the values as parameters. It may not fix the problem - but it's definitely the first thing to attempt, as you should be using parameterized SQL anyway. (Parameterized SQL avoids SQL injection attacks, separates code from data, and avoids unnecessary conversions.)

Next, you should work out exactly where the problem is really occurring:

  • Make sure that the value you're trying to insert is correct.
  • Check that the value you retrieve is correct.
  • Check what's in your web response using Wireshark - check the declared encoding and what's in the actual data

When checking the values, you should iterate over each character in the string and print out the value as a UTF-16 code unit (either use toCharArray() or use charAt() in a loop). Just printing the value to the console leaves too much chance of other problems.

EDIT: For a little context of why I wrote this as an answer:

  • In my experience, including string values as parameters rather than directly into SQL can sometimes avoid such issues (and is of course better for security reasons etc).
  • In my experience, diagnosing whether the problem is at the database side or the web side is also important. This diagnosis is best done via logging the exact UTF-16 code units being used, not just strings (as otherwise further encoding issues during logging or console output can occur).
  • In my experience, problems like this can easily occur at either insert or read code paths.

All of this is important as a way of moving the OP forward, not just in a comment-like request for more information.

Kendall answered 1/1, 2013 at 18:18 Comment(17)
thanks Jon i found that:roseindia.net/jdbc/prepared-statement-example.shtmlMawson
@MatanTouti: I wouldn't say that's a great page - bad exception handling, no resource clean-up, etc... docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html is somewhat better, although still not perfect...Kendall
the values i trying to insert are correct, by the debugger (is it good enough verification?): allprofessions = "INSERT INTO prodb.professions (id, profession, professionEn) VALUES ( 1,'אחר','other'),( 2,'חשמלאי','electrician'),( 3,'אינסטלטור','plubmber')". The value im getting actually are the table i showed above. And how to check a web response is over my knowledge.. i think that the answer more as a declaration of UTF 8 somewhere (i'm guessing).Mawson
@MatanTouti: Well, as I said - use Wireshark to see exactly what's coming in the response. If you haven't used Wireshark before, now is a good time to learn - all web developers should know how to see what's coming across the wire...Kendall
the server im running is localhost, with xampp. can i use wireshark to examine that although its localhost?Mawson
@MatanTouti: That may make it harder, admittedly. You might be able to use Fiddler (fiddler2.com) instead. Or just find another machine to use to make the request. Alternatively, use curl or something similar - anything which will let you see the exact data in the response, in binary form.Kendall
You're not answering the question in any way. There are only 2 places where question marks would appear when a wrong character encoding is used. The response encoding is clearly scratched from being suspect. The request encoding as you seem to suspect is impossibly the cause. It would only have resulted in Mojibake, not in question marks.Eulaliaeulaliah
@BalusC: I don't think anything is scratched from being suspect, until it's actually proved. I've seen enough bad diagnoses of this sort of thing to be very reluctant to make assumptions. My answer is more about how to diagnose things more clearly than anything else, and I think that's an important part of making progress. So it's not the final solution - but it's a step along the way.Kendall
@BalusC: I think we've had similar disagreements on several questions. I believe my answer is an answer to how to move forward. I think we'll have to agree to disagree on this - I think it's fine as an answer, and I'm not going to move it to a comment.Kendall
@BalusC: This isn't the first time you've basically accused me of providing answers in an attempt to get reputation instead of helping. Please desist; it's simply untrue. We disagree about what is helpful, and what constitutes an answer, but using terms like "potentially vote-generating blah" and "clueless shooting in the dark" is not the way forward, IMO. Feel free to flag for moderator consideration if you're not happy to just leave things be.Kendall
Okay, I think this has gone far enough - further than is constructive. I don't view my answer as "randomly googled crap" nor do I regard Java/servlets/encodings as something "which I know nothing about". I'm flagging your comment for moderator attention, not to try to obtain a penalty or anything similar - just to sort this out.Kendall
As per your update, you're only describing cases where Mojibake will occur, not question marks. In other words, you're confusing Mojibake with question marks. Plus, the OP has clearly demonstrated that even inserting hardcoded Hebrew results in question marks, which thus clearly excludes HTTP from being the cause. The OP has also explicitly mentioned that the problem is visible straight in the table (thus he looked in there using some DB admin tool), so the cause is clearly in JDBC driver part.Eulaliaeulaliah
@BalusC: It excludes HTTP from being the cause at insert, but it wasn't clear (IMO) whether the tool used for diagnostics would have been capable of getting the right result. It seemed likely that the problem was JDBC rather than HTTP, but IMO it was still worth helping the OP to diagnose the problem more specifically. (Personally I'd have written a console app which just did an insert then fetch.) Aside from anything, this is "teach a man how to fish" territory.Kendall
It's very unlikely that using the "wrong" diagnostics tool would have resulted in question marks. More unlikely than HTTP would be the cause. I recommend to read the "See also" link in my answer to learn when question marks would occur instead of Mojibake.Eulaliaeulaliah
@BalusC: Again, I think we'll have to agree to differ on what's helpful and what's not. (I've seen plenty of bad ways of diagnosing issues which can very easily end up with question marks - using a logger with a platform-default encoding could do it for exactly the reasons you've given, for example.)Kendall
Again, you're confusing Mojibake with question marks. Using the wrong logger encoding would have resulted in Mojibake, not in question marks. Please read the "See also" link to learn more.Eulaliaeulaliah
let us continue this discussion in chatKendall

© 2022 - 2024 — McMap. All rights reserved.