Can MongoDB store and manipulate strings of UTF-8 with code points outside the basic multilingual plane?

In MongoDB 2.0.6, when I try to store or query documents whose string fields contain characters outside the BMP, I get a raft of errors like "Not proper UTF-16: 55357" or "buffer too small".

What settings, changes, or recommendations are there to permit storage and query of multi-lingual strings in Mongo, particularly ones that include these characters above 0xFFFF?

Thanks.

Bissextile answered 31/7, 2012 at 19:30 Comment(12)
Can you post the exact errors you're getting? Also, what driver are you using to access MongoDB? (This could easily be a driver error.)Exaggerative
The longer form error looks like this: com.mongodb.CommandResult$CommandFailure: command failed [command failed [mapreduce] { "serverUsed" : "127.0.0.1:27017" , "assertion" : "Not proper UTF-16: 55356" , "assertionCode" : 13498 , "errmsg" : "db assertion failure" , "ok" : 0.0}Bissextile
Clearly, by the way, I should have referenced UTF-16 in my question. The code it's complaining about is D83C, which I'm fairly sure is the high code of a UTF-16 surrogate pair for something in a supplementary code plane.Bissextile
According to the dependencies.groovy file in the MongoDB-GORM plugin, it's using the MongoDB Java driver version 2.7.1 ... compile("org.mongodb:mongo-java-driver:2.7.1",,excludes) ...Bissextile
Digging into the MongoDB code, I see that this error message is coming from Spider Monkey, upon failure of JS_EncodeCharacters, on or about line 205 in mongo/scripting/engine_spidermonkey.cpp. Thanks for checking on this, @WilliamZBissextile
OK @Bissextile -- I need some more data to move forward on this. The error message is referring to mapReduce(), which you didn't mention in your original question. Please let me know (using the native Java driver): (a) are you able to insert data using this UTF-16 character? (b) are you able to query data using this UTF-16 character in the query pattern? With this information I'll be able to take the next step in the diagnosis.Exaggerative
OK, @WilliamZ, I'll give it a try. May take me a bit. I suspect the core issue is inserting a string with a broken surrogate pair in it. When I added a filter to prevent such strings from being inserted, the problem apparently stopped. And, you're right, I should have mentioned that it shows up when I run a mapreduce. I haven't seen the error report on an insert or a query. But my queries are pretty rare compared to my mapreduce calls, so that may not be very indicative.Bissextile
@WilliamZ By the way, when I say "broken surrogate pair" above, I mean the high word of the pair was in the string, but the low word was probably missing. Imagine a string that ends with \uD83CBissextile
My best guess at this point is that the MongoDB engine handles this just fine, but something in the SpiderMonkey JavaScript engine breaks. The reason is that all MapReduce calls go through the JavaScript engine, while the direct CRUD operations do not. If you can create a reproducible test case, your best bet is probably to file a Jira ticket.Exaggerative
That sounds likely, given what I've seen so far. Thanks, @WilliamZBissextile
I'm not a Groovy guy :-), and I know less about UTF than I'd like. Can you post a sample of Groovy code that stores a "broken surrogate pair"? I can't move forward on my diagnosis without that.Exaggerative
I can probably describe it faster than I can code it. All it would take is to create a collection with documents that look like this (id skipped): { "name":"foobar", "score":10} Then, insert a document that looks like this: { "name":"\uD83C", "score":5} Finally, do a mapreduce to add up all the scores using the name as the key and you should hit the problem. The string "\uD83C" is a no-good UTF-16 string because that single 16-bit code is a high surrogate: it isn't supposed to stand alone, but should be followed by a low surrogate in the range 0xDC00-0xDFFF.Bissextile
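
To illustrate the "broken surrogate pair" described in the comment above, here is a minimal Python sketch that finds unpaired surrogates in a string's UTF-16 code units. The helper names `utf16_units` and `unpaired_surrogates` are hypothetical and are not part of MongoDB or any driver; the sketch only shows why a lone 0xD83C (decimal 55356, the value in the error message) is not well-formed UTF-16.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of ints.

    The 'surrogatepass' error handler lets lone surrogates through
    instead of raising, so broken strings can be inspected.
    """
    raw = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def unpaired_surrogates(units):
    """Return the indexes of surrogate code units that lack a valid partner."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad


# The string from the repro: a single high surrogate with nothing after it.
print(unpaired_surrogates(utf16_units("\ud83c")))        # index 0 is unpaired
# A complete pair (U+1F389) is fine.
print(unpaired_surrogates(utf16_units("\U0001F389")))    # nothing reported
```

A check like this could run before documents are inserted, which matches the filter workaround mentioned earlier in the comments.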

There are several issues here:

1) Please be aware that MongoDB stores all documents using the BSON format. Also note that the BSON spec refers to a UTF-8 string encoding, not a UTF-16 encoding.

Ref: http://bsonspec.org/#/specification

2) All of the drivers, including the JavaScript driver in the mongo shell, should properly handle strings that are encoded as UTF-8. (If they don't then it's a bug!) Many of the drivers happen to handle UTF-16 properly, as well, although as far as I know, UTF-16 isn't officially supported.

3) When I tested this with the Python driver, MongoDB could successfully load and return a string value that contained a broken UTF-16 code pair. However, I couldn't load a broken code pair using the mongo shell, nor could I store a string containing a broken code pair into a JavaScript variable in the shell.

4) mapReduce() runs correctly on string data using a correct UTF-16 code pair, but it will generate an error when trying to run mapReduce() on string data containing a broken code pair.

It appears that mapReduce() fails when MongoDB tries to convert the BSON to a JavaScript variable for use by the JavaScript engine.

5) I've filed Jira issue SERVER-6747 for this issue. Feel free to follow it and vote it up.
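
Until that issue is resolved, one possible workaround is to clean strings on the client before inserting them. The following is a minimal Python sketch, not an official MongoDB or driver API: the function name `replace_unpaired_surrogates` is hypothetical, and replacing bad code units with U+FFFD is an assumption about what is acceptable for your data. It also shows that a complete supplementary-plane character encodes to UTF-8 without trouble, while a lone surrogate does not.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed acceptable here)


def replace_unpaired_surrogates(text):
    """Return `text` with every unpaired UTF-16 surrogate replaced by U+FFFD."""
    raw = text.encode("utf-16-le", "surrogatepass")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed pair: keep both code units.
            out.extend((unit, units[i + 1]))
            i += 2
            continue
        # Any other surrogate is unpaired and gets replaced.
        out.append(REPLACEMENT if 0xD800 <= unit <= 0xDFFF else unit)
        i += 1
    return struct.pack("<%dH" % len(out), *out).decode("utf-16-le")


# A complete non-BMP character is valid UTF-8 (four bytes).
print("\U0001F389".encode("utf-8"))

# A lone high surrogate cannot be encoded as strict UTF-8.
try:
    "\ud83c".encode("utf-8")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)

# After cleaning, the string encodes without error.
print(replace_unpaired_surrogates("foo\ud83c").encode("utf-8"))
```

Replacing rather than dropping the bad code unit keeps the string length stable and leaves a visible marker that data was damaged upstream; dropping it would also work if that suits the application better.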

Exaggerative answered 10/8, 2012 at 21:39 Comment(1)
Excellent. Thank you @WilliamZ for looking into this.Bissextile
