MongoDB: match accented characters as their underlying character

4

18

In MongoDB "db.foo.find()" syntax, how can I tell it to match all letters and their accented versions?

For example, if I have a list of names in my database:
João
François
Jesús

How would I allow a search for the strings "Joao", "Francois", or "Jesus" to match the given name?
I am hoping that I don't have to do a search like this every time:
db.names.find({name : /Fr[aã...][nñ][cç][all accented o characters][all accented i characters]s/ })

Janejanean answered 10/10, 2011 at 1:9 Comment(0)
24

As of MongoDB 3.2, you can use the $text query operator (which requires a text index on the searched field) and set $diacriticSensitive to false:

{
  $text:
    {
      $search: <string>,
      $language: <string>,
      $caseSensitive: <boolean>,
      $diacriticSensitive: <boolean>
    }
}

See more in the Mongo docs: https://docs.mongodb.com/manual/reference/operator/query/text/
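
For illustration, here is a minimal sketch of how this might look for the names collection from the question. It assumes a text index on the name field; buildNameTextQuery and textIndexSpec are hypothetical names chosen for this example, not part of MongoDB.

```javascript
// Index specification: a text index on `name`, which $text queries require.
const textIndexSpec = { name: "text" };

// Hypothetical helper: builds a $text filter that ignores diacritics.
function buildNameTextQuery(term) {
  return {
    $text: {
      $search: term,
      $diacriticSensitive: false,
    },
  };
}

// In the mongo shell, assuming the `names` collection from the question:
//   db.names.createIndex(textIndexSpec)
//   db.names.find(buildNameTextQuery("Joao"))
```

The filter is built as a plain object so that the same shape can be passed to find() from the shell or from a driver.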

Cronin answered 20/9, 2016 at 14:36 Comment(2)
I think this earns Best Answer, 5 years later, for a feature that didn't exist when I asked the question. I am no longer working on the relevant project, but I'll trust that this does what I wanted. - Janejanean
Beware that text indexes do not support partial-word search, so a search for "Comp" would not match a "Company" string stored in the database. Only whole words match: "Company" would match "Company Tesko", but "Comp" would not. - Papilionaceous
11

I suggest adding an indexed field, e.g. NameSearchable, that holds a simplified version of each string:

  • João -> JOAO
  • François -> FRANCOIS
  • Jesús -> JESUS
  • Jürgen -> JUERGEN

The same mapping that is used when inserting new items into the database can be applied to the search term. The original string, with correct casing and accents, is preserved in its own field.

Most importantly, the query can make use of indexing. Case-insensitive queries and regex queries cannot use indexes (with the exception of rooted regexes) and will grow prohibitively slow on large collections.

Oh, and since the simplified strings can be created from the original strings, it's not a problem to add this to existing collections.
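
One possible way to implement such a mapping in JavaScript is sketched below. It relies on Unicode decomposition (String.prototype.normalize) to strip combining accent marks; toSearchable and withSearchable are hypothetical helper names. Note that this simple version maps Jürgen to JURGEN, not JUERGEN: language-specific transliterations like ü -> UE would need an extra lookup table, which is left out here.

```javascript
// Strip diacritics and upper-case the result.
// NFD splits "ã" into "a" + a combining tilde; the regex then removes
// all combining marks in the range U+0300..U+036F.
function toSearchable(text) {
  return text
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toUpperCase();
}

// Hypothetical helper: attach the simplified field before inserting.
function withSearchable(doc) {
  return { ...doc, NameSearchable: toSearchable(doc.name) };
}

// In the mongo shell, assuming the `names` collection from the question:
//   db.names.createIndex({ NameSearchable: 1 })
//   db.names.insertOne(withSearchable({ name: "João" }))
//   db.names.find({ NameSearchable: toSearchable("Joao") })
```

Because the search term goes through the same function as the stored value, the query is an exact match on an indexed field.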

Amoebaean answered 29/2, 2012 at 13:58 Comment(3)
This suggestion makes lots of sense. Do you have any snippet that supports it? - Julieannjulien
That unfortunately depends on the programming language. #249587 discusses a number of ways to do it in C#, #991404 contains JavaScript snippets (but the approach is different). - Amoebaean
I made a simple implementation of this for Node users: github.com/weisjohn/mongoose-latinize - Odalisque
3

In this blog post, somebody used the approach you describe in your question: http://tech.rgou.net/en/php/pesquisas-nao-sensiveis-ao-caso-e-acento-no-mongodb-e-php/

As far as I know, this is the only solution for the latest MongoDB version.
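
As a rough sketch of that idea in JavaScript (not taken from the linked post), the regex from the question could be generated from a plain search term instead of being written by hand. The VARIANTS table below is a small illustrative assumption and far from complete, accentInsensitivePattern is a hypothetical helper name, and stored names are assumed to use precomposed (NFC) characters.

```javascript
// Illustrative, incomplete table of base letters and accented variants.
const VARIANTS = {
  a: "aàáâãäå",
  c: "cç",
  e: "eèéêë",
  i: "iìíîï",
  n: "nñ",
  o: "oòóôõö",
  u: "uùúûü",
};

// Turn "Francois" into "fr[aàáâãäå][nñ][cç][oòóôõö][iìíîï]s".
function accentInsensitivePattern(term) {
  return Array.from(term.toLowerCase(), (ch) => {
    const variants = VARIANTS[ch];
    if (variants) {
      return `[${variants}]`;
    }
    // Escape regex metacharacters in everything else.
    return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }).join("");
}

// In the mongo shell, assuming the `names` collection from the question:
//   db.names.find({
//     name: { $regex: "^" + accentInsensitivePattern("Francois") + "$", $options: "i" }
//   })
```

The pattern is combined with the case-insensitive option, so this kind of query is unlikely to benefit much from an index on large collections.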

Endor answered 29/2, 2012 at 11:19 Comment(0)
0

This looks more like fuzzy matching, which MongoDB does not currently support. What you can try is:

1. Store the variations of each name in a separate array field on its document. The query then checks whether the search term exists in that variations array.

or

2. Store a Soundex string for each name in the same collection. For each search string, compute its Soundex code and query the database with it; you will get results whose Soundex code matches that of your query. You can then filter and verify that data further in your script. Example:

Soundex code for François = F652, Soundex code for Francois = F652

Soundex code for Jesús = J220, Soundex code for Jesus = J220

Read more here: http://creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#SoundExConverter
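
For reference, a minimal sketch of an American Soundex function in JavaScript that reproduces the codes above. It is one common variant of the algorithm, not necessarily the one used by the linked converter, and it assumes accents are stripped before encoding (otherwise ç and ú would simply be dropped).

```javascript
// Digit assigned to each consonant; vowels, h, w and y have no code.
const SOUNDEX_CODES = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6,
};

function soundex(name) {
  // Strip accents, lower-case, and keep only the letters a-z.
  const letters = name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z]/g, "");
  if (letters.length === 0) {
    return "";
  }

  let result = letters[0].toUpperCase();
  let previous = SOUNDEX_CODES[letters[0]] ?? 0;

  for (const ch of letters.slice(1)) {
    // h and w are skipped and do not separate equal codes.
    if (ch === "h" || ch === "w") {
      continue;
    }
    const code = SOUNDEX_CODES[ch] ?? 0; // vowels and y map to 0
    if (code !== 0 && code !== previous) {
      result += code;
    }
    previous = code;
    if (result.length === 4) {
      break;
    }
  }
  // Pad short codes with zeros to one letter plus three digits.
  return result.padEnd(4, "0");
}
```

The code could then be stored in its own indexed field next to the original name and compared with the Soundex code of the search term.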

Collimore answered 10/10, 2011 at 9:50 Comment(2)
Variations won't work, because if I have a name with multiple accents I don't want to have to store every permutation of accented and nonaccented characters. Soundex looks interesting. But neither one exactly does what I want. I was hoping for a "treat accented characters as their base character" magic regex flag or something... If this does not exist, then I think I'll just have to programmatically modify the regex before it gets to mongo. - Janejanean
You may find one solution here: #1891354 - Collimore
