MongoDB: how to find documents ignoring case sensitive, accents and percent like logic (%)
Asked Answered
T

4

10

I would like to make a search on a collection in my mongodb database. In my collection, I have documents with the field "name" can be values like:

[i] "Palácio Guanabara", "Palácio da Cidade", "Festa Palácio", etc.

When a user types a search like "pala" or "palá" or "Pala" or "PalÁ", all those itens in [i] must build the result set.

I found that in MongoDB I could use regex in searches, like:

{ "name": { $regex: new Regex(".*pala.*", "i") } }

Ok, this approach is case insensitive and use the percent like logic from SQL ("%pala%"). But, it isn't ignore accents from the register in database.

I found another alternative with the $text index: https://docs.mongodb.org/manual/core/index-text/

This approach can ignore case sensitive and accents. But the "search" does not accepts a regex, so I can't search things like "%pala%".

Summing up, I want to make the following SQL query in MongoDB:

select * from collection where remove_accents(upper(name)) like '%Pala%'

And this query returning results with name like "palácio", "palacio", "PaláCiô", etc.

Tachylyte answered 15/4, 2016 at 12:31 Comment(2)
As suggested here #7708171 your probable best bet is to created a field with an unaccented version of your searchable string.Infliction
Thanks @Infliction . I'll use the Dmitriy approach and in the future I'll create this filed "searchable".Tachylyte
V
2

There is no magic bullet here inside of MongoDb. But since you obviously changing user input anyway to create '%pala%', why not replace "a" with "[aá]" and wrap in ".*", this way you can use regex and have your diacritics.

Here are options not much work to create replaces.

French Letters [a-zA-ZàâäôéèëêïîçùûüÿæœÀÂÄÔÉÈËÊÏΟÇÙÛÜÆŒ]

German Letters The controversial capital letter for ß, now included in unicode, is missing in many fonts, so it might show on your screen as a question mark. [a-zA-ZäöüßÄÖÜẞ]

Polish Letters [a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ] Note that there is no Q, V and X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]

Italian Letters [a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]

Spanish Letters [a-zA-ZáéíñóúüÁÉÍÑÓÚÜ] from http://www.rexegg.com/regex-interesting-character-classes.html#languages

Vehemence answered 15/4, 2016 at 12:53 Comment(1)
Thanks for helping. In this moment, I'll use this approach to solve my problem. It works, I used this useful function in this postTachylyte
C
17

This works for me!

Just replace vowels with a regex which contains accents vowels. Additionally can use $options: 'i' to avoid case sensitive.

function diacriticSensitiveRegex(string = '') {
     return string
        .replace(/a/g, '[a,á,à,ä,â]')
        .replace(/A/g, '[A,a,á,à,ä,â]')
        .replace(/e/g, '[e,é,ë,è]')
        .replace(/E/g, '[E,e,é,ë,è]')
        .replace(/i/g, '[i,í,ï,ì]')
        .replace(/I/g, '[I,i,í,ï,ì]')
        .replace(/o/g, '[o,ó,ö,ò]')
        .replace(/O/g, '[O,o,ó,ö,ò]')
        .replace(/u/g, '[u,ü,ú,ù]')
        .replace(/U/g, '[U,u,ü,ú,ù]');
    }

find ({ name: { $regex: diacriticSensitiveRegex('pala'), $options: 'i' } });
Caldarium answered 24/4, 2019 at 11:44 Comment(2)
Do you know if having a more complex regex will impact query performances ? Should i rather store a deburred version or use your diacriticSensitiveRegex ? Thank you !District
to avoid case sensitive, you have to improve this by adding uppercase case for each vowel ; example : .replace(/A/g, '[A,a,á,à,ä,â]') . I'll try to update answer.Cavesson
D
8

what happened if you use just :

find({name: {$regex: 'pala', $options: "i"}})

you used new Regex() that may not valid constructor valid constructor is new RegExp()

Delete answered 15/4, 2016 at 13:16 Comment(1)
option i only works for case insensitive case, not for the accents.Landed
V
2

There is no magic bullet here inside of MongoDb. But since you obviously changing user input anyway to create '%pala%', why not replace "a" with "[aá]" and wrap in ".*", this way you can use regex and have your diacritics.

Here are options not much work to create replaces.

French Letters [a-zA-ZàâäôéèëêïîçùûüÿæœÀÂÄÔÉÈËÊÏΟÇÙÛÜÆŒ]

German Letters The controversial capital letter for ß, now included in unicode, is missing in many fonts, so it might show on your screen as a question mark. [a-zA-ZäöüßÄÖÜẞ]

Polish Letters [a-pr-uwy-zA-PR-UWY-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ] Note that there is no Q, V and X in Polish. But if you want to allow all English letters as well, use [a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]

Italian Letters [a-zA-ZàèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]

Spanish Letters [a-zA-ZáéíñóúüÁÉÍÑÓÚÜ] from http://www.rexegg.com/regex-interesting-character-classes.html#languages

Vehemence answered 15/4, 2016 at 12:53 Comment(1)
Thanks for helping. In this moment, I'll use this approach to solve my problem. It works, I used this useful function in this postTachylyte
B
1

In any reasonably modern version of Mongo (3.6 and beyond iirc) you can add a text index to a field and search using a $text search. E.g.

MyCollection.createIndex({ "someTextField": "text" });

Putting a text index on the field. Then use the $text search. This is pseudo code and not actual code. If you have multiple fields and need to search a single field this won't work for you.

Document doc = MyCollection.find({ $text: {
    $search: "nino",
    $caseSensitive: false,
    $diacriticSensitive: false
}}).first();
Behka answered 28/2, 2023 at 23:19 Comment(1)
but this is only for whole word searchUmbel

© 2022 - 2024 — McMap. All rights reserved.