Use multiple stemming languages with ElasticSearch
G

5

5

I'm building a search engine for a website whose users come from many different countries and post text content.

I'll assume that:

- A French user generates content in French and English
- A German user generates content in German and English

and so on.

What I'd like to know is whether it is possible to run a search using several Snowball stemmer languages at the same time, so that we get appropriate results for all of those languages at once.

Do we have to create one index per Snowball stemmer language?

Is there a known pattern for such a case?

Thanks

Germanism answered 14/6, 2012 at 22:16 Comment(2)
Not sure I understand what you want here. You are trying to search multiple languages with a single query and return mixed results (results in multiple languages)? Coronach
Yes, on a single search field, I'd like to be able to retrieve documents in multiple languages (basically two: the language of the user's country, and English). Germanism
G
1

This new ElasticSearch plugin works fine:

https://github.com/yakaz/elasticsearch-analysis-combo

Germanism answered 9/8, 2012 at 21:15 Comment(1)
Yes, still applicable 12 years later. But this doesn't address the question of the structure of the mappings or how you prepare the (Lucene) documents you're actually entering into the index. See my comments to David Kong's answer. Singletree
A
3

Earlier this year, Kiju Kim from the Elasticsearch team published some good articles on the elastic.co blog about working with multiple languages.

You can basically use multiple fields for your content, one for each language you want to support (Part 2), each using a language-specific analyzer (Part 1). Part 3 adds an optimisation: an ingest pipeline (using an ingest plugin for language detection) detects the language and populates only the matching language field instead of all fields.
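
To make the multi-field idea concrete, here is a minimal sketch in Python that only builds the request bodies. It is not taken from the blog series: the field name `content`, the sub-field names `en`/`fr`/`de`, and the choice of a `most_fields` query are my own assumptions, and the mapping uses the typeless format of Elasticsearch 7 and later. The `english`, `french` and `german` analyzers are the built-in language analyzers.

```python
import json

# Built-in Elasticsearch language analyzers. The sub-field names (en, fr, de)
# are illustrative assumptions, not something the blog posts prescribe.
LANGUAGE_ANALYZERS = {"en": "english", "fr": "french", "de": "german"}


def build_mapping(field="content"):
    """Return a mapping with one sub-field per language, each with its own analyzer."""
    sub_fields = {
        lang: {"type": "text", "analyzer": analyzer}
        for lang, analyzer in LANGUAGE_ANALYZERS.items()
    }
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "standard", "fields": sub_fields}
            }
        }
    }


def build_query(text, user_lang, field="content"):
    """Return a query that searches the user's language sub-field plus the English one."""
    langs = sorted({user_lang, "en"})
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": [f"{field}.{lang}" for lang in langs],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_query("chevaux", "fr"), indent=2))
```

The first body would go into the index creation request and the second into a `_search` request. Each sub-field is stemmed with its own analyzer at index time, and the query text is analyzed separately for each sub-field it targets, so a French user's search hits `content.fr` and `content.en` without the stemmers interfering with each other.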

Akee answered 23/12, 2018 at 15:17 Comment(0)
C
2

So quick disclaimer: I'm not an expert in stemming or language morphology, but since no one else is responding, here's my understanding. Also, most of my experience is with Solr.

In order to be able to query with stemming against multiple languages with a single, mixed result set, you need to use a multilingual stemmer. I'm not sure what is available for Elasticsearch.

If you apply multiple stemmers designed for single languages to a single index, they will step on each other's toes and likely not produce the expected results (stemming rules vary significantly depending on the language).

Having an index per language, each with its respective stemmer, works for queries with single-language results. Trying to combine results from multiple queries against multiple indices is usually fairly problematic (you have to attempt to normalize relevancy and deal with paging).
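
To illustrate what "normalize relevancy and deal with paging" involves on the client side, here is a rough Python sketch. It assumes each per-language query returns a list of `(doc_id, score)` pairs and uses min-max scaling, which is only one of several possible normalizations and is my own choice, not something Solr or Elasticsearch does for you. All names are hypothetical.

```python
def normalize(hits):
    """Min-max scale the scores of one result list into the range [0, 1]."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    span = high - low
    # If every score is identical, treat all hits as equally relevant.
    return [
        (doc_id, 1.0 if span == 0 else (score - low) / span)
        for doc_id, score in hits
    ]


def merge(result_lists, page, size):
    """Merge normalized result lists, keep the best score per document, return one page."""
    best = {}
    for hits in result_lists:
        for doc_id, score in normalize(hits):
            if score > best.get(doc_id, -1.0):
                best[doc_id] = score
    # Sort by descending score, then by id so that ties are stable.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    start = page * size
    return ranked[start:start + size]


if __name__ == "__main__":
    french = [("fr1", 12.0), ("fr2", 6.0)]
    english = [("en1", 3.0), ("en3", 2.0), ("en2", 1.0)]
    print(merge([french, english], page=0, size=3))
```

The awkward part is visible in `merge`: to serve page `n` correctly you have to fetch at least `(n + 1) * size` hits from every index before merging, and the scaled scores are only loosely comparable between languages.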

Coronach answered 15/6, 2012 at 18:42 Comment(1)
Thanks. I asked the Elasticsearch experts at my company, and it seems we can use a multilingual stemmer if the document is able to provide the language to use. But as for using two stemmers for the same document, I don't know yet. It's not always easy to determine the language of a document, which is why I wanted to index the same document in multiple languages. Germanism
A
2

You can create two separate indices and search both (or all) of them at the same time. As long as the indices have the same fields, you will get valid results.
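
As a sketch of what that looks like in practice: Elasticsearch accepts a comma-separated list of index names in the search path. The Python below only builds the path and body. The index names `content_fr` and `content_en` and the field name `body` are hypothetical, and I am assuming each index maps that field with its own language analyzer.

```python
def build_multi_index_search(indices, field, text):
    """Return the (path, body) pair for one search across several indices."""
    path = "/" + ",".join(indices) + "/_search"
    body = {"query": {"match": {field: text}}}
    return path, body


if __name__ == "__main__":
    path, body = build_multi_index_search(
        ["content_fr", "content_en"], "body", "chevaux"
    )
    print("GET", path)
    print(body)
```

As far as I understand, the query text is analyzed on each index with the analyzer configured for that field there, which is why the field names have to match across the indices.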

Amaras answered 22/6, 2012 at 21:44 Comment(0)
F
0

You can combine stemmers in a single analyzer. I assume there will be conflicts and that the order will matter. I wonder how big a problem that is.

"settings": {
    "index": {
        "analysis": {
            "filter": {
                "german_stemmer": {
                    "type": "stemmer",
                    "name": "light_german"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "name": "english"
                },
                "french_stemmer": {
                    "type": "stemmer",
                    "name": "light_french"
                },
                "italian_stemmer": {
                    "type": "stemmer",
                    "name": "light_italian"
                }
            },
            "analyzer": {
                "asdfghjkl": {
                    "tokenizer": "standard",
                    "filter": [
                        "english_stemmer",
                        "italian_stemmer",
                        "french_stemmer",
                        "german_stemmer"
                    ]
                }
            }
        }
    }
}
Fogg answered 28/11, 2019 at 4:57 Comment(3)
I don't think this is a desirable approach. You need to have separate fields for each language, and then an appropriate stemmer for each of them. When compiling the index you need to identify the language of the text at any one time and put each (Lucene) document into the index using the right field. It is not just bad if you try to apply an Italian stemmed query to an English stemmed field, it is disastrous. Singletree
If the Lucene documents you enter into the index are multilingual themselves, even assuming it is possible to separate the languages when compiling the index (the "lingua" crate in Rust, for example), then what you need to do is create language-specific versions. Say you have some Italian and some English: in the Italian version all the non-Italian text characters should be carefully replaced with placeholder characters (e.g. "?", or Unicode U+FFFC), and in the English version all the non-English text characters should be replaced with placeholder characters. Singletree
That will of course deliver results full of "?" characters. Assuming you are using highlighting, what you need to do then, in the receiving code, is to process the original text (i.e. under _source) and insert all the HTML highlight markup (e.g. <em>, </em>) into that original text from the delivered highlighted results. It's quite complicated but not impossible. Singletree
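
The masking and highlight-transfer steps described in these comments could be sketched roughly as follows in Python. This is an illustration under several assumptions, not the commenter's actual code: the language spans are assumed to come from some language detector as `(start, end)` character offsets, characters are masked one for one so that offsets are preserved, and the highlighter is assumed to return the whole field (for example with `number_of_fragments` set to 0) without HTML-escaping it. The function names are hypothetical.

```python
PLACEHOLDER = "\ufffc"  # U+FFFC, one of the placeholder characters suggested above


def mask_outside(text, keep_spans):
    """Replace every character outside keep_spans with PLACEHOLDER, one for one."""
    masked = [PLACEHOLDER] * len(text)
    for start, end in keep_spans:
        masked[start:end] = text[start:end]
    return "".join(masked)


def transfer_highlights(original, highlighted, pre="<em>", post="</em>"):
    """Copy highlight tags from a masked, highlighted string onto the original text.

    Assumes that removing the tags from `highlighted` leaves a string with
    exactly as many characters as `original`.
    """
    out = []
    pos = 0  # current position in the original text
    i = 0    # current position in the highlighted text
    while i < len(highlighted):
        if highlighted.startswith(pre, i):
            out.append(pre)
            i += len(pre)
        elif highlighted.startswith(post, i):
            out.append(post)
            i += len(post)
        else:
            if pos >= len(original):
                raise ValueError("highlighted text is longer than the original")
            out.append(original[pos])
            pos += 1
            i += 1
    if pos != len(original):
        raise ValueError("highlighted text is shorter than the original")
    return "".join(out)


if __name__ == "__main__":
    original = "Ciao mondo. Hello world."
    english_version = mask_outside(original, [(12, 24)])
    # Pretend the highlighter matched "world" in the masked English version.
    highlighted = english_version.replace("world", "<em>world</em>")
    print(transfer_highlights(original, highlighted))
```

Because the masked version has the same length as the original, the position of every tag in the highlighted output maps directly onto the text stored under _source.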
