We have an OData-compliant API that delegates some of its full text search needs to an Elasticsearch cluster.
Since OData expressions can get quite complex, we decided to simply translate them into their equivalent Lucene query syntax and feed that into a query_string query.
We do support some text-related OData filter expressions, such as:
startswith(field,'bla')
endswith(field,'bla')
substringof('bla',field)
name eq 'bla'
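Just to illustrate what "translate into Lucene query syntax" means (the field name and value are made up, and these naive wildcard forms only convey the idea, they're not what we actually generate; our current workaround is shown further down):

startswith(name,'bla')    =>  name:bla*
endswith(name,'bla')      =>  name:*bla
substringof('bla',name)   =>  name:*bla*
name eq 'bla'             =>  name:"bla"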
The fields we're matching against can be analyzed, not_analyzed, or both (i.e. via a multi-field). The searched text can be a single token (e.g. table), only a part thereof (e.g. tab), or several tokens (e.g. table 1., table 10, etc.).
The search must be case-insensitive.
Here are some examples of the behavior we need to support:
startswith(name,'table 1') must match "Table 1", "table 100", "Table 1.5", "table 112 upper level"
endswith(name,'table 1') must match "Room 1, Table 1", "Subtable 1", "table 1", "Jeff table 1"
substringof('table 1',name) must match "Big Table 1 back", "table 1", "Table 1", "Small Table12"
name eq 'table 1' must match "Table 1", "TABLE 1", "table 1"
So basically, we take the user input (i.e. what is passed into the 2nd parameter of startswith/endswith, resp. the 1st parameter of substringof, resp. the right-hand side value of the eq) and try to match it exactly, whether the tokens fully match or only partially.
Right now, we're getting away with a clumsy solution (highlighted below) which works pretty well but is far from ideal.
In our query_string, we match against a not_analyzed field using the regular expression syntax. Since the field is not_analyzed and the search must be case-insensitive, we do our own tokenizing while preparing the regular expression to feed into the query, in order to come up with something like the following, which is equivalent to the OData filter endswith(name,'table 8') (i.e. match all documents whose name ends with "table 8"):
"query": {
"query_string": {
"query": "name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/",
"lowercase_expanded_terms": false,
"analyze_wildcard": true
}
}
So even though this solution works pretty well and the performance is not too bad (which came as a surprise), we'd like to do it differently and leverage the full power of analyzers in order to shift all this burden to indexing time instead of search time. However, since reindexing all our data will take weeks, we'd like to first investigate whether there's a good combination of token filters and analyzers that would help us achieve the same search requirements enumerated above.
My thinking is that the ideal solution would contain some wise mix of shingles (i.e. several tokens together) and edge-nGram (i.e. to match at the start or end of a token). What I'm not sure of, though, is whether it is possible to make them work together in order to match several tokens where one of the tokens might not be fully input by the user. For instance, if the indexed name field is "Big Table 123", I need substringof('table 1',name) to match it, so "table" is a fully matched token, while "1" is only a prefix of the next token.
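To make the idea concrete, here's a minimal sketch of the kind of analysis chain I have in mind; the index, field and analyzer names, shingle sizes and gram sizes are just guesses, and I haven't verified that it covers all the requirements above:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "name_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        },
        "name_edge_ngrams": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 25
        }
      },
      "analyzer": {
        "name_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "name_shingles", "name_edge_ngrams"]
        },
        "name_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "name_index_analyzer",
        "search_analyzer": "name_search_analyzer"
      }
    }
  }
}

With something like that, indexing "Big Table 123" produces (among other tokens) the shingles "big table" and "table 123", and the edge-nGrams of "table 123" include "table 1", which is exactly the gram that substringof('table 1',name) would need to hit.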
Thanks in advance for sharing your brain cells on this one.
UPDATE 1: after testing Andrei's solution
=> Exact match (eq) and startswith work perfectly.
A. endswith glitches
Searching for substringof('table 112', name) yields 107 docs. Searching for a more specific case such as endswith(name, 'table 112') yields 1525 docs, while it should yield fewer (suffix matches should be a subset of substring matches). Checking in more depth, I've found some mismatches, such as "Social Club, Table 12" (which doesn't contain "112") or "Order 312" (which contains neither "table" nor "112"). I guess it's because they end with "12" and that's a valid gram for the token "112", hence the match.
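That hypothesis is easy to verify with the _analyze API, using whatever nGram filter the proposed mapping defines; for instance, with an inline filter just for illustration:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "ngram", "min_gram": 2, "max_gram": 3 }
  ],
  "text": "Order 312"
}

If "12" shows up among the emitted tokens, then any query whose terms get reduced to grams like "12" will match "Order 312" even though it has nothing to do with "table 112".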
B. substringof glitches
Searching for substringof('table',name) matches "Party table" and "Alex on big table" but doesn't match "Table 1", "table 112", etc. Searching for substringof('tabl',name) doesn't match anything.
UPDATE 2
It was sort of implied, but I forgot to explicitly mention that the solution will have to work with the query_string query, mainly due to the fact that the OData expressions (however complex they might be) will keep getting translated into their Lucene equivalent. I'm aware that we're trading the power of the Elasticsearch Query DSL for Lucene's query syntax, which is a bit less powerful and less expressive, but that's something we can't really change. We're pretty d**n close, though!
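To put it differently: whatever analyzer-based mapping we end up with, it has to be reachable from a plain Lucene-syntax expression, something along these lines (the sub-field name name.shingled is made up):

"query": {
  "query_string": {
    "query": "name.shingled:\"table 1\""
  }
}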
UPDATE 3 (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
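For reference, a minimal example of that mapping and the kind of query it's meant for, adapted from the linked docs (index and field names are just examples):

PUT my_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "search_as_you_type"
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "table 1",
      "type": "bool_prefix",
      "fields": ["name", "name._2gram", "name._3gram"]
    }
  }
}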