App Engine Search API (Document Search) - Multiple Languages

I have Documents that I'd like to make searchable in 3 different languages. Since I can have multiple fields with the same name/type, the following Document structure works (this is a simplified example).

from google.appengine.api import search

document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
        search.TextField(
            name="name",
            language="es",
            value="perro"),
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ]
)
index = search.Index("my_index")
index.put(document)

Specifying the language helps Google tokenize the value of the TextField.

The following queries all work, each returning one result:

print index.search("name: dog")
print index.search("name: perro")
print index.search("name: chien")

Here is my question: Can I restrict a search to only target fields with a specific language?

The purpose is to avoid false positives. Since all three languages use the Latin alphabet, someone performing a full-text search in Spanish may see English results that are not relevant.

Thank you.

Tortosa answered 22/6, 2017 at 6:57 Comment(1)
You could call the Google Translate API for language detection and use the result in the query: get_index(detected_lang).search(query). Alternatively, translate the search term into the language of the stored data and search on the translation. - Unbolted

You can use facets to attach metadata to a document, that is, values that are not part of its searchable fields. Here the facets indicate which languages appear in the document.

Document insertion:

index = search.Index("my_index")

# Document with all three languages.
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="en",
            value="dog"),
        search.TextField(
            name="name",
            language="es",
            value="perro"),
        search.TextField(
            name="name",
            language="fr",
            value="chien")
    ],
    facets=[
        search.AtomFacet(name='lang', value='en'),
        search.AtomFacet(name='lang', value='es'),
        search.AtomFacet(name='lang', value='fr'),
    ],
)
index.put(document)

# Document with Spanish and French only.
document = search.Document(
    fields=[
        search.TextField(
            name="name",
            language="es",
            value="gato"),
        search.TextField(
            name="name",
            language="fr",
            value="chat")
    ],
    facets=[
        # no English in this document, so leave out lang='en'
        search.AtomFacet(name='lang', value='es'),
        search.AtomFacet(name='lang', value='fr'),
    ],
)
index.put(document)

Query:

index = search.Index("my_index")
query = search.Query(
    '', # query all documents, cats and dogs.
    # filter docs by language facet
    facet_refinements=[
        search.FacetRefinement('lang', value='en'),
    ])

results = index.search(query)
for doc in results:
    result = {}
    for f in doc.fields:
        # filter fields by language
        if f.language == 'en':
            result[f.name] = f.value
    print result

Should print {u'name': u'dog'}.

Note that although the refinement fetches only documents that contain English, those documents still carry fields in other languages. This is why we iterate through the fields and add only the English ones to result.
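
The field-filtering step can be pulled into a small helper. This is only a sketch: the Field namedtuple is a stand-in introduced here so the snippet runs outside App Engine, and fields_for_language is a hypothetical name. It relies only on the name, value and language attributes that the loop above already reads.

```python
from collections import namedtuple

# Stand-in for the Search API field objects (an assumption for this sketch);
# the loop above reads the same three attributes from real fields.
Field = namedtuple("Field", ["name", "value", "language"])


def fields_for_language(fields, lang):
    """Return {name: value} for the fields tagged with the given language."""
    return {f.name: f.value for f in fields if f.language == lang}


fields = [
    Field("name", "dog", "en"),
    Field("name", "perro", "es"),
    Field("name", "chien", "fr"),
]
print(fields_for_language(fields, "en"))
```

With real results you would pass doc.fields for each returned document instead of the hand-built list.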

If you want to know more about the more general use case for faceted search, this answer gives a pretty good idea.

Broderick answered 24/6, 2017 at 7:40 Comment(9)
Turns out you need to use FacetRefinements instead of FacetRequests. The former selects documents by facet; the latter only tells you which facets are available. - Broderick
document = search.Document( doc_id=str("1"), fields=[ search.TextField(language="en", name="name", value="one"), search.TextField(language="es", name="name", value="uno") ]) index.put(document) document = search.Document( doc_id=str("2"), fields=[ search.TextField(language="en", name="name", value="uno"), search.TextField(language="es", name="name", value="one") ]) index.put(document) index.search(search.Query( "name: one", facet_refinements=[ search.FacetRefinement("lang", value="en") ])) - Tortosa
The above code is quite gross, but SO won't let me format it in the comment. It's a case where FacetRefinement returns zero results despite there being a match. Do you know why? - Tortosa
@user326502 that's because you didn't add a facets parameter (with the accompanying AtomFacets) to your document. - Broderick
When I add AtomFacets to my document and then query with a FacetRefinement of lang="en", both documents are returned, which isn't really what I'm looking for. I'm trying to filter out the documents where the field has a match but the language does not. - Tortosa
To clarify, I'm trying to search only the English fields, and none of the other ones. - Tortosa
@user326502 the idea is to put AtomFacet(name='lang', value='en') only in documents that have a TextField with language='en'. - Broderick
@user326502 I extended my example to clarify. - Broderick
Thanks. Each document will have all 3 of the same languages. I suppose I could add separate documents for each distinct language. That changes how I planned on assigning doc_id, but that's not a big deal. - Tortosa
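
The idea from the last comment, one document per language, could be sketched roughly like this. The helper per_language_docs and the "<base_id>-<lang>" doc_id scheme are assumptions made for illustration, not something the thread settled on. The snippet only builds plain dictionaries so that it runs outside App Engine; the comment inside the loop shows how each one might map onto the calls used in the answer.

```python
def per_language_docs(base_id, values_by_lang, field_name="name"):
    """Split one multilingual record into one payload per language.

    Each payload gets a doc_id of the form "<base_id>-<lang>" (a
    hypothetical scheme) so the documents of one record stay easy to
    correlate.
    """
    payloads = []
    for lang in sorted(values_by_lang):
        payloads.append({
            "doc_id": "{}-{}".format(base_id, lang),
            "lang": lang,
            "field_name": field_name,
            "value": values_by_lang[lang],
        })
    return payloads


payloads = per_language_docs("1", {"en": "one", "es": "uno"})
for p in payloads:
    # On App Engine each payload would become, roughly:
    #   search.Document(
    #       doc_id=p["doc_id"],
    #       fields=[search.TextField(name=p["field_name"],
    #                                language=p["lang"], value=p["value"])],
    #       facets=[search.AtomFacet(name="lang", value=p["lang"])])
    print(p["doc_id"])
```

Because every document then holds a single language, a FacetRefinement on lang should leave only documents whose matching field is in that language.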

You could use a separate index for each language.

Define a utility function for resolving the correct index for a given language:

def get_index(lang):
    return search.Index("my_index_{}".format(lang))

Insert documents:

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="en",
        value="dog"),
    ])

get_index('en').put(document)

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="fr",
        value="chien")
    ])

get_index('fr').put(document)

Query by language:

query = search.Query(
    'name: chien')

results = get_index('fr').search(query)

for doc in results:
    print doc
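
One possible refinement, sketched under assumptions: since the index name is derived from user-supplied input, it may be worth rejecting unknown language codes before building the name, so that a typo does not end up targeting an unintended index. SUPPORTED_LANGS and index_name are hypothetical names introduced for this sketch, and the snippet deliberately avoids the App Engine import so it runs anywhere.

```python
# Assumption: the three languages from the question.
SUPPORTED_LANGS = ("en", "es", "fr")


def index_name(lang):
    """Return the per-language index name, rejecting unknown languages."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError("unsupported language: {!r}".format(lang))
    return "my_index_{}".format(lang)


print(index_name("fr"))
```

get_index could then return search.Index(index_name(lang)).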
Broderick answered 28/6, 2017 at 16:24 Comment(1)
I took a similar approach by using separate fields for each language and appending the language code to the search field name. That's my fallback approach, but I'm hoping to find a better solution here. - Tortosa
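
The fallback described in this comment could look roughly like the sketch below. The "<name>_<lang>" naming scheme and both helper names are assumptions; the comment does not say exactly how the language code is appended. The helpers only build strings, so they run outside App Engine.

```python
def localized_field(name, lang):
    """Build a per-language field name such as "name_es"."""
    return "{}_{}".format(name, lang)


def localized_query(name, lang, term):
    """Build a query string that targets only one language's field."""
    return "{}: {}".format(localized_field(name, lang), term)


print(localized_query("name", "es", "perro"))
```

Documents would then be stored with fields named via localized_field("name", lang), and the resulting string passed to index.search(). The search term is inserted as-is, so any quoting or escaping is left to the caller.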
