Latest Pre-trained Multilingual Word Embeddings
Are there any recent pre-trained multilingual word embeddings (i.e. multiple languages jointly mapped to the same vector space)?

I have looked at the following but they don't fit my needs:

  1. FastText / MUSE (https://fasttext.cc/docs/en/aligned-vectors.html): these seem too old, and the aligned word vectors do not use subword / wordpiece information.
  2. LASER (https://github.com/yannvgn/laserembeddings): I'm using this one now; it uses subword information (via BPE). However, it is suggested not to use it for word embeddings because it is designed to embed sentences (https://github.com/facebookresearch/LASER/issues/69). A minimal usage sketch follows this list.
  3. BERT multilingual (bert-base-multilingual-uncased in https://huggingface.co/transformers/pretrained_models.html): it produces contextualised embeddings meant for sentences, and does not seem well suited to embedding words without context.
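
For reference, here is roughly how I'm using laserembeddings today (option 2 above); each word is embedded as a one-word "sentence", which is exactly what the linked issue advises against:

    # Rough sketch of my current LASER setup: every word becomes a
    # one-word "sentence"; embed_sentences returns an (n, 1024) numpy array.
    from laserembeddings import Laser

    laser = Laser()
    vectors = laser.embed_sentences(["football", "soccer"], lang="en")
    print(vectors.shape)  # (2, 1024)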

Here is the problem I'm trying to solve:

I have a list of company names, which can be in any language (mainly English), and a list of keywords in English; I want to measure how close a given company name is to the keywords. Right now I have a simple keyword-matching solution, but I want to improve it using pre-trained embeddings. As you can see in the following examples, there are several challenges:

  1. The keyword and the brand name are not separated by a space (currently I use the "wordsegment" package to split names into words), so an embedding with subword information should help a lot. A sketch of this pipeline follows the examples below.
  2. The keyword list is not extensive, and company names can be in different languages (that is why I want to use embeddings: "soccer" is close to "football").

Examples of company names: "cheapfootball ltd.", "wholesalefootball ltd.", "footballer ltd.", "soccershop ltd."

Examples of keywords: "football"
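
To make challenge 1 concrete, here is a minimal sketch of the current pipeline; embed() is a hypothetical stand-in for whatever multilingual word-embedding model ends up being used:

    # Minimal sketch of the matching idea; embed() is a hypothetical
    # stand-in for a multilingual word-embedding lookup.
    import numpy as np
    from wordsegment import load, segment

    load()  # load wordsegment's corpus statistics once

    def match_score(company_name, keyword, embed):
        """Max cosine similarity between any segment of the name and the keyword."""
        kw = embed(keyword)
        best = 0.0
        for piece in segment(company_name):  # "cheapfootball" -> ["cheap", "football"]
            v = embed(piece)
            sim = float(np.dot(v, kw) / (np.linalg.norm(v) * np.linalg.norm(kw) + 1e-9))
            best = max(best, sim)
        return best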

Bundestag answered 15/6, 2020 at 9:13 Comment(3)
"the word vectors are not using subwords / wordpiece information" - No, fastText-based word embeddings are built from character n-gram subwords. See: github.com/facebookresearch/fastText/issues/475 – Jovitta
You are right that most fastText-based word embeddings use subwords, especially the ones that can be loaded with "fasttext.load_model". However, the one I was referring to (fasttext.cc/docs/en/aligned-vectors.html) is only available in "text" format, and it does not use subword information. – Bundestag
Sorry, I didn't see that your link pointed to "aligned word vectors" :) – Jovitta

Check if this would do:


If you're okay with whole word embeddings:
(Both of these are somewhat old, but I'm putting them here in case they help someone)


If you're okay with contextual embeddings:


You can even try using the (sentencepiece-tokenised) non-contextual input word embeddings of multilingual transformer implementations like XLM-R or mBERT, instead of their output contextual embeddings. (Not sure how well it will perform.)
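
If it helps, here is a minimal sketch of that idea with the Hugging Face transformers library, pulling the non-contextual input (wordpiece) embedding table out of mBERT and averaging the pieces of each word; how well these vectors work as word embeddings is untested:

    # Sketch: use mBERT's non-contextual input embeddings as word vectors
    # by averaging the wordpieces of each word. Quality is unverified.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "bert-base-multilingual-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    emb = model.get_input_embeddings()  # the raw nn.Embedding lookup table

    def word_vector(word):
        ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        with torch.no_grad():
            return emb(torch.tensor(ids)).mean(dim=0)  # average over wordpieces

    sim = torch.nn.functional.cosine_similarity(
        word_vector("football"), word_vector("fußball"), dim=0)
    print(float(sim))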

Jovitta answered 27/7, 2020 at 11:58 Comment(0)

I think it might be a little misleading to build an embedding-based model for this application (learned by experience): if there are two companies, football ltd and soccer ltd, the model might say both are a match, which might not be right. One approach is to remove redundant words, e.g. "corporation" from "Facebook corporation" and "ltd" from "Facebook ltd", and then try matching.
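
For the redundant-word removal, a quick illustrative sketch (the suffix list here is just an example, not exhaustive):

    # Illustrative sketch: strip common legal suffixes before matching.
    # The suffix list is an example, not exhaustive.
    import re

    LEGAL_SUFFIXES = re.compile(r"\s*\b(ltd|llc|inc|corp|corporation|gmbh|co)\.?\s*$", re.I)

    def strip_legal_suffix(name):
        return LEGAL_SUFFIXES.sub("", name).strip()

    print(strip_legal_suffix("cheapfootball ltd."))    # cheapfootball
    print(strip_legal_suffix("Facebook corporation"))  # Facebook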

Another approach is to use deepmatcher, which does deep-learning-based fuzzy matching using word context. Link

If sentence similarity is the primary approach you want to follow, the STSBenchmark algorithms might be worth exploring: Link

Sent2vec (link) and InferSent (Link) use fastText but seem to have good results on STSBenchmark.

Samora answered 18/6, 2020 at 16:18 Comment(1)
Thanks for your reply, but I'm not trying to match company names with other company names; I'm trying to match company names against a certain topic (e.g. football in this case). Also, I already preprocess the names, e.g. removing company types (ltd), before applying the embeddings. I'll look into your links to see if there's something I can try. – Bundestag
