Getting weird markup from Google translate like ~~POS=TRUNC

I'm suddenly getting some strange markup when translating phrases with the Google Translate API via the Java library. Examples for English → Swedish include:

Vector graphics → vektor~~POS=TRUNC grafikk~~POS=HEADCOMP

Javascript → Javascript script~~POS=HEADCOMP

It looks like it's related to compound noun handling. Is this a feature of the API that I can deactivate somehow or is this a new bug on the server side?

Blunderbuss answered 9/11, 2016 at 11:32 Comment(1)
This is a bug in the server-side translator; the API itself is fine. Hyaline

This looks like a bug in the server-side translator. I also get it on the website: https://translate.google.com/#view=home&op=translate&sl=ru&tl=no&text=%D0%9E%D0%B1%D1%89%D0%B5%D0%B6%D0%B8%D1%82%D0%B8%D0%B5 gives me vandrer~~POS=TRUNC.

In NLP, "POS" means part of speech. "HEADCOMP" sounds like it could be the head of a noun compound, and I'm guessing they TRUNCate the non-head parts of compounds (which are practically never inflected). So Google Translate is leaking some of its internals. What's surprising is that such tags are a staple of rule-based/knowledge-based systems, whereas Google typically uses pure machine learning methods and shuns hard-coded knowledge. One possibility is that they used a noun-compound analyser to expand their training set, which they then ran ML on (similar to how Systran & Koehn trained statistical MT on a parallel corpus translated with a rule-based MT system), but had a bug in the script that cleans up the tags before training.

It'd be fun to find out which system they used, in case it was an open-source one. Unfortunately, the tags are practically ungoogleable, since the web is now littered with spammy machine-translated (and non-post-edited) pages full of them.
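
Since the tags come from the server, there is probably nothing to switch off on the client, but they can be stripped from the returned string as a stopgap. Below is a minimal sketch in Java. It assumes every leaked tag has the shape seen in the examples in this thread ("~~POS=" followed by uppercase letters); the class name PosTagStripper and the method strip are hypothetical and not part of the Google library.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Google Translate Java library.
class PosTagStripper {
    // Assumed tag shape, inferred only from the examples in this thread:
    // "~~POS=" followed by one or more uppercase ASCII letters.
    private static final Pattern POS_TAG = Pattern.compile("~~POS=[A-Z]+");

    /** Removes the leaked tags and leaves the rest of the string untouched. */
    static String strip(String translated) {
        return POS_TAG.matcher(translated).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("vektor~~POS=TRUNC grafikk~~POS=HEADCOMP"));
        System.out.println(strip("Javascript script~~POS=HEADCOMP"));
    }
}
```

This only removes the markup. It does not repair the translation itself, so "Javascript script~~POS=HEADCOMP" still comes out as "Javascript script".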

Perishable answered 19/2, 2020 at 8:48 Comment(0)

It seems it has to do with the way Google "translates" strings, returning what is statistically most likely correct. Common Unix commands might therefore end up in your translation.

More discussion about the topic: https://www.reddit.com/r/German/comments/47kfah/thanks_google/

Devanagari answered 20/4, 2017 at 10:52 Comment(0)