What is distant supervision?

Asked 11/4, 2015 at 8:29 Answered 19/3, 2016 at 22:47

Solved nlp stanford-nlp supervised-learning unsupervised-learning

As I understand it, distant supervision is the process of labeling a passage, usually a sentence, with the concept that its individual words are trying to convey.

For example, suppose a database holds the structured relationship Concerns(NLP, this sentence).

Our distant supervision system would take as input the sentence: "This is a sentence about NLP."

It would recognize the entities NLP and this sentence, since as a pre-processing step the sentence would have been passed through a named-entity recognizer.

Since our database records that NLP and this sentence are related by Concerns, it would identify the input sentence as expressing the relationship Concerns(NLP, this sentence).

My question is twofold:

1) What is the use of that? Is it that later our system might see a sentence "in the wild", such as "That sentence is about OPP", recognize that it has seen something similar before, and thereby infer the novel relationship Concerns(OPP, that sentence), based only on the words/individual tokens?

2) Does it take into account the actual words of the sentence? For instance, does it look at the verb 'is' and the preposition 'about' and realize (through WordNet or some other hyponymy system) that this is somehow similar to the higher-order concept "concerns"?

Does anyone have some code used to build a distant supervision system that I could look at, i.e. a system that cross-references a KB, such as Freebase, with a corpus, such as the NYTimes, and produces a distant supervision database? I think that would go a long way toward clarifying my conception of distant supervision.

Eyrie answered 11/4, 2015 at 8:29 Comment(0)

RE 1) Yes, this is exactly right. In the end, what we want is a classifier that takes as input a piece of text and a pair of entity mentions in that text, and tells us what relation holds between those entities in that sentence. Distant supervision is a way of mocking up the training data for this classifier, using a known knowledge base as the "distant" source of labels. But the end goal is the same as in most machine learning tasks: generalize to new sentences.
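As a rough illustration of how that training data gets generated, here is a minimal sketch. Everything in it is an assumption made for the example: the toy `KB`, the toy `CORPUS`, the relation names, and the `distant_label` helper are hypothetical, and plain substring matching stands in for a real named-entity recognizer and entity linker.

```python
# Toy knowledge base: (entity1, entity2) -> relation. Illustrative data only.
KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Google", "Larry Page"): "founded_by",
}

# Toy corpus standing in for something like a news archive.
CORPUS = [
    "Barack Obama was born in Honolulu in 1961.",
    "Barack Obama visited Honolulu last week.",
    "Google was started by Larry Page and Sergey Brin.",
    "Larry Page gave a talk in Honolulu.",
]


def distant_label(corpus, kb):
    """Label every sentence that mentions both entities of a KB fact.

    Returns a list of (sentence, entity1, entity2, relation) tuples.
    """
    examples = []
    for sentence in corpus:
        for (e1, e2), relation in kb.items():
            # Naive substring matching; a real system would use NER output.
            if e1 in sentence and e2 in sentence:
                examples.append((sentence, e1, e2, relation))
    return examples


training_data = distant_label(CORPUS, KB)
for example in training_data:
    print(example)
```

Note that the second sentence gets labeled `born_in` even though it only describes a visit. That is the kind of noisy label the distant supervision assumption produces, and the last sentence gets no label at all because the KB has no fact for that entity pair.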

RE 2) Certainly! Distant supervision only applies to how the training data is generated [1]. Once you've assumed distant supervision, what you're left with is a corpus of (sentence, relation_for_sentence) pairs, and then you extract all of the usual NLP features on the sentence.
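To make "the usual NLP features" a bit more concrete, here is a small sketch of one very simple feature set: the words that appear between the two entity mentions, plus which mention comes first. The `extract_features` function and the feature names are hypothetical choices for illustration; real systems typically add part-of-speech tags, dependency paths, and entity types.

```python
import re


def extract_features(sentence, e1, e2):
    """Return a sparse feature dict for one (sentence, e1, e2) example.

    Features: the order of the two mentions and a bag of the lowercase
    words found between them. Returns {} if either mention is missing.
    """
    i, j = sentence.find(e1), sentence.find(e2)
    if i < 0 or j < 0:
        return {}
    if i < j:
        between, order = sentence[i + len(e1):j], "e1_first"
    else:
        between, order = sentence[j + len(e2):i], "e2_first"
    features = {"order=" + order: 1}
    for token in re.findall(r"[a-z']+", between.lower()):
        key = "between=" + token
        features[key] = features.get(key, 0) + 1
    return features


feats = extract_features(
    "Barack Obama was born in Honolulu in 1961.", "Barack Obama", "Honolulu"
)
print(feats)
```

Feature dicts like this, paired with the distantly assigned relation labels, could then be fed to any standard classifier. This is how the actual words of the sentence (such as "born in") end up driving the prediction on new sentences.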

[1] To a first approximation -- there are "distantly supervised" models (like MultiR and MIML-RE) which don't directly generate fake training data, but incorporate the supervision indirectly into the training procedure itself. But, even in these, there is a factor in the latent-variable model that amounts to a per-sentence classification, and it's just that the output variable is latent rather than naively "observed" as in vanilla distant supervision.

Bivins answered 13/4, 2015 at 2:0 Comment(0)

As I understand it now, the real value of distant supervision is that we can use it to annotate a big corpus without manually considering each sentence, which is very expensive in person-hours. In the end, some of the relationships recognized in sentences will be false, but the result will hopefully be "pretty good", which is useful in some applications, such as academics competing with each other to get marginally better scores on this silly task, and other things such as... (examples are welcome)

Eyrie answered 19/3, 2016 at 22:47 Comment(0)
