As I understand it, distant supervision is the process of identifying the concept that the individual words of a passage, usually a sentence, are trying to convey.
For example, a database maintains the structured relationship concerns(NLP, this sentence).
Our distant supervision system would take as input the sentence: "This is a sentence about NLP."
Based on this sentence, it would recognize the entities NLP and this sentence, since as a pre-processing step the sentence would have been passed through a named-entity recognizer. Since our database says that NLP and this sentence are related by the bond of concern(s), it would identify the input sentence as expressing the relationship concerns(NLP, this sentence).
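To make my understanding concrete, here is a minimal sketch of the labeling step as I imagine it. Everything in it is an assumption on my part: the toy `KB` dictionary, the function name `distant_label`, and the idea that the named-entity recognizer has already handed us the entity list. I don't know whether real systems are organized this way.

```python
from itertools import permutations

# Hypothetical toy knowledge base: (entity1, entity2) -> relation name.
KB = {("NLP", "this sentence"): "concerns"}

def distant_label(text, entities, kb):
    """Label a sentence with every KB relation whose two entities both occur in it.

    `entities` is assumed to be the output of an earlier named-entity
    recognition step; no recognition is done here.
    """
    examples = []
    # Try every ordered pair, because the KB stores directed relations.
    for e1, e2 in permutations(entities, 2):
        relation = kb.get((e1, e2))
        if relation is not None:
            examples.append({"relation": relation, "args": (e1, e2), "text": text})
    return examples

sentence = "This is a sentence about NLP."
entities = ["this sentence", "NLP"]  # assumed NER output, not computed here
training = distant_label(sentence, entities, KB)
print(training)
```

Under these assumptions, the sentence ends up stored as one training example for concerns(NLP, this sentence), purely because both entities co-occur in it and the database already relates them.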
My question is twofold:
1) What is the use of that? Is it that later our system might see a sentence "in the wild", such as "That sentence is about OPP", realize that it has seen something similar before, and thereby infer the novel relationship concerns(OPP, that sentence), based only on the words/individual tokens?
2) Does it take into account the actual words of the sentence? For instance, does it use the verb 'is' and the preposition 'about' to realize (through WordNet or some other hyponymy system) that this is somehow similar to the higher-order concept "concerns"?
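To illustrate what I am asking in both questions, this is roughly the mechanism I have in mind. It is only a guess: the helper names `context_features` and `overlap`, the hand-supplied entity mentions, and the use of plain token overlap (rather than WordNet or a trained classifier) are all my own assumptions.

```python
def context_features(text, entity_mentions):
    """Return the lowercased non-entity tokens of a sentence as a set.

    `entity_mentions` are the surface strings of the entities, assumed to
    come from an earlier named-entity recognition step.
    """
    mention_tokens = {tok.lower() for m in entity_mentions for tok in m.split()}
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    return {t for t in tokens if t and t not in mention_tokens}

def overlap(a, b):
    """Jaccard similarity between two feature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Sentence already labeled as expressing concerns(NLP, this sentence).
seen = context_features("This is a sentence about NLP.", ["This", "sentence", "NLP"])
# Unlabeled sentence found "in the wild".
new = context_features("That sentence is about OPP", ["That sentence", "OPP"])

score = overlap(seen, new)
print(seen, new, score)
```

If something like this is what happens, the shared context words ('is', 'about') would be what lets the system propose concerns(OPP, that sentence) for the new sentence.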
Does anyone have some code used to generate a distant supervision system that I could look at, i.e. a system that cross-references a KB, such as Freebase, and a corpus, such as the NYTimes, and produces a distant supervision database? I think that would go a long way toward clarifying my conception of distant supervision.