Reconstructing now-famous 17-year-old's Markov-chain-based information-retrieval algorithm "Apodora"

While we were all twiddling our thumbs, a 17-year-old Canadian boy has apparently devised an information retrieval algorithm that:

a) performs with twice the precision of the current, widely-used vector space model;

b) is 'fairly accurate' at identifying similar words;

c) makes microsearch more accurate.

Here is a good interview.

Unfortunately, there's no published paper that I can find yet, but from what I remember of the graphical models and machine learning classes I took a few years ago, I think we should be able to reconstruct it from his submission abstract and what he says about it in interviews.

From the interview:

Some searches find words that appear in similar contexts. That’s pretty good, but that’s following the relationships to the first degree. My algorithm tries to follow connections further. Connections that are close are deemed more valuable. In theory, it follows connections to an infinite degree.
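
(That "closer connections are more valuable ... to an infinite degree" phrasing maps naturally onto a geometrically discounted sum of powers of a co-occurrence transition matrix, which has a closed form. A minimal sketch of that reading in Python follows; the toy vocabulary and counts are invented, and this is only a guess at the idea, not his actual algorithm.)

```python
import numpy as np

# Hypothetical toy vocabulary and raw co-occurrence counts (invented).
vocab = ["cat", "dog", "pet", "kernel", "linux"]
C = np.array([
    [0, 4, 3, 0, 0],
    [4, 0, 3, 0, 0],
    [3, 3, 0, 0, 1],
    [0, 0, 0, 0, 5],
    [0, 0, 1, 5, 0],
], dtype=float)

# Row-normalize into a Markov-chain-like transition matrix P:
# P[i, j] = probability of "stepping" from word i to word j.
P = C / C.sum(axis=1, keepdims=True)

# Follow connections to every degree at once, discounting a k-step
# connection by alpha**k so that closer connections count more:
#   S = sum_{k>=1} alpha^k P^k = alpha P (I - alpha P)^{-1}
alpha = 0.5
n = P.shape[0]
S = alpha * P @ np.linalg.inv(np.eye(n) - alpha * P)

# Words most related to "cat" under this infinite-degree similarity.
i = vocab.index("cat")
for j in np.argsort(-S[i]):
    print(f"{vocab[i]} -> {vocab[j]}: {S[i, j]:.3f}")
```

The closed form exists because P is row-stochastic, so the spectral radius of alpha\*P stays below 1 and the infinite series converges.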

And the abstract puts it in context:

A novel information retrieval algorithm called "Apodora" is introduced, using limiting powers of Markov chain-like matrices to determine models for the documents and making contextual statistical inferences about the semantics of words. The system is implemented and compared to the vector space model. Especially when the query is short, the novel algorithm gives results with approximately twice the precision and has interesting applications to microsearch.
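
(For contrast, the baseline the abstract measures against is the classic vector space model: TF-IDF vectors ranked by cosine similarity. Here's a minimal self-contained version with an invented toy corpus; note how a one-word query like "cat" can only match documents containing that exact token, which is presumably the weakness on short queries that Apodora targets.)

```python
import numpy as np

# The baseline from the abstract: the vector space model
# (TF-IDF + cosine similarity), on an invented toy corpus.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "the linux kernel is open source",
]
query = "cat"  # short queries are where the abstract claims the biggest gain

vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Term-frequency matrix: tf[d, t] = count of term t in document d.
tf = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc.split():
        tf[d, idx[w]] += 1

# Smoothed inverse document frequency.
df = (tf > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + df)) + 1
tfidf = tf * idf

# Query vector in the same space; out-of-vocabulary words contribute nothing.
q = np.zeros(len(vocab))
for w in query.split():
    if w in idx:
        q[idx[w]] += idf[idx[w]]

# Rank documents by cosine similarity.
norms = np.linalg.norm(tfidf, axis=1) * np.linalg.norm(q)
scores = tfidf @ q / np.where(norms == 0, 1.0, norms)
for d in np.argsort(-scores):
    print(f"{scores[d]:.3f}  {docs[d]}")
```

With this query, the "dogs and cats" document scores zero: "cat" and "cats" share no token, so plain VSM misses it entirely. Any higher-order co-occurrence smoothing would presumably recover exactly such matches.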

I feel like someone who knows about Markov-chain-like matrices or information retrieval would immediately recognize what he's doing.

So: what is he doing?

Ova answered 6/8, 2011 at 15:24 Comment(6)
hmmm, I wonder how similar this is to singular value decomposition vs. HMMs or something? Maybe it's a series of consecutive co-occurrence matrices? - Posturize
Maybe on a high level, but I doubt you can know exactly what he's doing. He's probably looking at co-occurrences of higher order than first degree and modeling this with some kind of Markov process. But there may be many ways to do this. - Skydive
Impossible to answer, really. The comment "In theory, it follows connections to an infinite degree." suggests to me an eigendecomposition of a term co-occurrence matrix (i.e. something like Latent Semantic Analysis). - Deaconess
Many algorithms can have the same abstract. Without a paper, we have only hype. - Egyptology
Beyond the algorithmic stuff, I have trouble believing such improved performance without better data/improved features, specifically some sort of semantic model. Machine learning is only as good as the data you feed in; e.g. Google Translate drawing on essentially the entire corpus of multilingual transcripts ever :( - Posturize
Exactly, I don't quite believe it either... that's why I want to take a stab at reconstructing it, so I can test it for myself... and (if it works) reuse it in my other research. - Ova

From the use of words like 'context' and the fact that he's introduced a second-order level of statistical dependency, I suspect he is doing something related to the LDA-HMM method outlined in: Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems. There are some inherent limits to the resolution of the search due to model averaging. However, I'm envious of anyone doing stuff like this at 17, and I hope to heck he's done something independent and at least incrementally better. Even a different direction on the same topic would be pretty cool.
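
One concrete (and deliberately crude) reading of the abstract's "limiting powers of Markov chain-like matrices to determine models for the documents": fit a Markov chain to each document's word sequence, take the chain's limiting power, i.e. its stationary distribution, as the document model, and score queries against that. A hedged sketch, far simpler than the full LDA-HMM, with the toy corpus and every name invented:

```python
import numpy as np

docs = [
    "the cat chased the dog and the dog chased the cat",
    "the kernel schedules processes and the kernel manages memory",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def document_model(doc, smoothing=0.1):
    """Stationary distribution of an add-k smoothed bigram chain."""
    counts = np.full((V, V), smoothing)
    words = doc.split()
    for a, b in zip(words, words[1:]):
        counts[idx[a], idx[b]] += 1
    P = counts / counts.sum(axis=1, keepdims=True)
    # Limiting power: repeatedly square P; smoothing makes the chain
    # irreducible, so P^(2^k) converges to a rank-one matrix whose
    # every row is the stationary distribution.
    for _ in range(50):
        P = P @ P
    return P[0]

models = [document_model(d) for d in docs]

def score(query, pi):
    """Log-probability of the query's in-vocabulary words under the model."""
    return sum(np.log(pi[idx[w]]) for w in query.split() if w in idx)

query = "cat dog"
for d, pi in enumerate(models):
    print(f"doc {d}: log-score {score(query, pi):.2f}")
```

Under this toy model the cat/dog document wins easily, since the other document's chain gives those words only smoothing mass.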

Mescaline answered 8/8, 2011 at 21:6 Comment(1)
I think you're absolutely right - it really does sound like he's modeling each document as a sequence of words generated by some Markov process. Thanks for this spot-on reference! - Ova
