Ruby Text Analysis

3

13

Is there a Ruby gem or other tool for text analysis: word frequency, pattern detection, and so forth (preferably with an understanding of French)?

Shirberg answered 29/9, 2011 at 21:16 Comment(0)
9

The generalization of word frequencies is the language model, e.g. unigrams (= single-word frequencies), bigrams (= frequencies of word pairs), trigrams (= frequencies of word triples), ..., in general: n-grams.
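To make the n-gram idea concrete, here is a minimal Ruby sketch that counts n-grams of any order in a short string. The tokenization (lowercase, then runs of Unicode letters) and the helper name `ngram_counts` are assumptions for illustration; a real corpus would need more careful handling of apostrophes, hyphens, and sentence boundaries.

```ruby
# Count n-grams of order n in a text.
# Assumption: a token is a run of Unicode letters, compared case-insensitively.
def ngram_counts(text, n)
  tokens = text.downcase.scan(/\p{L}+/)
  counts = Hash.new(0)
  # each_cons yields every window of n consecutive tokens as an array.
  tokens.each_cons(n) { |gram| counts[gram] += 1 }
  counts
end

text = "La République française et la République italienne"
unigrams = ngram_counts(text, 1)
bigrams  = ngram_counts(text, 2)

puts unigrams[["la"]]               # 2
puts bigrams[["la", "république"]]  # 2
```

This is fine for small texts; for the huge corpora mentioned below, the dedicated toolkits are the better choice.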

You should look for an existing language-model toolkit; re-inventing the wheel here is not a good idea.

There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.

These toolkits are typically written in C (for speed, because you have to process huge corpora) and write their output as standard ARPA n-gram files, which are typically plain text.

Check the following thread, which contains more details and links:

Building openears compatible language model

Once you have generated your language model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible in Ruby, or a way to convert the ARPA format into your own format.
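If you go the conversion route, reading an ARPA file into a Ruby hash is not much code. The sketch below assumes the usual layout: a `\data\` header, one `\N-grams:` section per order, lines of the form "log10 probability, the N words, optional backoff weight" separated by whitespace, and a closing `\end\`. The function name `parse_arpa` and the tiny sample model are made up for illustration; check the output of your toolkit before relying on this.

```ruby
# Parse ARPA n-gram text into { order => { [w1, ...] => { logprob:, backoff: } } }.
# Assumption: fields are whitespace-separated and words contain no spaces.
def parse_arpa(text)
  model = Hash.new { |hash, order| hash[order] = {} }
  order = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if (match = line.match(/\A\\(\d+)-grams:\z/))
      order = match[1].to_i            # start of an "\N-grams:" section
    elsif line.start_with?("\\")
      order = nil                      # "\data\" or "\end\" marker
    elsif order
      fields  = line.split
      logprob = Float(fields[0])
      words   = fields[1, order]
      # The backoff weight is optional (absent for the highest order).
      backoff = fields[order + 1] ? Float(fields[order + 1]) : nil
      model[order][words] = { logprob: logprob, backoff: backoff }
    end
  end
  model
end

# Hypothetical toy model, only to show the expected shape of the input.
sample_arpa = <<~'ARPA'
  \data\
  ngram 1=3
  ngram 2=1

  \1-grams:
  -0.7 <s> -0.3
  -0.5 bonjour -0.2
  -0.9 </s>

  \2-grams:
  -0.1 <s> bonjour

  \end\
ARPA

model = parse_arpa(sample_arpa)
puts model[1][["bonjour"]][:backoff]        # -0.2
puts model[2][["<s>", "bonjour"]][:logprob] # -0.1
```

As noted in the comments below, holding a large model in a hash like this is memory-intensive, so you may want to keep only the orders you need.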

adi92's post lists some more Ruby NLP resources.

You can also search for "ARPA Language Model" for more information.

Last but not least, check Google's online N-gram tool. Google built n-grams from the books it digitized, and they are also available in French and other languages!

Billhook answered 29/9, 2011 at 22:3 Comment(5)
Many thanks for your answer, I'll check those resources. But both answers tend to encourage me to write my own routines, maybe downsized a bit. - Shirberg
To create reliable n-gram statistics, you will need one or more very large training corpora of text data; e.g. the collection of all WSJ articles for a given period of time could be such a corpus. Processing such large amounts is very time-consuming. I love Ruby in general, but for this task a dedicated C tool is probably better suited. Once you have accumulated the statistics, you can use the resulting n-grams in a Ruby program; that's memory-intensive, but not time-intensive. - Billhook
Those training corpora are typically domain-specific. Make sure you have such text data available in large quantities, otherwise your language model will be over-fitted: it will not generalize to new data and will be basically useless. - Billhook
My requirements are not so strict. I just need a basic analysis of text: things like the most used word, syllable counts, statistical comparison, etc. I don't need context, semantics or anything like that, just a basic understanding of vocabulary to be able to identify plurals, simple spelling errors, word similarities and such. - Shirberg
And patterns, like idioms or grouped words such as "United States" or "République française" (did I mention French?). Your comments are very helpful, thanks. - Shirberg
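For the simple spelling errors and word similarities mentioned in the comments above, one common technique (not something either answer prescribes) is edit distance against a known vocabulary. Below is a minimal sketch of Levenshtein distance in plain Ruby; `similar_words`, the distance threshold, and the toy vocabulary are hypothetical, and accented characters are assumed to be in composed (NFC) form.

```ruby
# Levenshtein edit distance: minimum number of single-character
# insertions, deletions, and substitutions turning a into b.
def levenshtein(a, b)
  a_chars = a.chars
  b_chars = b.chars
  previous = (0..b_chars.length).to_a   # distances from "" to each prefix of b
  a_chars.each_with_index do |a_char, i|
    current = [i + 1]
    b_chars.each_with_index do |b_char, j|
      cost = a_char == b_char ? 0 : 1
      current << [previous[j + 1] + 1,    # delete a_char
                  current[j] + 1,         # insert b_char
                  previous[j] + cost].min # substitute (or keep)
    end
    previous = current
  end
  previous.last
end

# Hypothetical helper: vocabulary entries within max_distance edits of word.
def similar_words(word, vocabulary, max_distance = 1)
  vocabulary.select { |candidate| levenshtein(word, candidate) <= max_distance }
end

vocabulary = %w[république française chat chats]
p similar_words("chat", vocabulary)        # ["chat", "chats"]
p similar_words("republique", vocabulary)  # ["république"]
```

A distance of 1 catches many regular plurals and missing accents, but it is only a heuristic; irregular forms would need a real lexicon.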
4

The Mendicant Bug: NLP Resources for Ruby contains lots of useful Ruby NLP links.
I tried using the Ruby Linguistics library a long time ago and remember having a lot of problems with it, so I don't recommend jumping into that.

If most of your text analysis involves things like counting n-grams and naive Bayes, I recommend just doing it on your own. Ruby has pretty good basic libraries and awesome support for regexes, so this should not be that tricky, and it will be easier for you to adapt the code to the idiosyncrasies of the problem you are trying to solve.
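As a rough idea of how little code the do-it-yourself route needs, here is a minimal multinomial naive Bayes classifier with add-one (Laplace) smoothing in plain Ruby. The class name, the regex tokenization, and the toy training sentences are all assumptions for illustration, not a tuned implementation.

```ruby
# Minimal multinomial naive Bayes text classifier with add-one smoothing.
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    @doc_counts  = Hash.new(0)
    @vocabulary  = {}
  end

  # Assumption: a token is a lowercase run of Unicode letters.
  def tokenize(text)
    text.downcase.scan(/\p{L}+/)
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the label with the highest log posterior score.
  def classify(text)
    total_docs = @doc_counts.values.sum
    tokens = tokenize(text)
    scores = @doc_counts.keys.map do |label|
      total_words = @word_counts[label].values.sum
      log_prob = Math.log(@doc_counts[label].to_f / total_docs)
      tokens.each do |word|
        count = @word_counts[label][word] + 1   # add-one smoothing
        log_prob += Math.log(count.to_f / (total_words + @vocabulary.size))
      end
      [label, log_prob]
    end
    scores.max_by { |_, score| score }.first
  end
end

classifier = NaiveBayes.new
classifier.train(:fr, "le chat mange la souris")
classifier.train(:fr, "la république française")
classifier.train(:en, "the cat eats the mouse")
classifier.train(:en, "the united states")

p classifier.classify("la souris et le chat")  # :fr
p classifier.classify("the cat")               # :en
```

Working in log space avoids underflow on longer texts, and the smoothing keeps unseen words from zeroing out a class.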

As with the Stanford parser gem, it's possible to use Java libraries that solve your problem from within Ruby, but this can be tricky, so it's probably not the best way to solve a problem.

Boesch answered 29/9, 2011 at 21:31 Comment(1)
Yeah, I saw the Java libraries while searching. They look interesting but heck, I'm a Ruby fanboy ^^ I hoped there would be some simple library that would save me the time needed to develop simple analysis, statistics and so on. Thanks for your answer. - Shirberg
0

I wrote the gem words_counted for this reason. You can see a demo on rubywordcount.com. It has a lot of the analysis features you mention, and a host more. The API is well documented in the readme on GitHub.

Chromatism answered 27/10, 2014 at 19:8 Comment(0)
