Some background
I am a literature student at New College of Florida, currently working on an overly ambitious creative project geared towards the algorithmic generation of poetry. It's written in Python. My Python and Natural Language Processing knowledge comes entirely from teaching myself through the internet. I've been working with this stuff for about a year, so I'm not helpless, but at various points I've had trouble moving this project forward. Currently I'm entering the final phases of development and have hit a little roadblock.
I need to implement some form of grammatical normalization, so that the output doesn't come out as un-conjugated/un-inflected caveman-speak. About a month ago some friendly folks on SO gave me advice on how I might solve this issue, basically by using an ngram language modeller -- but I'm still looking for other solutions, as it seems NLTK's NgramModeler is not fit for my needs. (POS tagging was also mentioned as a possibility, but my text may be too fragmentary and strange for such an implementation to come easily, given my amateurishness.)
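For context, the ngram suggestion amounts to scoring a candidate string by how familiar its word sequences are to some reference corpus. Here is a minimal sketch of that idea using NLTK's Brown corpus and a bigram-count score of my own devising (the corpus choice and scoring function are my assumptions, not anything from the earlier answers):

    import nltk
    from nltk.corpus import brown

    # nltk.download('brown')  # one-time corpus download

    # Count every adjacent (lowercased) word pair in the Brown corpus.
    bigram_counts = nltk.FreqDist(nltk.bigrams(w.lower() for w in brown.words()))

    def bigram_fluency(sentence):
        """Fraction of the sentence's adjacent word pairs seen in the corpus."""
        pairs = list(nltk.bigrams(sentence.lower().split()))
        if not pairs:
            return 0.0
        return sum(1 for p in pairs if bigram_counts[p] > 0) / float(len(pairs))

    # Higher scores mean more of the word pairs were seen in the corpus;
    # candidates full of invented vocabulary score near zero no matter how
    # grammatical they are.
    print(bigram_fluency("The old river bears"))

Something along these lines might work for ordinary English, but my invented vocabulary produces unseen bigrams everywhere, which is the skewing problem I describe under "Things I'd like to avoid" below.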
Perhaps I need something like AtD, but hopefully less complex
I think I need something that works like After the Deadline or Queequeg, but neither of these seems exactly right. Queequeg is probably not a good fit -- it was written in 2003 for Unix, and I can't get it working on Windows for the life of me (I've tried everything). But I do like that all it checks for is proper verb conjugation and number agreement.
On the other hand, AtD is much more rigorous, offering more capabilities than I need. But I can't seem to get the Python bindings for it working. (I get 502 errors from the AtD server, which I'm sure are easy to fix, but my application is going to be online, and I'd rather avoid depending on another server. I can't afford to run an AtD server myself, because the number of "services" my application is going to require from my web host is already threatening to cause problems in getting it hosted cheaply.)
Things I'd like to avoid
Building ngram language models myself doesn't seem right for the task: my application produces a lot of unknown vocabulary, which skews all the results. (Unless I use a corpus so large that it runs far too slowly for my application -- and the application needs to be pretty snappy.)
Strictly checking grammar isn't right for the task either: the grammar doesn't need to be perfect, and the sentences don't have to be any more sensible than the kind of English-like gibberish you can generate using ngrams. Even if it's gibberish, I just need to enforce verb conjugation and number agreement, and do things like remove extra articles.
In fact, I don't even need suggestions for corrections. I think all I need is something that tallies up how many errors seem to occur in each sentence in a group of possible sentences, so I can sort by that score and pick the one with the fewest grammatical issues.
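To make "tallying errors" concrete, here is the kind of shallow checking I have in mind. The count_errors function and its rules are just placeholders I sketched myself (doubled articles, "a" before a vowel, and a crude agreement check on NLTK's POS tags), not an existing tool -- and as I said above, the tagger may well misread my fragmentary text:

    import nltk

    # nltk.download('averaged_perceptron_tagger')  # one-time download for pos_tag

    ARTICLES = {"a", "an", "the"}

    def count_errors(sentence):
        """Tally obvious surface problems in a candidate sentence.
        These rules are only illustrative stand-ins for a real checker."""
        tokens = sentence.split()
        errors = 0

        for w1, w2 in zip(tokens, tokens[1:]):
            w1, w2 = w1.lower(), w2.lower()
            # Doubled-up articles: "the the river", "a the river", ...
            if w1 in ARTICLES and w2 in ARTICLES:
                errors += 1
            # "a" directly before a word starting with a vowel letter.
            if w1 == "a" and w2[0] in "aeiou":
                errors += 1

        # Crude number agreement on adjacent tags: a singular noun followed by
        # a non-3rd-person present verb (NN + VBP), or a plural noun followed
        # by a 3rd-person-singular verb (NNS + VBZ). The tagger can misread
        # fragmentary text, so this is a weak signal at best.
        tags = nltk.pos_tag(tokens)
        for (_, t1), (_, t2) in zip(tags, tags[1:]):
            if (t1 == "NN" and t2 == "VBP") or (t1 == "NNS" and t2 == "VBZ"):
                errors += 1

        return errors

    print(count_errors("The the old river bear"))  # at least the doubled article gets counted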
A simple solution? Scoring fluency by detecting obvious errors
If a script exists that takes care of all this, I'd be overjoyed (I haven't found one yet). I can write code for what I can't find, of course; I'm looking for advice on how to optimize my approach.
Let's say we have a tiny bit of text already laid out:
existing_text = "The old river"
Now let's say my script needs to figure out which inflection of the verb "to bear" could come next. I'm open to suggestions about this routine. But I need help mostly with step #2, rating fluency by tallying grammatical errors:
1. Use the Verb Conjugation methods in NodeBox Linguistics to come up with all conjugations of this verb:

   ['bear', 'bears', 'bearing', 'bore', 'borne']

2. Iterate over the possibilities, (shallowly) checking the grammar of the string resulting from existing_text + " " + possibility ("The old river bear", "The old river bears", etc.). Tally the error count for each construction; in this case the only construction that should raise an error, seemingly, is "The old river bear".

3. Wrapping up should be easy: of the possibilities with the lowest error count, select one at random.