My current web-app project calls for a little NLP:
- Tokenizing text into sentences, via Punkt and similar;
- Breaking down the longer sentences by subordinate clause (often it’s on commas, except when it’s not);
- A Bayesian model fit for chunking paragraphs with an even feel, no orphans or widows and minimal awkward splits (maybe)
... much of which is a childishly easy task if you’ve got NLTK, which I do, sort of: the app backend is Django on Tornado, so you’d think doing these things would be a non-issue.
However, I have to give the user interactive feedback that depends on the tokenizers, so I need to tokenize the data clientside.
Right now I actually am using NLTK, via a REST API call to a Tornado process that wraps the NLTK function and little else. At the moment, the latency and concurrency of this ad-hoc service are obviously suboptimal, to put it politely. What I should be doing, I think, is getting my hands on CoffeeScript/JavaScript versions of this function, if not reimplementing it myself.
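For a sense of scale, here is a minimal sketch of what a hand-rolled clientside sentence splitter might look like in TypeScript. It is a regex heuristic, not Punkt: `splitSentences` and `ABBREVIATIONS` are hypothetical names made up for illustration, and the abbreviation list is hand-picked rather than learned from a corpus, so treat it as a starting point at best.

```typescript
// Naive clientside sentence splitter: a regex heuristic, NOT an implementation of Punkt.
// ABBREVIATIONS is a hypothetical, hand-picked placeholder list (Punkt learns these from data).
const ABBREVIATIONS = new Set(["mr", "mrs", "ms", "dr", "prof", "e.g", "i.e", "etc", "vs"]);

function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  // Candidate boundary: terminal punctuation, optional closing quote/bracket, then whitespace.
  const boundary = /[.!?]+["'”’)\]]*\s+/g;
  let match: RegExpExecArray | null;

  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length;
    // Last whitespace-delimited word before the punctuation, e.g. "Dr" or "e.g".
    const lastWord = text.slice(start, match.index).split(/\s+/).pop() ?? "";
    const isAbbreviation =
      match[0].startsWith(".") && ABBREVIATIONS.has(lastWord.toLowerCase());
    // Only accept the boundary if what follows looks like the start of a sentence.
    const next = text.charAt(end);
    const nextLooksLikeStart = next === "" || /[A-Z0-9"'“‘(\[]/.test(next);

    if (!isAbbreviation && nextLooksLikeStart) {
      const candidate = text.slice(start, end).trim();
      if (candidate.length > 0) {
        sentences.push(candidate);
      }
      start = end;
    }
  }

  // Whatever is left after the last accepted boundary is the final sentence.
  const rest = text.slice(start).trim();
  if (rest.length > 0) {
    sentences.push(rest);
  }
  return sentences;
}
```

Calling `splitSentences(textarea.value)` on each input event would avoid the round trip to the server entirely, at the cost of every edge case a fixed abbreviation list cannot anticipate.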
And but so then, from what I’ve seen, JavaScript hasn’t been considered cool long enough to have accumulated the not-just-web-specific, general-purpose library smörgåsbord one can find in C or Python (or even Erlang). NLTK of course is a standout project by anyone’s measure, but I only need a few percent of what it is packing.
But so now I am at a crossroads — I have to double down on either:
- The “learning scientific JavaScript technique fit for reimplementing algorithms I am Facebook friends with at best” plan, or:
- The less interesting but more deterministically doable “settle for tokenizing over the wire, but overcompensate for the dearth of speed and programming interestingness — ensure a beachball-free UX by elevating a function call into a robustly performant paragon of web-scale service architecture, making Facebook look like Google+” option.
Or something else entirely. What should I do, at least to start things off? That is my question. I’m open to solutions involving an atypical approach: as long as your recommendation is not distasteful (e.g. “use Silverlight”) and/or a time vortex (e.g. “get a computational linguistics PhD, you troglodyte”), I am game. Thank you in advance.