Basic NLP in CoffeeScript or JavaScript -- Punkt tokenization, simple trained Bayes models -- where to start? [closed]

Asked 15/3, 2012 at 13:54 Answered 15/2, 2019 at 18:31

Solved javascript nlp coffeescript user-experience tokenize

My current web-app project calls for a little NLP:

Tokenizing text into sentences, via Punkt and similar;
Breaking down the longer sentences by subordinate clause (often it’s on commas except when it’s not);
A Bayesian model fit for chunking paragraphs with an even feel, no orphans or widows and minimal awkward splits (maybe).

... much of which is a childishly easy task if you’ve got NLTK, which I do, sort of: the app backend is Django on Tornado, so you’d think doing these things would be a non-issue.

However, I’ve got to give the user interactive feedback, which is what the tokenizers are for, so I need to tokenize the data client-side.

Right now I actually am using NLTK, via a REST API call to a Tornado process that wraps the NLTK function and little else. At the moment, the latency and concurrency of this ad-hoc service are obviously suboptimal, to put it politely. What I should be doing, I think, is getting my hands on CoffeeScript/JavaScript versions of this function, if not reimplementing it myself.

And but so then from what I've seen, JavaScript hasn’t been considered cool long enough to have accumulated the not-just-web-specific, general-purpose library smorgasbord one can find in C or Python (or even Erlang). NLTK of course is a standout project by anyone’s measure, but I only need a few percent of what it is packing.

But so now I am at a crossroads — I have to double down on either:

The “learning scientific JavaScript technique fit for reimplementing algorithms I am Facebook friends with at best” plan, or:
The less interesting but more deterministically doable “settle for tokenizing over the wire, but overcompensate for the dearth of speed and programming interestingness — ensure a beachball-free UX by elevating a function call into a robustly performant paragon of web-scale service architecture, making Facebook look like Google+” option.

Or something else entirely. What should I do? Like to start things off. This is my question. I’m open to solutions involving an atypical approach — as long as your recommendation is not distasteful (e.g. “use Silverlight”) and/or a time vortex (e.g. “get a computational linguistics PhD you troglodyte”) I am game. Thank you in advance.

Conclave answered 15/3, 2012 at 13:54 Comment(1)

Another thing I forgot to mention that might factor against a client-side JavaScripty solution: NLTK's models, like other statistical models I've encountered, often need to sit on top of a giant pile of training data to work (the Punkt tokenizer I'm using has such a requirement)... I could be wrong about this (in fact that would be nice), but obviously a solution in which the client has to download a corpus isn't feasible here. – Conclave 15/3, 2012 at 14:0

I think that, as you wrote in the comment, the amount of data needed for efficient algorithms to run will eventually prevent you from doing things client-side. Even basic processing requires lots of data, for instance bigram/trigram frequencies. Symbolic approaches need significant data too (grammar rules, dictionaries, etc.). In my experience, you can't run a good NLP process without at the very least 3MB to 5MB of data, which I think is too big for today's clients.
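
To make the data-size point concrete, here is a minimal sketch of how a bigram frequency table is built. The function name `bigramCounts` and the ASCII-only word regex are assumptions for illustration, not part of any library mentioned here. The table holds one entry per distinct adjacent word pair, so it tends to grow quickly with the size of the training text.

```javascript
// Illustrative sketch only: count adjacent word pairs (bigrams) in a text.
// Assumption: words are runs of ASCII letters and apostrophes, lowercased.
function bigramCounts(text) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  const counts = new Map();
  for (let i = 0; i + 1 < words.length; i++) {
    const key = words[i] + " " + words[i + 1];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}
```

Calling `bigramCounts(corpus).size` on a sample of your own text gives a rough feel for how many entries a client would have to download.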

So I would do things over the wire. For that I would recommend an asynchronous/push approach, perhaps using Faye or Socket.io. I'm sure you can achieve a fluid UX as long as the user is not stuck while the client is waiting for the server to process the text.
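
One way to keep such an asynchronous setup from ever showing stale results is to deliver only the response to the newest request. A rough sketch follows; `createLatestOnly`, `requestFn` and `onResult` are hypothetical names, and `requestFn` stands in for whatever transport (Faye, Socket.io, plain XHR) actually carries the text to the server.

```javascript
// Illustrative sketch only: wrap a callback-style request function so that
// only the response to the most recent request reaches the UI.
// Assumption: requestFn(text, callback) calls callback(result) exactly once.
function createLatestOnly(requestFn, onResult) {
  let latestId = 0;
  return function submit(text) {
    const id = ++latestId;
    requestFn(text, function (result) {
      // Drop responses that were overtaken by a newer request.
      if (id === latestId) {
        onResult(result);
      }
    });
  };
}
```

In a real page, `submit` would typically be called from a debounced input handler, so that the server is not hit on every keystroke.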

Loculus answered 15/3, 2012 at 15:14 Comment(1)

Indeed, it does look like the way to go. Thanks especially for recommending Faye -- it's something I hadn't looked at, but it appears to be a good match for this. – Conclave 23/3, 2012 at 8:13

There is a quite nice natural language processing library for Node.js called natural. It's not currently built for running in the browser, but the authors have stated that they want to fix that. Most of it might even work already, using something like browserify or Require.JS.

Topminnow answered 15/3, 2012 at 16:6 Comment(1)

Thanks for the tips: on natural, which looks like a good package to watch; and also browserify, of which I was also unaware. – Conclave 23/3, 2012 at 8:11

winkjs has several packages for natural language processing:

Multilingual tokenizer that tags each token with its type such as word, number, email, mention, etc.
English Part-of-speech (POS) tagger,
Language agnostic named entity recognizer,
Useful functions for common NLP tasks and many more e.g. sentiment analysis, lemmatizer, naive bayes text classifier, etc.

It has packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS. The code is thoroughly documented for easy human comprehension and has a test coverage of ~100% for reliability to build production grade solutions.

Lindeberg answered 15/2, 2019 at 18:31 Comment(0)

I think you should deploy a separate service, independent of the rest of your app, that does most of the work server-side but can send multiple options to the client depending on what it thinks the client will type next. When I read your requirements, I think of the search autocomplete feature of sites like Google, Quora and Yelp. You might have typed only 3 or 4 characters into the search box, but these services have already sent multiple query suggestions based on what they think you will type next.

If you are dynamically tokenizing text, you can have some sort of n-gram model (or another, more sophisticated language model) guess when the sentence is going to end and tell the frontend what to do for the k most likely future outcomes. Basically, have a backend service that can precompute/cache lots of outcomes, and a semi-smart frontend that checks whether the current state of the user input matches one of the predicted states sent by the backend a few hundred milliseconds earlier. On a match it can seemingly instantaneously do the right thing in front of the user, without hanging their browser on memory- or computation-intensive work.

The two options you have presented are:

1) doing everything client-side, which might be fast but is very complicated due to the lack of existing NLP libraries for JavaScript;

2) doing everything server-side, which might be easier but makes your application seem laggy to the user.

I am suggesting a third:

3) doing everything server-side but thinking ahead a few steps and sending multiple options to the client, so that the work gets done where it's easier for you to do it, but to the client it feels like it's happening instantaneously.
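
A minimal sketch of the frontend half of option 3, assuming the backend pushes its k predicted outcomes as an array of `{ input, result }` pairs. That message shape and the name `createPredictionCache` are assumptions made for illustration.

```javascript
// Illustrative sketch only: client-side cache of server-predicted outcomes.
// Assumption: the server sends [{ input: "...", result: ... }, ...], where
// `input` is a predicted future state of the text box and `result` is the
// precomputed answer (e.g. sentence tokens) for that state.
function createPredictionCache() {
  const predictions = new Map();
  return {
    // Replace the cache with the newest batch of predictions.
    store(outcomes) {
      predictions.clear();
      for (const { input, result } of outcomes) {
        predictions.set(input, result);
      }
    },
    // Return the precomputed result if the current input was predicted,
    // otherwise undefined (the caller then falls back to asking the server).
    lookup(currentInput) {
      return predictions.get(currentInput);
    },
  };
}
```

On each input event the frontend would call `lookup` first and only wait on the server when it returns `undefined`.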

Idiot answered 16/3, 2012 at 4:42 Comment(1)

Can you clarify what you mean by "multiple options to the client"? You used the phrase twice. My question is about the strategy w/r/t where to put the NLP function call(s), but as far as my current implementation goes, the options to those calls haven't been at issue. – Conclave 24/3, 2012 at 15:27
