Use brain.js neural network to do text analysis
B

3

15

I'm trying to do some text analysis to determine if a given string is... talking about politics. I'm thinking I could create a neural network where the input is either a string or a list of words (ordering might matter?) and the output is whether the string is about politics.

However, the brain.js library only accepts inputs that are a number between 0 and 1 or an array of numbers between 0 and 1. How can I encode my data so that I can achieve this task?

Blinders answered 5/5, 2016 at 6:10 Comment(0)
H
17
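new brain.recurrent.LSTM();

This does the trick: the recurrent LSTM network in brain.js can be trained directly on strings, so you do not have to convert the text to numbers yourself.

Example: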

// Load brain.js and create a recurrent LSTM network that accepts strings.
const brain = require('brain.js');
const net = new brain.recurrent.LSTM();
net.train([
  {input: "my unit-tests failed.", output: "software"},
  {input: "tried the program, but it was buggy.", output: "software"},
  {input: "i need a new power supply.", output: "hardware"},
  {input: "the drive has a 2TB capacity.", output: "hardware"},
  {input: "unit-tests", output: "software"},
  {input: "program", output: "software"},
  {input: "power supply", output: "hardware"},
  {input: "drive", output: "hardware"},
]);

console.log("output = " + net.run("drive"));

Output:

output = hardware

See https://github.com/BrainJS/brain.js/issues/65 for a clear explanation of brain.recurrent.LSTM() and how to use it.

Housebound answered 17/3, 2018 at 19:18 Comment(6)
The reason this works, and works well, is that each character is represented by a neuron in the net. Once you offset a representation of the net's values via a representative neuron, you can feed pretty much anything into a neural network. - Radiotelegram
Hear that? ...that's the sound of my mind exploding. Thank you for your answer! - Lens
@Lens Glad that it helped. - Housebound
Is there a known limit on how many categories you can have (two in this case), beyond which this approach fails? - Carreon
@RobertPlummer I would not call this 'working well': if you input "buy me a driver", it just prints out a text character. - Ellamaeellan
@Ellamaeellan It does work well; the data provided above just isn't enough. Add more data that is descriptive and accurate and you get a better-trained model. - Schiffman
N
2

You need to come up with a model that converts your data to a list of tuples [input, expected_output], where input is a list of numbers between 0 and 1 representing the given words, and expected_output is one number between 0 and 1 representing how well the sentence matches what you are analyzing for (being political). For example, the sentence "The quick brown cat jumped over the lazy dog" should get a score of zero, while a sentence like "President shakes off corruption scandal" should get a score very close to one.

As you can see, your biggest challenge is actually obtaining the data and cleaning it. Converting it to the training format is easy: you could just hash words into numbers between 0 and 1. Make sure to handle different casing and punctuation, and you might want to stem words to get the best results.
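A minimal sketch of the hashing idea, in plain JavaScript. The helper names (tokenize, hashWord, encode), the choice of FNV-1a as the hash, and the fixed length of 8 words are all assumptions made for illustration; none of them come from brain.js. Stemming is left out. Whether one hashed number per word is a useful representation is disputed in the comments below, so treat this as something to experiment with rather than a recommended encoding.

```javascript
// Hypothetical helpers for illustration; not part of brain.js.

// Lowercase the text, replace punctuation with spaces, split into words.
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
}

// FNV-1a 32-bit hash of a word, scaled into the range [0, 1).
function hashWord(word) {
  let h = 0x811c9dc5;
  for (let i = 0; i < word.length; i++) {
    h ^= word.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 0x100000000;
}

// Encode a sentence as exactly `size` numbers: truncate long sentences,
// pad short ones with 0.
function encode(text, size) {
  const values = tokenize(text).map(hashWord).slice(0, size);
  while (values.length < size) values.push(0);
  return values;
}

// Training pairs: numeric input array, output score (0 = not political, 1 = political).
const trainingData = [
  { input: encode('The quick brown cat jumped over the lazy dog', 8), output: [0] },
  { input: encode('President shakes off corruption scandal', 8), output: [1] },
];
```

Each entry pairs a fixed-length numeric input with a score, which is the kind of data the answer describes feeding to the network.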

One more thing, you can use a term relevance algorithm to rank the importance of words in your training data set, so that you can choose only the top k relevant words in a sentence, since you need uniform data size for each sentence.
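One way to pick the top k words is TF-IDF, sketched below. The answer does not name a specific term relevance algorithm, so TF-IDF is an assumption here, as are the function name topKWords and the sample sentences.

```javascript
// Hypothetical helper: keep the k highest-scoring words of each document,
// scored by TF-IDF (term frequency * log(numDocs / documentFrequency)).
function topKWords(documents, k) {
  // Lowercase and split each document into alphanumeric words.
  const docs = documents.map((d) => d.toLowerCase().match(/[a-z0-9]+/g) || []);

  // Document frequency: in how many documents does each word appear?
  const df = new Map();
  for (const words of docs) {
    for (const w of new Set(words)) df.set(w, (df.get(w) || 0) + 1);
  }

  return docs.map((words) => {
    // Term frequency within this document.
    const tf = new Map();
    for (const w of words) tf.set(w, (tf.get(w) || 0) + 1);

    return [...tf.entries()]
      .map(([word, count]) => ({
        word,
        score: (count / words.length) * Math.log(docs.length / df.get(word)),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((entry) => entry.word);
  });
}

const sentences = [
  'the president shakes off the corruption scandal',
  'the quick brown cat jumped over the lazy dog',
  'the senate vote on the budget was delayed',
];
const top = topKWords(sentences, 3);
```

Words that occur in every document, such as "the" in these samples, get a score of zero and drop out first, leaving a uniform k words per sentence.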

Notum answered 5/5, 2016 at 6:26 Comment(4)
I don't think this would work, because the number between 0 and 1 is supposed to be continuous. That means "fox" might hash to 0.492 and "president" might hash to 0.493; to the neural net these inputs are really similar, but in reality they aren't. I'm looking into NLP now. - Blinders
@arasmussen It doesn't matter if the hashes are close for different words, as long as they're different. The NN only needs to get different numbers for different words, then it'll do the association on its own. Your only problem here is if "fox" and "president" somehow hash to the exact same value, but you can get around that if you choose a good hash function. - Notum
I don't think that's correct. Do you have a source? - Blinders
Unfortunately I don't, it's just my intuition. An NN isn't the best tool for this sort of thing anyway, but it would be good to give it a try and see what comes up. Have some fun using NLTK or similar tools to lemmatize the text, feed it to the NN, and see what comes out. - Notum
B
1

So apparently text doesn't coerce very well to NN input.

A Naive Bayes Classifier looks like exactly what I want. https://github.com/harthur/classifier
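To show what a Naive Bayes classifier does with text, here is a minimal self-contained sketch in plain JavaScript. It is not the API of the linked library; the class name NaiveBayes, its train/classify methods, and the tiny training set are assumptions made for illustration. It uses word counts with add-one smoothing.

```javascript
// Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
// Illustrative only; not the API of the linked classifier library.
class NaiveBayes {
  constructor() {
    this.wordCounts = {};        // label -> Map(word -> count)
    this.totalWords = {};        // label -> total number of words seen
    this.docCounts = {};         // label -> number of training documents
    this.vocabulary = new Set(); // every distinct word seen in training
    this.totalDocs = 0;
  }

  static tokenize(text) {
    return text.toLowerCase().match(/[a-z0-9]+/g) || [];
  }

  train(text, label) {
    if (!this.wordCounts[label]) {
      this.wordCounts[label] = new Map();
      this.totalWords[label] = 0;
      this.docCounts[label] = 0;
    }
    this.docCounts[label] += 1;
    this.totalDocs += 1;
    for (const w of NaiveBayes.tokenize(text)) {
      const counts = this.wordCounts[label];
      counts.set(w, (counts.get(w) || 0) + 1);
      this.totalWords[label] += 1;
      this.vocabulary.add(w);
    }
  }

  // Return the label with the highest log-probability for the text.
  classify(text) {
    const words = NaiveBayes.tokenize(text);
    let best = null;
    let bestScore = -Infinity;
    for (const label of Object.keys(this.docCounts)) {
      // Log prior: share of training documents carrying this label.
      let score = Math.log(this.docCounts[label] / this.totalDocs);
      for (const w of words) {
        const count = this.wordCounts[label].get(w) || 0;
        // Smoothed log likelihood of the word given the label.
        score += Math.log((count + 1) / (this.totalWords[label] + this.vocabulary.size));
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

const nb = new NaiveBayes();
nb.train('President shakes off corruption scandal', 'politics');
nb.train('Senate votes on the new election law', 'politics');
nb.train('The quick brown cat jumped over the lazy dog', 'other');
nb.train('I need a new power supply for my computer', 'other');

const label = nb.classify('corruption scandal in the senate'); // 'politics'
```

Unlike the hashed-number encoding, each word is treated as its own discrete feature here, so no two words are ever "close" to each other by accident.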

Blinders answered 5/5, 2016 at 8:14 Comment(0)
