Possible Duplicate:
Text Classification into Categories
I am currently working on a solution to determine the type of food served by each of the 10k restaurants in a database, based on their descriptions. I'm using lists of keywords to decide which kind of food is being served.
I read a little bit about machine learning but I have no practical experience with it at all. Can anyone explain to me if/why it would be a better solution to a simple problem like this? I find accuracy more important than performance!
Simplified example:
["China", "Chinese", "Rice", "Noodles", "Soybeans"]
["Belgium", "Belgian", "Fries", "Waffles", "Waterzooi"]
A possible description could be:
"Hong's Garden Restaurant offers savory, reasonably priced Chinese to our customers. If you find that you have a sudden craving for rice, noodles or soybeans at 8 o’clock on a Saturday evening, don’t worry! We’re open seven days a week and offer carryout service. You can get fries here as well!"
… nltk to get "nouns", and then use pybrain to train a neural net, but ultimately, were this for commercial purposes and I couldn't rely on machine learning to be completely accurate, I'd be inclined to think about splitting the DB into chunks of 500 and employing 20 people for a day's work – Sleeper