In my project there is a module that takes a URL and determines whether it belongs to an "Ecommerce" or "Non-Ecommerce" website.
I have tried the following approach, using Apache Mahout classification:
URL ---> take the HTML dump ---> pre-process the HTML dump ---> train and test the model
The pre-processing consists of:
a) removing all HTML tags
b) removing stop words (a.k.a. common words) such as CDATA, href, value, and, of, between, etc.
The model is then trained and tested.
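For concreteness, pre-processing steps (a) and (b) amount to something like the following Python sketch. It is only an illustration, not the exact code from the project: the `STOP_WORDS` set and the `preprocess` function are hypothetical names, the stop-word list is just the handful of words mentioned above, and dropping the contents of `<script>` and `<style>` blocks is an extra assumption.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list: only the examples mentioned in the post.
STOP_WORDS = {"cdata", "href", "value", "and", "of", "between"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents (an assumption)."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def preprocess(html):
    """Step (a): strip HTML tags. Step (b): lowercase, tokenize, drop stop words."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

Calling `preprocess(page_html)` returns the list of remaining tokens, which can be joined with spaces to form the text that goes into the training and test data.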
I used the following parameters for training:
bin/mahout trainclassifier \
-i training-data \
-o bayes-model \
-type bayes -ng 1
Testing:
bin/mahout testclassifier \
-d test-data \
-m bayes-model \
-type bayes -source hdfs -ng 1 -method sequential
I am getting 73% accuracy with the bayes algorithm and 52% with the cbayes algorithm.
I am thinking of improving the pre-processing stage by extracting information typically found on ecommerce websites, such as a "Checkout" button, a PayPal link, prices or dollar symbols, and text like "Cash on delivery" or "30 day guarantee".
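A naive first attempt at this could be a set of regular expressions run over the raw HTML, counting how often each cue appears. The sketch below is only an assumption about how such extraction might look: `CUE_PATTERNS` and `extract_cues` are hypothetical names, and the patterns cover just the cues listed above and are certainly incomplete.

```python
import re

# Hypothetical patterns, one per cue mentioned in the post; all are rough guesses.
CUE_PATTERNS = {
    "checkout": re.compile(r"\bcheck\s?out\b", re.IGNORECASE),
    "paypal": re.compile(r"pay\s?pal", re.IGNORECASE),
    "price": re.compile(r"\$\s?\d+(?:[.,]\d{2})?"),
    "cash_on_delivery": re.compile(r"cash\s+on\s+delivery", re.IGNORECASE),
    "guarantee": re.compile(r"\b\d+[\s-]*day\s+guarantee", re.IGNORECASE),
}


def extract_cues(html):
    """Return a dict mapping each cue name to its number of matches in the raw HTML."""
    return {name: len(pattern.findall(html)) for name, pattern in CUE_PATTERNS.items()}
```

The resulting counts could then be appended to the pre-processed text as synthetic tokens (for example a made-up token such as `HAS_CHECKOUT`) so that the classifier sees them as ordinary words. Matching on raw HTML is brittle, though, since the same cue can appear in links, image names and visible text at once.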
Any suggestions on how to extract this information, or on other ways to predict whether a site is Ecommerce or Non-Ecommerce?