How to find programmatically whether a URL belongs to an ecommerce or non-ecommerce website?

In a project, there is a module that takes a URL and determines whether it belongs to an "Ecommerce" or "NON-Ecommerce" website.

I have tried the following approach, using Apache Mahout classification:

URL ---> take the HTML dump ---> preprocess the HTML dump by

a) removing all HTML tags

b) removing stop words (a.k.a. common words) like CDATA, href, value, and, of, between, etc.

c) then training the model and testing it.
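For illustration, preprocessing steps a) and b) can be sketched in Python as below. This is only a minimal sketch using the standard library, not the actual pipeline fed to Mahout; the names `TextExtractor`, `preprocess` and `STOP_WORDS` are hypothetical, and the stop-word list contains only the examples mentioned above.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list (assumption): only the examples from the question.
STOP_WORDS = {"cdata", "href", "value", "and", "of", "between"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(html):
    """Step a) strip the HTML tags, step b) drop the stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token list would then be written out in whatever input format the classifier expects.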

I used the following parameters for training:

bin/mahout trainclassifier \
  -i training-data \
  -o bayes-model \
  -type bayes -ng 1

Testing:

/bin/mahout testclassifier \
  -d test-data \
  -m bayes-model \
  -type bayes -source hdfs -ng 1 -method sequential

I am getting 73% accuracy with the bayes algorithm and 52% with the cbayes algorithm.

I am thinking of improving the preprocessing stage by extracting signals that are typically found on ecommerce websites, such as a "Checkout" button, a PayPal link, prices / dollar symbols, and text like "Cash on delivery", "30 day guarantee", etc.
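As a rough illustration of the kind of extraction I have in mind, the signals could be counted with regular expressions on the raw HTML. This is only a sketch under assumptions: the patterns, the names `SIGNALS`, `extract_features` and `looks_like_ecommerce`, and the `min_signals` threshold are all hypothetical and untuned.

```python
import re

# Hypothetical signal patterns (assumption); each maps a feature name to a regex.
SIGNALS = {
    "checkout": re.compile(r"\bcheck\s?out\b", re.I),
    "cart": re.compile(r"\b(?:add to cart|shopping cart|add to basket)\b", re.I),
    "paypal": re.compile(r"paypal", re.I),
    "price": re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?"),
    "cash_on_delivery": re.compile(r"\bcash on delivery\b", re.I),
    "guarantee": re.compile(r"\b\d+[- ]day (?:money[- ]back )?guarantee\b", re.I),
}


def extract_features(html):
    """Count how often each signal occurs in the raw HTML."""
    return {name: len(pattern.findall(html)) for name, pattern in SIGNALS.items()}


def looks_like_ecommerce(html, min_signals=2):
    """Naive rule: at least `min_signals` distinct signals are present."""
    features = extract_features(html)
    present = sum(1 for count in features.values() if count > 0)
    return present >= min_signals
```

The counts from `extract_features` could also be used as extra features for the classifier instead of the hard-coded rule in `looks_like_ecommerce`.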

Any suggestions on how to extract this info, or any other ways to predict whether a site is Ecommerce or Non-Ecommerce?
