How to find out programmatically whether a URL belongs to an ecommerce or non-ecommerce website?

In my project there is a module that takes a URL and determines whether it belongs to an "Ecommerce" or a "NON-Ecommerce" website.

I have tried the following approach:

  1. Classification with Apache Mahout: URL ---> take the HTML dump ---> preprocess the HTML dump ---> train and test a model. The steps are:

    a) removing all HTML tags

    b) removing stop words (a.k.a. common words) such as CDATA, href, value, and, of, between, etc.

    c) training the model and then testing it.
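
Steps a) and b) can be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not the code used in the project; `STOP_WORDS`, `TextExtractor` and `preprocess` are hypothetical names, and the stop-word list is only an illustrative assumption built from the examples above.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list (an assumption): common English words plus
# markup leftovers such as "cdata", "href" and "value".
STOP_WORDS = {"cdata", "href", "value", "and", "of", "between", "the", "a", "to", "in"}


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(html):
    """Step a) strip the tags, step b) drop stop words; returns a token list."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

The resulting token lists would then be written out in whatever input format the classifier expects.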

I used the following parameters for training:

bin/mahout trainclassifier \
  -i training-data \
  -o bayes-model \
  -type bayes -ng 1

Testing:

bin/mahout testclassifier \
  -d test-data \
  -m bayes-model \
  -type bayes -source hdfs -ng 1 -method sequential
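
To make the train/test cycle concrete, here is a rough pure-Python sketch of a unigram multinomial naive Bayes classifier with add-one smoothing. It is not Mahout's implementation and will not reproduce its numbers; it assumes that `-ng 1` means single-word features, and `train`, `classify` and `accuracy` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs: iterable of (label, tokens). Returns the model counts."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(model, tokens):
    """Returns the label with the highest log-probability for the tokens."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # ignore words never seen during training
                count = word_counts[label][t] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labelled_docs):
    """Fraction of documents whose predicted label matches the true label."""
    docs = list(labelled_docs)
    hits = sum(1 for label, tokens in docs if classify(model, tokens) == label)
    return hits / len(docs)
```

Accuracy should be measured on documents that were held out from training, as with the separate `test-data` directory above.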

I am getting 73% accuracy with the bayes algorithm and 52% with the cbayes algorithm.

I am thinking of improving the preprocessing stage by extracting signals that are typically found on ecommerce websites, such as a "Checkout" button, a PayPal link, prices or dollar symbols, and text like "Cash on delivery" or "30 day guarantee".
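
One possible way to pull out such signals is sketched below, again with the standard library only. The cue lists, the price pattern and the names `CueCollector` and `ecommerce_features` are assumptions for illustration; real pages would need a larger, tuned set of cues.

```python
import re
from html.parser import HTMLParser

# Hypothetical cue lists; extend and tune them on real training pages.
PRICE_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")
CUE_PHRASES = ("checkout", "add to cart", "cash on delivery", "30 day guarantee")
PAYMENT_HOSTS = ("paypal.com",)


class CueCollector(HTMLParser):
    """Gathers link targets and visible text in one pass over the page."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name in ("href", "action", "src"):
                self.urls.append(value.lower())
            elif name == "value":  # button captions such as value="Checkout"
                self.text.append(value)

    def handle_data(self, data):
        self.text.append(data)


def ecommerce_features(html):
    """Returns a dict of numeric features describing ecommerce cues."""
    collector = CueCollector()
    collector.feed(html)
    collector.close()
    # Collapse whitespace and hyphens so "30-day guarantee" also matches.
    text = re.sub(r"[\s-]+", " ", " ".join(collector.text)).lower()
    features = {
        "price_count": len(PRICE_RE.findall(text)),
        "payment_link": int(
            any(host in url for url in collector.urls for host in PAYMENT_HOSTS)
        ),
    }
    for phrase in CUE_PHRASES:
        features["has_" + phrase.replace(" ", "_")] = int(phrase in text)
    return features
```

These features could be fed to the classifier alongside the plain word tokens, or instead of them.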

Any suggestions on how to extract this info or any other ways to predict a site as Ecommerce or Non-Ecommerce?

Mulatto answered 22/1, 2012 at 14:56 Comment(1)
Please format your question a bit more carefully next time. And btw, 70% accuracy is quite good for a start. Nestle

I am quite astonished that you get such good accuracy with just plain HTML extraction and a Bayes classifier.

But you seem to be on the right track with the features like a checkout button and prices.

Here is a paper I found yesterday while reading about Yandex:

"To find out or to buy? Product review vs. Web shop classifier"

It is about how to distinguish these two kinds of sites and some of the techniques they used. They also used an SVM instead of naive Bayes.

Nestle answered 23/1, 2012 at 8:31 Comment(1)
Thanks, Thomas. The paper targets a use case similar to ours. Mulatto
