How to classify URLs? What are URL features? How to select and extract features from a URL?

I have just started working on a classification problem. It's a two-class problem: my trained machine learning model has to predict whether to allow a URL or block it.

My questions are specific:

  1. How to classify URLs? Should I use standard text analysis methods?
  2. What are the features of a URL?
  3. How to select and extract features from a URL?
Gaudery answered 20/10, 2014 at 0:22 Comment(2)
I have a dataset of URLs. I want to train a model to classify each URL as adult or non-adult content. The model is for filtering: I want to block objectionable webpages using only the URL, without downloading the page contents or other features such as webpage metadata. So this is a two-class problem. My question is: how can we classify webpages using only URL features? My problem is deciding which feature extraction methods are best to use. Gaudery
Also, are there any API libraries with built-in functions for this purpose? I am new to machine learning, so please correct me where I am wrong. I will be using Python. Gaudery

I assume you do not have access to the content behind the URL, so you can only extract features from the URL string itself. Otherwise, it makes more sense to use the page content.

Here are some features I would try. See this paper for more ideas:

  1. All URL components. For example, this page has the URL below:

    https://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features

Tokens that occur in different parts of a URL contribute differently to the classification. In this case, the last part, once tokenized, yields the most useful features for this page (e.g., classify, urls, select, extract, features). The components are:

 * stackoverflow
 * com
 * questions
 * 26456904
 * how to classify urls what are urls features how to select and extract features
  2. The length of the URL.
  3. n-grams (2-grams shown as examples below):
    • stackoverflow-com
    • com-questions
    • questions-26456904
    • 26456904-how
    • how-to
    • ....
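
The three feature families above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a definitive implementation: the function name `url_features`, the returned dictionary keys, and the shortened example URL are assumptions made for the sketch, and splitting on every non-alphanumeric character is just one possible tokenization.

```python
import re
from urllib.parse import urlsplit


def url_features(url):
    """Return component, word, 2-gram and length features for a URL string.

    Hypothetical helper for illustration; the key names are not from any library.
    """
    parts = urlsplit(url)
    # Host labels and path segments, in order; the scheme is ignored.
    host_tokens = [t for t in (parts.hostname or "").split(".") if t]
    path_segments = [s for s in parts.path.lower().split("/") if s]
    components = host_tokens + path_segments

    # Word-level tokens: split every component on non-alphanumeric characters.
    words = [w for c in components for w in re.split(r"[^a-z0-9]+", c) if w]

    # 2-grams over consecutive word tokens, joined with "-".
    bigrams = [f"{a}-{b}" for a, b in zip(words, words[1:])]

    return {
        "components": components,
        "words": words,
        "bigrams": bigrams,
        "length": len(url),
    }


# Assumed example: a shortened form of this question's URL.
example = "https://stackoverflow.com/questions/26456904/how-to-classify-urls"
features = url_features(example)
print(features["components"])
print(features["bigrams"][:5])
print(features["length"])
```

For this example the first five 2-grams come out as `stackoverflow-com`, `com-questions`, `questions-26456904`, `26456904-how` and `how-to`, the same sequence as in the list above.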
Teary answered 21/10, 2014 at 0:6 Comment(4)
greeness, you explained it nicely. I have read some papers that classified webpages using only URL features. I am a bit confused about extracting features from simple URLs. For example, www.google.com does not have enough features. If I decide to extract 6 features from every URL in the dataset when training the algorithm, what happens when a simple URL comes along? Gaudery
Most of the features you are using will be sparse. Instead of 6 features, you probably mean 6 types of features, or 6 feature families. In the google.com example, the only useful feature is the token "google", which should have a strong connection to a label like "search engine". That connection should be learned from your labeled dataset, so you don't need to worry about insufficient features in this example. Teary
Thanks, greeness. So would I tell my estimator/classifier that tokens at the start of an example have more weight than tokens at the end of a lengthy example? Gaudery
It's better to let your machine learning model figure that out. Teary
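
To make the sparsity point from the comments concrete, here is a minimal sketch, assuming a simple dictionary representation in which each URL stores only the features it actually contains. The function name `sparse_features`, the `token=` and `bigram=` key prefixes, and the `example.com` URL are hypothetical choices for illustration. A short URL such as www.google.com simply ends up with fewer entries, and the weight of each entry is left for the classifier to learn from labeled data.

```python
import re
from collections import Counter


def sparse_features(url):
    """Map a URL to a dict holding only the features that are present.

    Hypothetical helper; features absent from the dict are implicitly zero.
    """
    # Drop a leading scheme such as "https://" if there is one.
    bare = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower())
    tokens = [t for t in re.split(r"[^a-z0-9]+", bare) if t]

    feats = Counter(f"token={t}" for t in tokens)
    feats.update(f"bigram={a}-{b}" for a, b in zip(tokens, tokens[1:]))
    return dict(feats)


short = sparse_features("www.google.com")
long_ = sparse_features("https://example.com/questions/how-to-classify-urls")

print(len(short), sorted(short))
print(len(long_))
```

Dictionaries of this shape can then be turned into a sparse matrix for training, for instance with scikit-learn's `DictVectorizer`, though any equivalent vectorizer would do.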
