Counting words inside a webpage

I need to count the words on a webpage using Python 3. Which module should I use? urllib?

Here is my code:

def web():
    f =("urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
    lu = f.read()
    print(lu)
Nada answered 18/9, 2017 at 4:15 Comment(4)
The code above only reads the webpage; it doesn't count anything yet. I just want to get access to the distinct words first. - Nada
You could use bs4 to get all the text and then find its len. - Jeffereyjefferies
For starters, you should remove the (" from f =("urllib so that it says f = urllib. - Pyrexia
My code returned the HTML markup as well, so I need to remove it. How may I do that? - Nada
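
On the last question (removing the HTML markup), here is a minimal sketch that uses only the standard library's `html.parser`. The names `TextExtractor`, `html_to_text` and the `sample` string are made up for illustration and do not come from the thread. It assumes that skipping `<script>` and `<style>` contents is enough to get the readable text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space, then collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical sample markup, not taken from the page in the question.
sample = (
    "<html><body><h1>Abraham Lincoln</h1>"
    "<p>Four score and <b>seven</b> years ago.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))  # Abraham Lincoln Four score and seven years ago.
```

To use it on a real page, the bytes returned by `urllib.request.urlopen(url).read()` would first need decoding, for example with `.decode("utf-8", errors="replace")` (the encoding is an assumption). The distinct words are then roughly `set(html_to_text(page).lower().split())`.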

The commented code below is a good starting point for counting words within a web page:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Fetch the page and parse it (the parser is named explicitly to avoid bs4's warning)
r = requests.get("https://en.wikiquote.org/wiki/Khalil_Gibran")
soup = BeautifulSoup(r.content, "html.parser")

# Count the words within paragraphs
text_p = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
c_p = Counter(x.rstrip(punctuation).lower() for y in text_p for x in y.split())

# Count the words within divs
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter(x.rstrip(punctuation).lower() for y in text_div for x in y.split())

# Sum the two counters and list the word counts from most to least common
total = c_div + c_p
list_most_common_words = total.most_common()

For example, to get the 10 most common words:

total.most_common(10)

Which in this case outputs:

In [100]: total.most_common(10)
Out[100]: 
[('the', 2097),
 ('and', 1651),
 ('of', 998),
 ('in', 625),
 ('i', 592),
 ('a', 529),
 ('to', 529),
 ('that', 426),
 ('is', 369),
 ('my', 365)]
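
One caveat about the tokenizing step: `rstrip(punctuation)` removes punctuation only from the end of each token, so a word preceded by a quote or bracket is counted under a different key. The sketch below shows one possible alternative using a regular expression. The pattern `WORD_RE`, the helper `count_words` and the `sample` sentence are assumptions for illustration, and the pattern only covers ASCII letters and digits.

```python
import re
from collections import Counter
from string import punctuation

# Letters/digits, optionally joined by apostrophes (e.g. "don't").
# ASCII-only: an assumption that the text is plain English.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def count_words(text):
    """Count lowercase words, ignoring punctuation on both sides."""
    return Counter(WORD_RE.findall(text.lower()))


# Hypothetical sample sentence.
sample = 'He said: "The prophet, the poet, and the madman." (The end.)'

# Tokenizing as in the answer: only trailing punctuation is stripped.
naive = Counter(w.rstrip(punctuation).lower() for w in sample.split())
counts = count_words(sample)

print(naive["the"])   # 2, because '"the' and '(the' are stored as separate keys
print(counts["the"])  # 4
```

Whether apostrophes and digits should count as part of a word depends on the use case, so the pattern is only a starting point.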
Koniology answered 28/9, 2017 at 15:54 Comment(10)
I don't know who downvoted this question. A downvote for no reason. - Nada
Not me. By the way, if you found my answer useful, please consider upvoting/accepting it. - Koniology
My vote is not counted because of inadequate reputation. However, you have my upvote. I was just wondering whether I can check Python code for plagiarism, but I am not getting a response from anyone. - Nada
You can accept answers with the tick under the up and down vote buttons. - Koniology
Okay, I just did that, and thank you. Do you know of any free online websites to check code for plagiarism, if I may ask? - Nada
No, sorry, I don't know. You may ask a new question for that. Good luck. - Koniology
I found that the above method might not output exact numbers, as a paragraph can be within a div and vice versa. Not sure how it works, but I found an interesting online tool for checking word counts within a website: wordcounter.net/website-word-count - Koniology
Using this with the example URL given (en.wikiquote.org/wiki/Kahlil_Gibran) shows it vastly overstates the word counts. E.g. pressing CTRL+F and searching for "the" on the actual page returns only 687 results at the time of writing, while this says "the" appears 2139 times. - Drawstring
@Drawstring you are right. The problem is that many web pages have 'div' and 'p' elements nested one within the other, as I mentioned in my previous comment. - Koniology
Interesting. Another tool I found that does the exact same thing is a pretty accurate webpage word counter: randomtools.io/webpage-word-counter - Laudanum
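
The overcounting discussed in these comments can be reproduced on a tiny example. The following is a standard-library-only illustration, not the bs4 code from the answer. The class `NestingCounter` and the `sample` markup are hypothetical. It assumes every `<p>` and `<div>` is explicitly closed, which real pages do not guarantee. Collecting the text under each `<p>` and each `<div>` and summing the counts effectively counts a text node once per enclosing `<p>` or `<div>`, which is what `per_ancestor` mimics.

```python
from collections import Counter
from html.parser import HTMLParser


class NestingCounter(HTMLParser):
    """Counts words once per text node and once per enclosing <p>/<div>."""

    def __init__(self):
        super().__init__()
        self.open_containers = 0       # currently open <p> and <div> tags
        self.once = Counter()          # each text node counted a single time
        self.per_ancestor = Counter()  # mimics summing per-<p> and per-<div> counts

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "div"):
            self.open_containers += 1

    def handle_endtag(self, tag):
        if tag in ("p", "div") and self.open_containers > 0:
            self.open_containers -= 1

    def handle_data(self, data):
        words = data.lower().split()
        self.once.update(words)
        # One extra count for every <p>/<div> that encloses this text.
        for _ in range(self.open_containers):
            self.per_ancestor.update(words)


# Hypothetical markup: one paragraph nested inside two divs.
sample = "<div><div><p>the cat and the hat</p></div></div>"
parser = NestingCounter()
parser.feed(sample)
parser.close()

print(parser.once["the"])          # 2
print(parser.per_ancestor["the"])  # 6
```

With bs4 itself, one way to count each piece of text only once would be to tokenize the string returned by `soup.get_text()` instead of looping over `p` and `div` elements separately.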
