counting words inside a webpage - McMap

counting words inside a webpage

Asked 18/9, 2017 at 4:15 Answered 28/9, 2017 at 15:54

Solved python-3.x urllib2 urllib urllib3

I need to count the words inside a webpage using Python 3. Which module should I use? urllib?

Here is my code:

def web():
    f =("urllib.request.urlopen("https://americancivilwar.com/north/lincoln.html")
    lu = f.read()
    print(lu)

Nada answered 18/9, 2017 at 4:15 Comment(4)

The above code only reads the webpage, it does not count anything yet. I just want to get access to the distinct words first. – Nada 18/9, 2017 at 4:17

You could use bs4 and get all the text and then find the len of it – Jeffereyjefferies 18/9, 2017 at 4:21

For starters you should remove the (" from f =("urllib so that it says f = urllib. – Pyrexia 18/9, 2017 at 4:21

My code returned the HTML markup as well, so I need to remove it. How can I do that? – Nada 18/9, 2017 at 15:21
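As a rough sketch of the two suggestions above (drop the stray `("` and strip the markup), the standard library alone can fetch a page and keep only its text. The names `TextExtractor`, `html_to_text` and `fetch_text` are made up for this illustration, and decoding the response as UTF-8 is an assumption about the page:

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is not inside a skipped element
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space, then collapse runs of whitespace.
    # Caveat: a word split by an inline tag would come out as two words.
    return " ".join(" ".join(parser.chunks).split())


def fetch_text(url):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return html_to_text(html)
```

Calling `len(fetch_text(url).split())` on the URL from the question would then give an approximate word count for the page.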

The self-explanatory code below gives you a good starting point for counting words within a web page:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Fetch the page and parse it (naming the parser explicitly avoids a bs4 warning)
r = requests.get("https://en.wikiquote.org/wiki/Khalil_Gibran")
soup = BeautifulSoup(r.content, "html.parser")

# Count the words within paragraphs
text_p = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
c_p = Counter(x.rstrip(punctuation).lower() for y in text_p for x in y.split())

# Count the words within divs
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter(x.rstrip(punctuation).lower() for y in text_div for x in y.split())

# Sum the two counters and list the words from most to least common
total = c_div + c_p
list_most_common_words = total.most_common()

For example, to get the 10 most common words:

total.most_common(10)

Which in this case outputs:

In [100]: total.most_common(10)
Out[100]: 
[('the', 2097),
 ('and', 1651),
 ('of', 998),
 ('in', 625),
 ('i', 592),
 ('a', 529),
 ('to', 529),
 ('that', 426),
 ('is', 369),
 ('my', 365)]

Koniology answered 28/9, 2017 at 15:54 Comment(10)

I don't know who downvoted this question. A downvote for no reason. – Nada 8/11, 2017 at 2:10

Not me. By the way, if you found my answer useful, please consider upvoting/accepting it. – Koniology 8/11, 2017 at 2:13

My vote is not counted because of inadequate reputation. However, you have my upvote. I was just wondering if I can check Python code for plagiarism, but I am not getting any response from anyone. – Nada 8/11, 2017 at 2:28

You can accept answers with the tick under the up and down vote – Koniology 8/11, 2017 at 2:29

Okay, I just did that, and thank you. Do you know of any free online websites for checking code for plagiarism, if I may ask? – Nada 8/11, 2017 at 3:26

No. Sorry. I don't know. You may ask a new question for that. Good luck. – Koniology 8/11, 2017 at 12:29

I found that the above method may not output exact numbers, because a paragraph can be inside a div and vice versa. Not sure how it works, but I found an interesting online tool for checking word counts within a website: wordcounter.net/website-word-count – Koniology 9/5, 2019 at 19:50

Using this with the example URL given (en.wikiquote.org/wiki/Kahlil_Gibran) shows it vastly overstates the words. E.g. CTRL+F and searching "the" on the actual page only returns 687 results at the time of writing, while this says "the" appears 2139 times. – Drawstring 30/12, 2020 at 2:32

@Drawstring you are right. The problem is that many web pages have 'div' and 'p' elements nested one within the other, as I mentioned in my previous comment. – Koniology 6/1, 2021 at 22:55

Interesting. Another tool I found that does the same thing is a pretty accurate webpage word counter: randomtools.io/webpage-word-counter – Laudanum 6/3, 2022 at 9:7
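One way to avoid the double counting discussed in these comments is to visit each text node exactly once instead of adding separate 'p' and 'div' tallies. The sketch below does that with the standard-library parser rather than BeautifulSoup. `WordCounter` and `count_words` are hypothetical names, and splitting on whitespace with punctuation stripped from both ends is a simplifying assumption. It also counts all page text (menus, footers), not only paragraphs:

```python
from collections import Counter
from html.parser import HTMLParser
from string import punctuation


class WordCounter(HTMLParser):
    """Counts words in every text node once, however the tags are nested."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        for raw in data.split():
            # Normalize: strip surrounding punctuation and lowercase
            word = raw.strip(punctuation).lower()
            if word:
                self.counts[word] += 1


def count_words(html):
    """Return a Counter of the words in an HTML string."""
    parser = WordCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# A paragraph and a div nested inside another div
sample = "<div><p>The cat saw the dog.</p><div>The end</div></div>"
counts = count_words(sample)
print(counts.most_common(1))  # [('the', 3)]
```

Here "the" is counted three times, matching what is actually in the sample, because the nested elements do not contribute their text a second time.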
