How to Download webpage as .mhtml
S

5

8

I am able to successfully open a URL and save the resultant page as a .html file. However, I am unable to determine how to download and save a .mhtml (Web Page, Single File).

My code is:

import time
import urllib.parse
import urllib.request

url = 'https://www.example.com'

# Percent-encode the whole URL so it can be passed as a query value
encoded_url = urllib.parse.quote(url, safe='')
print(encoded_url)

base_url = "https://translate.google.co.uk/translate?sl=auto&tl=en&u="
translation_url = base_url + encoded_url
print(translation_url)

req = urllib.request.Request(translation_url, headers={'User-Agent': 'Mozilla/6.0'})
print(req)

response = urllib.request.urlopen(req)
time.sleep(15)
print(response)

webContent = response.read()
print(webContent)

# Write the raw bytes; the with-block closes the file on exit
with open('GoogleTranslated.html', 'wb') as f:
    f.write(webContent)
    print(f)

I have tried to use wget, following the details in this question: How to download a webpage (mhtml format) using wget in python, but the details are incomplete (or I am simply unable to understand them).

Any suggestions would be helpful at this stage.
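
As a side note on the request itself, the query string can be built with urllib.parse.urlencode instead of manual concatenation. The following is only a minimal sketch: the function names build_translate_url and fetch_to_file are hypothetical, the endpoint and parameters are copied from the code above, and it still saves plain HTML, not MHTML.

```python
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the question's code
BASE_URL = "https://translate.google.co.uk/translate"


def build_translate_url(page_url, source="auto", target="en"):
    """Return the translate URL with page_url percent-encoded as the `u` value."""
    query = urlencode({"sl": source, "tl": target, "u": page_url})
    return BASE_URL + "?" + query


def fetch_to_file(url, path, user_agent="Mozilla/6.0"):
    """Download url and write the raw response bytes to path."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Both context managers close their resources on exit
    with urllib.request.urlopen(req) as response, open(path, "wb") as f:
        f.write(response.read())


# Usage (performs a network request):
# fetch_to_file(build_translate_url("https://www.example.com"), "GoogleTranslated.html")
```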

Sethsethi answered 22/2, 2020 at 12:8 Comment(5)
What error did you get when using wget?Suborder
I was unable to work out how to apply the wget options from the question I referenced to the wget module as it is used in Python. I was able to download an .html file with wget using: import wget wget.download("example.com", "test.html")Sethsethi
The linked question's only answer shows how to download a page tree, but doesn't show how to save it as .mhtml. I don't think there's a way to do that with wget but it should not be hard to do with Python once you understand the format. Basically, create an email.message.EmailMessage and attach each downloaded page element.Thundersquall
@Thundersquall - I should point out that I have used the browser based "Save As" option and the only options which provides me with a truly 'offline' version of the page is "Web Page, Complete". It would seem that .mhtml option is also not appropriate. Finally, all this is related to me trying to save the output of a google translate request. Will the email.message.EmailMessage option you mentioned work in my case? Thanks.Sethsethi
It's the format used as the MHTML container, what you save and how it's useful is up to you. If you want a translation, why do you care about anything else on the page?Thundersquall
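
The container idea from the comments above can be sketched with the standard email package. This is a rough illustration under assumptions, not a complete MHTML writer: build_mhtml is a hypothetical helper, the caller is assumed to have already downloaded the HTML and each resource, and whether a given browser opens the result is not guaranteed.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, resources):
    """Pack a page and its resources into one multipart/related message.

    resources is a list of (url, mime_type, payload_bytes) tuples that the
    caller has already downloaded (hypothetical input format).
    """
    msg = MIMEMultipart("related", type="text/html")
    msg["Subject"] = "Saved page"

    # The root HTML document comes first
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    msg.attach(root)

    # Each image, stylesheet, etc. becomes a base64 part keyed by its URL
    for url, mime_type, payload in resources:
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(payload)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        msg.attach(part)

    return msg.as_bytes()


# Usage:
# data = build_mhtml("https://example.com/", "<html>...</html>", [])
# with open("page.mhtml", "wb") as f:
#     f.write(data)
```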
H
4

Did you try using Selenium with a Chrome WebDriver to save the page?

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
FILE_NAME = ''

# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)


# wait until body is loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.TAG_NAME, 'body')))
time.sleep(1)
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
if FILE_NAME != '':
    pyautogui.typewrite(FILE_NAME)
pyautogui.hotkey('enter')
Housecarl answered 22/2, 2020 at 13:33 Comment(2)
Worked perfectly. Thank you!! I had to specify the location of the chromedriver in the Python script even though I had added it to my PATH.Sethsethi
How do I select the Save as type: Webpage, Single File (*.mhtml)?Zebrass
F
8

Compared with the previous answers, my solution does not involve any simulated mouse or keyboard operations. The downloaded MHTML file can also be stored in any location you provide. I learnt this method from a Chinese blog. The key idea is to use a Chrome DevTools Protocol command.

The code is shown below as an example.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.qq.com/')

# Execute Chrome dev tool command to obtain the mhtml file
res = driver.execute_cdp_cmd('Page.captureSnapshot', {})

# Write the file locally
with open('./store/qq.mhtml', 'w', newline='') as f:   
    f.write(res['data'])

driver.quit()

Hope this helps! You can read more about the Chrome DevTools Protocol here.
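
The snapshot step can be wrapped in a small helper. The sketch below rests on a few assumptions: save_mhtml_snapshot is a hypothetical name, driver is any object exposing Selenium's execute_cdp_cmd (a Chrome driver in practice), and the target directory may not exist yet, so it is created first.

```python
from pathlib import Path


def save_mhtml_snapshot(driver, path):
    """Capture the current page through CDP and write it to path as MHTML."""
    # 'format' is optional in Page.captureSnapshot; it is passed explicitly here
    result = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})

    target = Path(path)
    # Create the parent folder so open() does not fail on a missing directory
    target.parent.mkdir(parents=True, exist_ok=True)

    # newline='' writes the snapshot's line endings through unchanged
    with open(target, "w", newline="", encoding="utf-8") as f:
        f.write(result["data"])
    return target


# Usage (requires selenium and a Chrome driver):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get('https://www.qq.com/')
# save_mhtml_snapshot(driver, './store/qq.mhtml')
# driver.quit()
```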

Final answered 13/7, 2022 at 9:29 Comment(1)
I think the Chinese blog link is cnblogs.com/superhin/p/12600358.html. That's a very smart method.Sobel
G
2

To save the page as MHTML, add the argument '--save-page-as-mhtml':

options = webdriver.ChromeOptions()
options.add_argument('--save-page-as-mhtml')
driver = webdriver.Chrome(options=options)

Garica answered 17/6, 2021 at 11:41 Comment(0)
L
0

I wrote it just the way it was. Sorry if it's wrong.
I wrapped it in a class so you can reuse it. A usage example is in the last lines of the code below.
You can also change the sleep durations as you like.
Incidentally, non-English keyboard layouts such as Japanese and Hangul are also supported, because the file name is pasted from the clipboard.

import time
import uuid

import chromedriver_binary  # importing it adds chromedriver to PATH
import pyautogui
import pyperclip
from selenium import webdriver


class DonwloadMhtml(webdriver.Chrome):
    def __init__(self):
        super().__init__()
        self._first_save = True
        time.sleep(2)

    
    def save_page(self, url, filename=None):
        self.get(url)


        time.sleep(3)
        # open 'Save as...' to save html and assets
        pyautogui.hotkey('ctrl', 's')
        time.sleep(1)

        if filename is None:
            pyperclip.copy(str(uuid.uuid4()))
        else:
            pyperclip.copy(filename)
            
        time.sleep(1)
        pyautogui.hotkey('ctrl', 'v')
        time.sleep(2)
        
        
        if self._first_save:
            pyautogui.hotkey('tab')
            time.sleep(1)
            pyautogui.press('down')
            time.sleep(1)
            pyautogui.press('up')
            time.sleep(1)
            pyautogui.hotkey('enter')
            time.sleep(1)
            self._first_save = False
            
        pyautogui.hotkey('enter')
        time.sleep(1)


# example
dm = DonwloadMhtml()


dm.save_page('https://en.wikipedia.org/wiki/Python_(programming_language)', 'wikipedia_python')         # create file named "wikipedia_python.mhtml"
dm.save_page('https://www.python.org/')                                                                 # file named randomly based on uuid4

Environment: Python 3.8.10, selenium==4.4.3

Leningrad answered 3/1, 2023 at 13:19 Comment(0)
P
0
# -*- coding: utf-8 -*-
# author = 'AlenWesker'
import os
import os.path

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui

TIME_OUT = 60
MAX_LENGTH_GUI = 1024


def split_string(input_string, max_length):
    result = []
    for i in range(0, len(input_string), max_length):
        result.append(input_string[i:i + max_length])
    return result


def download_as_mhtml(url, path):
    # https://mcmap.net/q/1279871/-how-to-download-webpage-as-mhtml
    # You need to download ChromeDriver first: http://chromedriver.storage.googleapis.com/index.html
    # or https://googlechromelabs.github.io/chrome-for-testing/
    # Make sure you download the correct version.
    # Place chromedriver.exe in Chrome's exe folder and add that folder to your system PATH,
    # so that chromedriver can be run directly from cmd.exe.
    URL = url
    FILE_NAME = path

    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    options = webdriver.ChromeOptions()
    options.add_argument('--save-page-as-mhtml')
    driver = webdriver.Chrome(options=options)
    # driver = webdriver.Chrome()
    driver.get(URL)

    # wait until body is loaded
    WebDriverWait(driver, TIME_OUT).until(visibility_of_element_located((By.TAG_NAME, 'body')))
    time.sleep(1)
    # open 'Save as...' to save html and assets
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    if FILE_NAME != '':
        for s in split_string(FILE_NAME, MAX_LENGTH_GUI):
            pyautogui.typewrite(s)
    time.sleep(1)
    pyautogui.hotkey('enter')
    pyautogui.hotkey('alt', 's')  # You need to trigger the save
    # implicitly_wait() only sets the element-lookup timeout and does not pause,
    # so sleep instead to give Chrome time to write the file
    time.sleep(10)


if __name__ == '__main__':
    # out_dir = os.path.join(os.path.dirname(__file__), "../temp")
    out_dir = "d:\\temp"  # Keep the path short; I have not figured out why pyautogui fails to type long strings
    # out_dir = ""
    p = os.path.normpath(os.path.join(out_dir, "poet.mhtml"))  # .replace('\\', '/')
    download_as_mhtml("https://poet.so", p)  # Make sure your default Chrome can access it

My version runs with Chrome 120, and all the steps are listed in the code comments. It builds on all the answers above; thank you, everyone.
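
Instead of a fixed sleep after triggering the save dialog, one option is to poll until the output file appears and stops growing. This is a sketch only: wait_for_file is a hypothetical helper, and the default timeout and polling interval are arbitrary assumptions.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Return True once path exists, is non-empty, and its size is stable.

    Returns False if that has not happened within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            # Same non-zero size on two consecutive polls: treat the write as finished
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False


# Usage, after the save dialog has been confirmed:
# if not wait_for_file("d:\\temp\\poet.mhtml", timeout=60):
#     print("The MHTML file was not written in time")
```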

Partnership answered 31/12, 2023 at 14:5 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.