Qt - Getting source (HTML code) of a web page hosted on the internet
P

4

7

I want to get the source (HTML) of a webpage, for example the homepage of StackOverflow.

This is what I've coded so far:

QNetworkAccessManager manager;
QNetworkReply *response = manager.get(QNetworkRequest(QUrl(url)));

QString html = response->readAll(); // Source should be stored here

But nothing happens! When I try to get the value of the html string it's empty ("").

So, what to do? I am using Qt 5.3.1.

Pleasing answered 25/7, 2014 at 23:34 Comment(0)
R
7

You have to run a local QEventLoop between the request and the read, so that the reply has finished before you call readAll().

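QNetworkAccessManager manager;
QNetworkReply *response = manager.get(QNetworkRequest(QUrl(url)));
QEventLoop event;
// Leave the local loop as soon as the reply has finished.
QObject::connect(response, &QNetworkReply::finished, &event, &QEventLoop::quit);
event.exec(); // blocks here while events keep being processed
QString html = response->readAll(); // Source should be stored here
response->deleteLater(); // the caller is responsible for deleting the reply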
Rustle answered 25/7, 2014 at 23:52 Comment(4)
This is bad advice, since you're writing asynchronous code as if it were synchronous. It isn't. If you didn't forget to actually exec() the event loop, you'd be exposing the asker to the arbitrary consequences of event.exec() potentially reentering this method, or any other methods. Since most people don't design their code with such complications in mind, I consider it a source of undefined behavior, liable to format your hard drive or launch a nuclear strike. Explicitly asynchronous coding, with the help of C++11 and Qt 5, is a more peaceful alternative. - Camillecamilo
Well, the asker tried to get the HTML with synchronous code, which is why I showed him this solution. And sometimes it is easier to do it that way. Thank you for pointing out the event.exec() mistake. - Rustle
Yes, I agree that introducing undefined behavior into your application is easy. That doesn't mean you should do it. Qt makes asynchronous coding relatively easy thanks to signals/slots even in Qt 4. With C++11 and Qt 5 there's really zero excuse for suggesting spinning local event loops and similar craziness. - Camillecamilo
Thank you so much. People didn't like your solution and said it's a really bad one, but it's the only one that worked for me! Thanks again. - Pleasing
A
9

You need to code it in asynchronous fashion. C++11 and Qt come to the rescue. Just remember that the body of the lambda will execute later from the event loop.

// https://github.com/KubaO/stackoverflown/tree/master/questions/html-get-24965972
#include <QtNetwork>
#include <functional>

void htmlGet(const QUrl &url, const std::function<void(const QString&)> &fun) {
   QScopedPointer<QNetworkAccessManager> manager(new QNetworkAccessManager);
   QNetworkReply *response = manager->get(QNetworkRequest(QUrl(url)));
   QObject::connect(response, &QNetworkReply::finished, [response, fun]{
      response->deleteLater();
      response->manager()->deleteLater();
      if (response->error() != QNetworkReply::NoError) return;
      auto const contentType =
            response->header(QNetworkRequest::ContentTypeHeader).toString();
      static QRegularExpression re("charset=([!-~]+)");
      auto const match = re.match(contentType);
      if (!match.hasMatch() || 0 != match.captured(1).compare("utf-8", Qt::CaseInsensitive)) {
         qWarning() << "Content charsets other than utf-8 are not implemented yet:" << contentType;
         return;
      }
      auto const html = QString::fromUtf8(response->readAll());
      fun(html); // do something with the data
   }) && manager.take(); // release ownership only if connect() succeeded; the lambda deletes the manager later
}

int main(int argc, char *argv[])
{
   QCoreApplication app(argc, argv);
   htmlGet({"http://www.google.com"}, [](const QString &body){ qDebug() << body; qApp->quit(); });
   return app.exec();
}

Unless you're only using this code once, you should keep the QNetworkAccessManager instance as a member of your controller class, or create it in main(), etc., instead of creating a new one for every request.
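The charset check in the code above can also be tried in isolation. The following is a minimal sketch in plain C++, using std::regex instead of QRegularExpression so that it compiles without Qt. The function names charsetOf and isUtf8 are made up for illustration; they are not part of Qt or of the answer's code.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Hypothetical helper: returns the charset named in a Content-Type header
// value, lower-cased, or an empty string if no charset parameter is found.
std::string charsetOf(const std::string &contentType) {
    // Same pattern as in the answer: "charset=" followed by a run of
    // printable, non-space ASCII characters ('!' through '~').
    static const std::regex re("charset=([!-~]+)");
    std::smatch match;
    if (!std::regex_search(contentType, match, re)) return {};
    std::string charset = match[1].str();
    // Lower-case the value so the comparison is case-insensitive.
    std::transform(charset.begin(), charset.end(), charset.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return charset;
}

// Hypothetical helper: true only if the header explicitly declares UTF-8.
bool isUtf8(const std::string &contentType) {
    return charsetOf(contentType) == "utf-8";
}
```

For example, charsetOf("text/html; charset=UTF-8") yields "utf-8", while a header without a charset parameter yields an empty string. Like the original pattern, this sketch does not strip quotes, so a quoted value such as charset="utf-8" would not compare equal to utf-8.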

Anadiplosis answered 26/7, 2014 at 0:24 Comment(4)
Just asking about your first if statement: why do you return if the error is not NoError? - Evaevacuant
@reggie_jimac I return if there is an error (the error status is other than NoError). If there is an error, there's likely no valid data, and further processing is pointless. - Camillecamilo
Using objects after calling deleteLater can be considered bad style. If someone adds an operation that causes event processing in the middle, the code will silently become invalid. - Undershoot
@PavelStrakhov Nested event loops do not process deleteLater events, for the very reason you cite. That's another reason why pretend-synchronous programming is bad. - Camillecamilo
U
6

QNetworkAccessManager works asynchronously. You call readAll() immediately after get(), but at that moment the request has not completed yet, so there is nothing to read. You need to use the QNetworkAccessManager::finished signal, as shown in the documentation, and move readAll() to the slot connected to this signal.
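To make the ordering concrete, here is a toy model in plain C++ of a request whose data only arrives once an event loop runs. It is not Qt code: ToyEventLoop, ToyReply and toyGet are hypothetical stand-ins invented for this sketch, and it only illustrates why reading right after starting the request gives an empty result.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for an event loop: it stores callbacks to run later.
class ToyEventLoop {
public:
    void post(std::function<void()> task) { tasks_.push(std::move(task)); }
    // Runs queued callbacks until none are left.
    void exec() {
        while (!tasks_.empty()) {
            auto task = std::move(tasks_.front());
            tasks_.pop();
            task();
        }
    }
private:
    std::queue<std::function<void()>> tasks_;
};

// Hypothetical stand-in for a reply: the body stays empty until the loop
// delivers the data, after which onFinished is called.
struct ToyReply {
    std::string body;
    std::function<void()> onFinished;
    std::string readAll() const { return body; }
};

// Starts a "request": returns immediately, the data arrives on a later
// pass through the event loop.
void toyGet(ToyEventLoop &loop, ToyReply &reply) {
    loop.post([&reply] {
        reply.body = "<html>...</html>";
        if (reply.onFinished) reply.onFinished();
    });
}
```

Calling reply.readAll() directly after toyGet(loop, reply) returns an empty string, because nothing has run yet. Setting reply.onFinished to a callback that calls readAll(), and then calling loop.exec(), is the toy equivalent of moving readAll() into the slot connected to the finished signal.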

Undershoot answered 25/7, 2014 at 23:52 Comment(0)
J
0

A short answer including the essential part in C++17:

const auto manager = new QNetworkAccessManager(this);
connect(manager, &QNetworkAccessManager::finished,
        this, [](auto reply) {
            qDebug() << reply->readAll();
            reply->deleteLater(); // the reply must be deleted by the caller
        });
manager->get(QNetworkRequest({ "https://www.google.com" }));
Joey answered 25/5, 2022 at 19:1 Comment(0)
