Using CefSharp.Offscreen to retrieve a web page that requires JavaScript to render
E

2

12

I have what is hopefully a simple task, but it's going to take someone that's versed in CefSharp to solve it.

I have a URL that I want to retrieve the HTML from. The problem is that this particular URL doesn't actually deliver the page on a GET. Instead, it pushes a mound of JavaScript to the browser, which then executes and produces the actual rendered page. This means that the usual approaches involving HttpWebRequest and HttpWebResponse aren't going to work.

I've looked at a number of different "headless" options, and the one that I think best meets my needs for a number of reasons is CefSharp.Offscreen. But I'm at a loss as to how this thing works. I see that there are several events that can be subscribed to, and some configuration options, but I don't need anything like an embedded browser.

All I really need is a way to do something like this (pseudocode):

string html = CefSharp.Get(url);

I don't have a problem subscribing to events, if that's what's needed to wait for the JavaScript to execute and produce the rendered page.

Eldridge answered 18/2, 2016 at 1:45 Comment(5)
See gist.github.com/amaitland/9d8897067bdff5b999a1, which should get you started. - Whirlwind
@amaitland: Thanks. What is the current way to wait for the JavaScript to execute and the page to fully render before getting the resulting HTML? NavStateChangedEventArgs doesn't appear to exist anymore. - Eldridge
NavStateChanged is now LoadingStateChanged. There is no event that waits for JavaScript to finish executing; the best you get out of the box is that the page has finished loading. I've seen people just wait for a period of time, which I guess works in some cases. You might find it easiest to inject some JavaScript that checks some conditions on the page. - Whirlwind
The gist has been removed, @Whirlwind. - Separative
You can try github.com/cefsharp/CefSharp/blob/master/CefSharp.Test/… Otherwise, ask your own question and provide details of your specific use case, @Separative. - Whirlwind
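
The suggestion in the comments above (inject some JavaScript and check a condition on the page) could be sketched roughly as follows. This is only an illustration, not something posted in the thread: PageReadiness, WaitForConditionAsync, and the 250 ms polling interval are made-up names and values, and the sketch assumes the initial page load has already completed so that scripts can run in the main frame.

```csharp
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

public static class PageReadiness
{
    // Hypothetical helper (not part of CefSharp): repeatedly evaluates a
    // JavaScript expression until it returns true or the timeout elapses.
    public static async Task<bool> WaitForConditionAsync(
        ChromiumWebBrowser browser, string jsCondition, TimeSpan timeout)
    {
        DateTime deadline = DateTime.UtcNow + timeout;

        while (DateTime.UtcNow < deadline)
        {
            // EvaluateScriptAsync runs the script in the page's main frame.
            JavascriptResponse response = await browser.EvaluateScriptAsync(jsCondition);

            if (response.Success && response.Result is bool ready && ready)
            {
                return true; // the page reports that it is ready
            }

            // Assumed polling interval; tune it for the page in question.
            await Task.Delay(250);
        }

        return false; // timed out without the condition becoming true
    }
}
```

A call might look like `await PageReadiness.WaitForConditionAsync(browser, "document.querySelector('#content') !== null", TimeSpan.FromSeconds(10))`, where `#content` stands in for whatever element the page's scripts are expected to create.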
E
5

If you can't get a headless version of Chromium to help you, you could try node.js and jsdom. It is easy to install and play with once you have node up and running. You can see simple examples in the GitHub README where they pull down a URL, run all the JavaScript, including any custom JavaScript code (example: jQuery bits to count some type of elements), and then you have the HTML in memory to do what you want. You can just do $('body').html() and get a string, like in your pseudocode. (This even works for stuff like generating SVG graphics, since that is just more XML tree nodes.)

If you need this as part of a larger C# app that you need to distribute, your idea to use CefSharp.Offscreen sounds reasonable. One approach might be to get things working with CefSharp.WinForms or CefSharp.WPF first, where you can literally see things, then try CefSharp.Offscreen later when this all works. You can even get some JavaScript running in the on-screen browser to pull down body.innerHTML and return it as a string to the C# side of things before you go headless. If that works, the rest should be easy.

Perhaps start with CefSharp.MinimalExample and get that compiling, then tweak it for your needs. You need to be able to set webBrowser.Address in your C# code, and you need to know when the page has loaded. Then you need to call webBrowser.EvaluateScriptAsync(".. JS code ..") with your JavaScript code (as a string), which will do something as described (returning bodyElement.innerHTML as a string).
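
As a rough sketch of that last step, the helper below evaluates a small script and returns the body markup as a C# string. HtmlExtractor and GetBodyHtmlAsync are illustrative names, not part of CefSharp, and the sketch assumes the page has already finished loading before it is called.

```csharp
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

public static class HtmlExtractor
{
    // Hypothetical helper: assumes the browser has finished loading the page.
    public static async Task<string> GetBodyHtmlAsync(ChromiumWebBrowser browser)
    {
        // The script's value is marshalled back to .NET in JavascriptResponse.Result.
        JavascriptResponse response =
            await browser.EvaluateScriptAsync("document.body.innerHTML");

        if (!response.Success)
        {
            // Message carries the JavaScript error text when evaluation fails.
            throw new InvalidOperationException(
                "Script evaluation failed: " + response.Message);
        }

        // innerHTML is a JavaScript string, so Result is expected to be a .NET string.
        return response.Result as string;
    }
}
```

The same call should work with the WinForms or WPF browser controls, which makes it possible to check the script visually before switching to the offscreen browser.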

Extravagant answered 18/2, 2016 at 3:38 Comment(0)
B
11

I know I am doing some archaeology by reviving a two-year-old post, but a detailed answer may be of use to someone else.

So yes, CefSharp.OffScreen is up to the task.

Below is a class that handles all the browser activity.

using System;
using System.IO;
using System.Threading;
using CefSharp;
using CefSharp.OffScreen;

namespace MyApp // placeholder: use your own namespace
{
    public class Browser
    {

        /// <summary>
        /// The browser page
        /// </summary>
        public ChromiumWebBrowser Page { get; private set; }
        /// <summary>
        /// The request context
        /// </summary>
        public RequestContext RequestContext { get; private set; }

        // Chromium does not enforce a load timeout, so we implement one ourselves
        private ManualResetEvent manualResetEvent = new ManualResetEvent(false);

        public Browser()
        {
            var settings = new CefSettings()
            {
                //By default CefSharp will use an in-memory cache, you need to specify a Cache Folder to persist data
                CachePath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData), "CefSharp\\Cache"),
            };

            //Autoshutdown when closing
            CefSharpSettings.ShutdownOnExit = true;

            //Perform dependency check to make sure all relevant resources are in our output directory.
            Cef.Initialize(settings, performDependencyCheck: true, browserProcessHandler: null);

            RequestContext = new RequestContext();
            Page = new ChromiumWebBrowser("", null, RequestContext);
            PageInitialize();
        }

        /// <summary>
        /// Open the given url
        /// </summary>
        /// <param name="url">the url</param>
        /// <returns></returns>
        public void OpenUrl(string url)
        {
            try
            {
                Page.LoadingStateChanged += PageLoadingStateChanged;
                if (Page.IsBrowserInitialized)
                {
                    Page.Load(url);

                    //create a 60 sec timeout 
                    bool isSignalled = manualResetEvent.WaitOne(TimeSpan.FromSeconds(60));
                    manualResetEvent.Reset();

                    //The request may still be in flight, so force it to stop once the timeout has elapsed
                    if (!isSignalled)
                    {
                        Page.Stop();
                    }
                }
            }
            catch (ObjectDisposedException)
            {
                //Thrown by manualResetEvent.Reset() when a cancellation token has already disposed the context
            }
            finally
            {
                //Always unsubscribe, even if loading threw
                Page.LoadingStateChanged -= PageLoadingStateChanged;
            }
        }

        /// <summary>
        /// Manage the IsLoading parameter
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        private void PageLoadingStateChanged(object sender, LoadingStateChangedEventArgs e)
        {
            // Check to see if loading is complete - this event is raised twice: once when loading starts
            // and a second time when it has finished
            if (!e.IsLoading)
            {
                manualResetEvent.Set();
            }
        }

        /// <summary>
        /// Wait until page initialization
        /// </summary>
        private void PageInitialize()
        {
            SpinWait.SpinUntil(() => Page.IsBrowserInitialized);
        }
    }
}

Now in my app I just need to do the following:

public MainWindow()
{
    InitializeComponent();
    _browser = new Browser();
}

private async void GetGoogleSource()
{
    _browser.OpenUrl("http://icanhazip.com/");
    string source = await _browser.Page.GetSourceAsync();
}

And here is the string I get

"<html><head></head><body><pre style=\"word-wrap: break-word; white-space: pre-wrap;\">NotGonnaGiveYouMyIP:)\n</pre></body></html>"

Berliner answered 14/8, 2018 at 20:14 Comment(3)
Is there a reason you're using SpinWait? You could use a TaskCompletionSource and simply await the browser load. See github.com/cefsharp/CefSharp/blob/cefsharp/67/… - Whirlwind
I am using what I know ;) I'll take a look at this when I have time. I don't much like the SpinWait either, but as the initialization only takes a few tens of milliseconds, it doesn't bother my app too much. - Berliner
Take a look at line 67: github.com/cefsharp/CefSharp.MinimalExample/blob/… - Triphylite
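
The TaskCompletionSource approach mentioned in these comments could look something like the sketch below. It is not the code behind the truncated links. LoadAndWaitAsync is a hypothetical extension method, not part of CefSharp, and it has no timeout of its own, so the 60-second limit from the answer would need to be layered on top (for example with Task.WhenAny and Task.Delay).

```csharp
using System;
using System.Threading.Tasks;
using CefSharp;
using CefSharp.OffScreen;

public static class BrowserLoadExtensions
{
    // Hypothetical extension method: starts loading the URL and returns a Task
    // that completes when LoadingStateChanged reports IsLoading == false.
    public static Task LoadAndWaitAsync(this ChromiumWebBrowser browser, string url)
    {
        // RunContinuationsAsynchronously (.NET Framework 4.6+) keeps awaiting
        // code from running inline on the thread that raises the event.
        var tcs = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        EventHandler<LoadingStateChangedEventArgs> handler = null;
        handler = (sender, args) =>
        {
            // The event fires when loading starts and again when it stops.
            if (!args.IsLoading)
            {
                browser.LoadingStateChanged -= handler;
                tcs.TrySetResult(true);
            }
        };

        // Subscribe before calling Load so the completion event cannot be missed.
        browser.LoadingStateChanged += handler;
        browser.Load(url);

        return tcs.Task;
    }
}
```

With this in place, the blocking OpenUrl call in the answer could in principle be replaced by `await _browser.Page.LoadAndWaitAsync(url)` followed by `GetSourceAsync()`, which avoids blocking the calling thread while the page loads.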
