Live chat scraping (YouTube) with CasperJS: issue selecting Polymer elements
I am trying to scrape the text from YouTube live chat feeds using CasperJS. I am having trouble finding the correct selector. There are many nested elements, and new elements are generated dynamically for each message that gets pushed out. How might one go about continually pulling the nested

<span id="message">some message</span>

as they occur? I currently can't seem to grab even one! Here's my test code. Note: you can substitute any YouTube URL that has a live chat feed.

const casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});
const ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
const url = "https://www.youtube.com/watch?v=NksKCLsMUsI";
casper.start();
casper.userAgent(ua)
casper.thenOpen(url, function() {
  this.wait(3000, function() {
    if (this.exists("span#message")) {
      this.echo("found a message!");
    } else {
      this.echo("can't find a message");
    }
    casper.capture("test.png");
  });
});

casper.run();

My questions are exactly these: 1. How do I properly select the messages? 2. How might I continually listen for new ones?

UPDATE: I have been playing with Nightmare (an Electron-based testing suite), and it looks promising; however, I still can't seem to select the chat elements. I know I'm missing something simple.

EDIT / UPDATE (using cadabra's fine example)

var casper = require("casper").create({
  viewportSize: {
    width: 1024,
    height: 768
  }
});

url = 'https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFJVRTFVVlkwZEV4MFRFVWdBUSUzRCUzRDAB'
ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'

casper.start(url)
casper.userAgent(ua);

var currentMessage = '';

(function getPosts() {
  var post = null;

  casper.wait(1000, function () {
    casper.capture('test.png')
    post = this.evaluate(function () {
      var nodes = document.querySelectorAll('yt-live-chat-text-message-renderer'),
          author = nodes[nodes.length - 1].querySelector('#author-name').textContent,
          message = nodes[nodes.length - 1].querySelector('#message').textContent;

      return {
        author: author,
        message: message
      };
    });
  });

  casper.then(function () {
    if (currentMessage !== post.message) {
      currentMessage = post.message;
      this.echo(post.author + ' - ' + post.message);
    }
  });

  casper.then(function () {
    getPosts();
  });
})();

casper.run();
Accomplished answered 24/5, 2017 at 20:6 Comment(0)

This is much harder than you might think. Here is what I tried, without success:

1. Use ignore-ssl-errors option

YouTube uses HTTPS. This can be a real problem because PhantomJS does not handle SSL/TLS very well. To work around it, we need the ignore-ssl-errors option, which can be passed on the command line:

casperjs --ignore-ssl-errors=true script.js

2. Access the chat page instead of the iframe

The comments we are trying to scrape are not in the main page. They come from a separate page that is loaded in an iframe. In CasperJS, we could use the withFrame() method, but that is needless complexity for something we can access directly.

Main page | Chat page

3. Test with PhantomJS (WebKit) and SlimerJS (Gecko)

Due to YouTube limitations, both browsers give the same result:

Oh no!
It looks like you're using an older version of your browser. Please update it to use live chat.

If you want to test it yourself, here is the script:

var casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});

casper.start('https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFRtdHpTME5NYzAxVmMwa2dBUSUzRCUzRDAB');

casper.wait(5000, function () {
  this.capture('chat.png');
});

casper.run();

PhantomJS: casperjs --ignore-ssl-errors=true script.js

SlimerJS: casperjs --engine=slimerjs script.js

Conclusion: You may need to use a real web browser like Firefox or Chromium to achieve this. An automation framework like Nightwatch.js could help...


EDIT 1

OK, so... Using your user-agent string, this is working:

var casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});

casper.userAgent('Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0');

casper.start('https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFRtdHpTME5NYzAxVmMwa2dBUSUzRCUzRDAB');

casper.wait(5000, function () {
  this.each(this.evaluate(function () {
    var res = [],
        nodes = document.querySelectorAll('yt-live-chat-text-message-renderer'),
        author = null,
        message = null;

    for (var i = 0; i < nodes.length; i++) {
      author = nodes[i].querySelector('#author-name').textContent.toUpperCase();
      message = nodes[i].querySelector('#message').textContent.toLowerCase();
      res.push(author + ' - ' + message);
    }

    return res;
  }), function (self, post) {
    this.echo(post);
  });
});

casper.run();

With this script, you should see the latest messages from the conversation in your terminal. :)
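
One caveat with the loop above: querySelector() returns null when a child element is missing, and reading textContent on null throws inside evaluate(), which then appears to hand back null to the script. A minimal sketch of a more defensive extraction follows. The name extractPosts is hypothetical (it is not part of CasperJS), and since evaluate() runs its callback in the page context, the function would have to be defined inside that callback rather than at the top level of the script.

```javascript
// Hypothetical helper (not part of CasperJS): collects author/message pairs
// from an array-like list of chat nodes, skipping nodes with missing children.
function extractPosts(nodes) {
  var res = [];
  for (var i = 0; i < nodes.length; i++) {
    var author = nodes[i].querySelector('#author-name');
    var message = nodes[i].querySelector('#message');
    // querySelector returns null when nothing matches, so check before use.
    if (author && message) {
      res.push({
        author: author.textContent.trim(),
        message: message.textContent.trim()
      });
    }
  }
  return res;
}

// In the page context this could be called as:
//   extractPosts(document.querySelectorAll('yt-live-chat-text-message-renderer'));
```

An empty node list simply yields an empty array, so the calling code can tell "no messages yet" apart from an error.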


EDIT 2

Since the video is back, I spent some time modifying my previous code to implement real-time polling with a recursive IIFE. With the following script, I can get the latest comment in the chat stream. The script re-reads the chat DOM every second (the page refreshes its own content), and a post is echoed only if its message differs from the previously echoed one, which avoids duplicates.

var casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});

casper.userAgent('Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0');

casper.start('https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFRtdHpTME5NYzAxVmMwa2dBUSUzRCUzRDAB');

var currentMessage = '';

(function getPosts() {
  var post = null;

  casper.wait(1000, function () {
    post = this.evaluate(function () {
      var nodes = document.querySelectorAll('yt-live-chat-text-message-renderer'),
          last = nodes[nodes.length - 1];

      // No message rendered yet: return null instead of throwing.
      if (!last) {
        return null;
      }

      return {
        author: last.querySelector('#author-name').textContent,
        message: last.querySelector('#message').textContent
      };
    });
  });

  casper.then(function () {
    // Skip this round if nothing was found; echo only a new message.
    if (post && currentMessage !== post.message) {
      currentMessage = post.message;
      this.echo(post.author + ' - ' + post.message);
    }
  });

  casper.then(function () {
    getPosts();
  });
})();

casper.run();

It is working PERFECTLY on my computer.
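
Note that comparing only against the last echoed message can miss posts when several arrive within one polling interval. A minimal sketch of an alternative filter follows, assuming the page-side code is changed to return an array of { author, message } objects on every poll (the script above returns a single object). The names takeNewPosts and seen are hypothetical. The key is built from author and text, so an author repeating the exact same message would be treated as a duplicate.

```javascript
// Hypothetical helper: returns the posts that have not been seen before.
// `seen` is a plain object used as a set; it is updated in place.
function takeNewPosts(seen, posts) {
  var fresh = [];
  // evaluate() may return null (for example, if the page code threw).
  if (!posts) {
    return fresh;
  }
  for (var i = 0; i < posts.length; i++) {
    // Join author and message with a separator unlikely to occur in chat text.
    var key = posts[i].author + '\u0000' + posts[i].message;
    if (!Object.prototype.hasOwnProperty.call(seen, key)) {
      seen[key] = true;
      fresh.push(posts[i]);
    }
  }
  return fresh;
}
```

In the polling loop, seen would be created once outside getPosts(), and each round would echo whatever takeNewPosts(seen, posts) returns.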

Ataghan answered 25/5, 2017 at 23:34 Comment(13)
I was able to load the chat just fine by setting a modern user agent (see my example). I can see the chat in my screenshot. Also, I haven't had any SSL errors. Thanks for the info though! - Accomplished
In fact, I can load the whole DOM just fine; I just can't seem to select the dynamic element <yt-live-chat-renderer>. - Accomplished
Oh, I see... This YouTube warning is based only on the user-agent string. I'll look into it to see what I can get. :) - Ataghan
I'm checking out Nightwatch.js now. - Accomplished
You are right: the user-agent string does the trick. I can see comments on the screenshot... - Ataghan
I see what you're doing there! I can't seem to get any 'posts' to actually show up. I've extended the wait time to 10s and tried multiple URLs... Are you using the chat popout URL? - Accomplished
I really appreciate your help, BTW. I'm echoing out the nodes via this.echo(JSON.stringify(nodes, null, 2)) and getting a length of 0. I am also capturing the screen to make sure the chat is loading, and it is... Any ideas? Again, thanks! I was reading some posts about shadow DOMs and Polymer elements with Casper before I got hungry and ate dinner. I'm going to look into that again and get back to you. Also, I see that we're going to need to figure out how to keep the connection open and scraping... Maybe a recursive function on a timeout? - Accomplished
Let us continue this discussion in chat. - Accomplished
Getting a TypeError on post.message saying it's null and not an object. I suspected that it was firing before the chat frame rendered, but I captured the screen to be sure and it's there... Ideas? PS: I like the IIFE situation you've got going there! - Accomplished
What URL are you scraping? The current one is offline for the time being... - Ataghan
My script is still working... Just replace the current URL with this one: youtube.com/…. Then, run casperjs --ignore-ssl-errors=true script.js. - Ataghan
I'm still getting the TypeError. I'll paste my code above. - Accomplished
