Puppeteer error: ProtocolError: Protocol error (Target.createTarget): Target closed [closed]

I'm trying to scrape YouTube Shorts from a specific YouTube Channel, using Puppeteer running on MeteorJs Galaxy.

Here's the code that I've done so far:

import puppeteer from 'puppeteer';
import { YouTubeShorts } from '../imports/api/youTubeShorts'; //meteor mongo local instance

let URL = 'https://www.youtube.com/@ummahtoday1513/shorts'

const processShortsData = (iteratedData) => {
    let documentExist = YouTubeShorts.findOne({ videoId:iteratedData.videoId })
    if(documentExist === undefined) {  // undefined means this incoming short is a new one
        YouTubeShorts.insert({
            videoId: iteratedData.videoId,
            title: iteratedData.title,
            thumbnail: iteratedData.thumbnail,
            height: iteratedData.height,
            width: iteratedData.width
        })
    }
}

const fetchShorts = () => {
        puppeteer.launch({
            headless:true,
            args:[
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--single-process'
            ]
        })
        .then( async function(browser){
            async function fetchingData(){
                new Promise(async function(resolve, reject){
                    const page = await browser.newPage();
                
                    await Promise.all([
                        await page.setDefaultNavigationTimeout(0),
                        await page.waitForNavigation({waitUntil: "domcontentloaded"}),
                        await page.goto(URL, {waitUntil:["domcontentloaded", "networkidle2"]}),
                        await page.waitForSelector('ytd-rich-grid-slim-media', { visible:true }),
                        new Promise(async function(resolve,reject){
                            page.evaluate(()=>{
                                const trialData = document.getElementsByTagName('ytd-rich-grid-slim-media');
                                const titles = Array.from(trialData).map(i => {
                                    const singleData = {
                                        videoId: i.data.videoId,
                                        title: i.data.headline.simpleText,
                                        thumbnail: i.data.thumbnail.thumbnails[0].url,
                                        height: i.data.thumbnail.thumbnails[0].height,
                                        width: i.data.thumbnail.thumbnails[0].width,
                                    }
                                    return singleData
                                })
                                resolve(titles);
                            })
                        }),
                    ])
                    await page.close()
                })
                await browser.close()
            }

            async function fetchAndProcessData(){
                const datum = await fetchingData()
                console.log('DATUM:', datum)
            }
            await fetchAndProcessData()
        })
}

fetchShorts();

I am struggling with two things here:

  1. Async, await, and promises, and
  2. Finding the reason why Puppeteer outputs the error ProtocolError: Protocol error (Target.createTarget): Target closed in the console.

I'm new to puppeteer and trying to learn from various examples on StackOverflow and Google in general, but I'm still having trouble getting it right.

Sanderlin answered 20/2, 2023 at 4:1

A general word of advice: code slowly and test frequently, especially when you're in an unfamiliar domain. Try to minimize problems so you can understand what's failing. There are many issues here, giving the impression that the code was written in one fell swoop without incremental validation. There's no obvious entry point to debugging this.

Let's examine some failing patterns.

First, basically never use new Promise() when you're working with a promise-based API like Puppeteer. This is discussed in the canonical thread "What is the explicit promise construction antipattern and how do I avoid it?", so I'll avoid repeating the answers there.

Second, don't mix async/await and then. The point of promises is to flatten code and avoid pyramids of doom. If you find you have 5-6 deeply nested functions, you're misusing promises. In Puppeteer, there's basically no need for then.
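
To make the flattening concrete, here's a minimal sketch. The launch, newPage and getTitle functions are hypothetical stand-ins I made up for promise-returning calls; they are not Puppeteer APIs, so the snippet runs without a browser.

```javascript
// Hypothetical stand-ins for promise-returning calls (not Puppeteer APIs).
const launch = async () => ({name: "browser"});
const newPage = async browser => ({browser, title: "Shorts"});
const getTitle = async page => page.title;

// Nested then style: every step adds a level of indentation.
const nested = () =>
  launch().then(browser =>
    newPage(browser).then(page =>
      getTitle(page).then(title => title.toUpperCase())
    )
  );

// async/await style: the same three steps, one level deep.
const flat = async () => {
  const browser = await launch();
  const page = await newPage(browser);
  const title = await getTitle(page);
  return title.toUpperCase();
};

(async () => {
  console.log(await nested()); // SHORTS
  console.log(await flat()); // SHORTS
})();
```

Both versions produce the same result; the second one just reads top to bottom.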

Third, setting timeouts to infinity with page.setDefaultNavigationTimeout(0) suppresses errors. A long timeout is fine, but if a navigation takes more than a few minutes, something is wrong. You want an error you can understand and debug, not a script that waits silently until you kill it, with no diagnostics as to what went wrong or where it failed.

Fourth, watch out for pointless calls to waitForNavigation. Code like this doesn't make much sense:

await page.waitForNavigation(...);
await page.goto(...);

What navigation are you waiting for? This seems ripe for triggering timeouts, or worse yet, infinite hangs after you've set navigations to never time out.

Fifth, avoid premature abstractions. You have various helper functions but you haven't established functionally correct code, so these just add to the confused state of affairs. Start with correctness, then add abstractions once the cut points become obvious.

Sixth, avoid Promise.all() when all of the contents of the array are sequentially awaited. In other words:

await Promise.all([
  await foo(),
  await bar(),
  await baz(),
  await quux(),
  garply(),
]);

behaves the same as:

await foo();
await bar();
await baz();
await quux();
await garply();
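
If you want to convince yourself, here's a small sketch that records the order in which things start and finish. The step helper is a hypothetical stand-in for any promise-returning call, and the delays are arbitrary values I picked for illustration.

```javascript
// Promise-based sleep from Node's standard library.
const {setTimeout: sleep} = require("timers/promises");

// Hypothetical stand-in for any promise-returning call.
const step = async (name, ms, log) => {
  log.push(`start ${name}`);
  await sleep(ms);
  log.push(`end ${name}`);
  return name;
};

// Each inner await finishes before the next array element is evaluated,
// so Promise.all only ever receives already-resolved values.
const sequentialInsideAll = async () => {
  const log = [];
  await Promise.all([
    await step("foo", 30, log),
    await step("bar", 10, log),
  ]);
  return log;
};

// Without the inner awaits, both calls start before either finishes.
const concurrent = async () => {
  const log = [];
  await Promise.all([step("foo", 30, log), step("bar", 10, log)]);
  return log;
};

(async () => {
  console.log(await sequentialInsideAll());
  // order: start foo, end foo, start bar, end bar
  console.log(await concurrent());
  // order: start foo, start bar, end bar, end foo
})();
```

Promise.all only buys you something in the second form, where the promises are actually in flight at the same time.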

Seventh, always return promises if you have them:

const fetchShorts = () => {
  puppeteer.launch({
  // ..

should be:

const fetchShorts = () => {
  return puppeteer.launch({
  // ..

This way, the caller can await the function's completion. Without the return, the promise gets launched into the void and can never be connected with the caller's flow.
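
Here's a tiny sketch of the difference. The launch function is a hypothetical stand-in for puppeteer.launch() so the snippet runs on its own.

```javascript
// Hypothetical stand-in for puppeteer.launch().
const launch = async () => "browser ready";

// No return: the promise is created and then dropped.
const fetchWithoutReturn = () => {
  launch();
};

// With return: the caller receives the promise and can await it.
const fetchWithReturn = () => {
  return launch();
};

(async () => {
  console.log(await fetchWithoutReturn()); // undefined
  console.log(await fetchWithReturn()); // browser ready
})();
```

In the first case the caller awaits undefined, so it carries on immediately and never sees the result or any error.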

Eighth, evaluate doesn't have access to variables in Node, so this pattern doesn't work:

new Promise(resolve => {
  page.evaluate(() => resolve());
});

Instead, avoid the new promise antipattern and use the promise that Puppeteer already returns to you:

await page.evaluate(() => {});

Better yet, use $$eval here since it's an abstraction of the common pattern of selecting elements first thing in evaluate.
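
As for why Node variables aren't visible: Puppeteer serializes the callback and runs it in the browser, so closures don't travel with it. The sketch below imitates that with Node's built-in vm module standing in for the browser. fakeEvaluate is a hypothetical helper of mine and is far simpler than the real page.evaluate, which is asynchronous and talks to the browser over a protocol.

```javascript
const vm = require("vm");

// Hypothetical stand-in for page.evaluate: turn the function into source
// text and run it in a fresh context, with arguments sent as JSON.
const fakeEvaluate = (fn, ...args) =>
  vm.runInNewContext(`(${fn.toString()})(...${JSON.stringify(args)})`);

const selector = "ytd-rich-grid-slim-media";

// Closure variable: `selector` does not exist in the other context.
let closureError;
try {
  fakeEvaluate(() => selector.length);
} catch (err) {
  closureError = err;
}
console.log(closureError.name); // ReferenceError

// Passing the value as an argument works because arguments are serialized.
console.log(fakeEvaluate(sel => sel.length, selector)); // 24
```

With real Puppeteer the equivalent is passing extra arguments after the callback, as in page.evaluate(sel => ..., selector).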

Putting all of this together, here's a rewrite:

const puppeteer = require("puppeteer"); // ^19.6.3

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("ytd-rich-grid-slim-media");
  const result = await page.$$eval("ytd-rich-grid-slim-media", els =>
    els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
      videoId,
      title: headline.simpleText,
      thumbnail: thumbnails[0].url,
      height: thumbnails[0].height,
      width: thumbnails[0].width,
    }))
  );
  console.log(result);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Note that I ensure browser cleanup with finally so the process doesn't hang in case the code throws.

Now, all we want is a bit of text, so there's no sense in loading much of the extra stuff YouTube downloads. You can speed up the script by blocking anything unnecessary to your goal:

const [page] = await browser.pages();

await page.setRequestInterception(true);
page.on("request", req => {
  if (
    req.url().startsWith("https://www.youtube.com") &&
    ["document", "script"].includes(req.resourceType())
  ) {
    req.continue();
  }
  else {
    req.abort();
  }
});
// ...

Note that ["domcontentloaded", "networkidle2"] is basically the same as "networkidle2" since "domcontentloaded" will happen long before "networkidle2". But please avoid "networkidle2" here since all you need is some text, which doesn't depend on all network resources.
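
If the request filter grows, it can help to pull the allow/deny rule out of the handler so you can check it without launching a browser. This is only a sketch: shouldAllow is a hypothetical helper name, and the URLs below are made-up examples rather than requests I captured.

```javascript
// Hypothetical helper: the same rule as the request handler above,
// expressed as a pure function of the URL and resource type.
const shouldAllow = (url, resourceType) =>
  url.startsWith("https://www.youtube.com") &&
  ["document", "script"].includes(resourceType);

// Illustrative URLs only.
console.log(shouldAllow("https://www.youtube.com/@someChannel/shorts", "document")); // true
console.log(shouldAllow("https://www.youtube.com/some/script.js", "script")); // true
console.log(shouldAllow("https://www.youtube.com/some/image.png", "image")); // false
console.log(shouldAllow("https://example.com/some/script.js", "script")); // false

// Inside the handler, using the same req.url() and req.resourceType()
// calls as above, this would read:
// page.on("request", req =>
//   shouldAllow(req.url(), req.resourceType()) ? req.continue() : req.abort()
// );
```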

Once you've established correctness, if you're ready to factor this to a function, you can do so:

const fetchShorts = async () => {
  const url = "<Your URL>";
  let browser;

  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto(url, {waitUntil: "domcontentloaded"});
    await page.waitForSelector("ytd-rich-grid-slim-media");
    return await page.$$eval("ytd-rich-grid-slim-media", els =>
      els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
        videoId,
        title: headline.simpleText,
        thumbnail: thumbnails[0].url,
        height: thumbnails[0].height,
        width: thumbnails[0].width,
      }))
    );
  }
  finally {
    await browser?.close();
  }
};

fetchShorts()
  .then(shorts => console.log(shorts))
  .catch(err => console.error(err));

But keep in mind, making the function responsible for managing the browser resource hampers its reusability and slows it down considerably, since every call launches and closes its own browser. I usually let the caller handle the browser and make all of my scraping helpers accept a page argument:

const fetchShorts = async page => {
  const url = "<Your URL>";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("ytd-rich-grid-slim-media");
  return await page.$$eval("ytd-rich-grid-slim-media", els =>
    els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
      videoId,
      title: headline.simpleText,
      thumbnail: thumbnails[0].url,
      height: thumbnails[0].height,
      width: thumbnails[0].width,
    }))
  );
};

(async () => {
  let browser;

  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();
    console.log(await fetchShorts(page));
  }
  catch (err) {
    console.error(err);
  }
  finally {
    await browser?.close();
  }
})();
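
A side benefit of accepting a page argument is that the helper can be exercised against a stub, with no browser at all. The sketch below assumes a few things that aren't in the code above: makeFakePage is a hypothetical stub implementing only the three methods the helper calls, the URL is passed in as a second parameter so the stub can record it, and the element data is invented to match the shape the helper reads.

```javascript
// Same helper as above, except the URL is a parameter (an assumption
// made for this sketch).
const fetchShorts = async (page, url) => {
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("ytd-rich-grid-slim-media");
  return await page.$$eval("ytd-rich-grid-slim-media", els =>
    els.map(({data: {videoId, headline, thumbnail: {thumbnails}}}) => ({
      videoId,
      title: headline.simpleText,
      thumbnail: thumbnails[0].url,
      height: thumbnails[0].height,
      width: thumbnails[0].width,
    }))
  );
};

// Hypothetical stub, not a Puppeteer class. It records calls and runs
// the $$eval callback directly against the supplied elements.
const makeFakePage = elements => {
  const calls = [];
  return {
    calls,
    goto: async url => { calls.push(`goto ${url}`); },
    waitForSelector: async sel => { calls.push(`wait ${sel}`); },
    $$eval: async (sel, fn) => fn(elements),
  };
};

// Invented data in the shape the helper destructures.
const fakeElements = [{
  data: {
    videoId: "abc123",
    headline: {simpleText: "Example short"},
    thumbnail: {
      thumbnails: [{url: "https://example.com/thumb.jpg", height: 720, width: 405}],
    },
  },
}];

(async () => {
  const page = makeFakePage(fakeElements);
  console.log(await fetchShorts(page, "https://example.com/shorts"));
  console.log(page.calls);
})();
```

This only checks the mapping logic and call order; it says nothing about whether YouTube's markup still matches the selector.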

Disclosure: I'm the author of linked blog posts

Blockish answered 20/2, 2023 at 5:12