Why does headless need to be false for Puppeteer to work?

3

4

I'm creating a web API that scrapes a given URL and sends the result back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console

and received an answer suggesting that it would only work with headless set to false. I don't want to keep opening a browser UI I don't need (I just need the data!), so I'd like to know why headless has to be false and whether there is a fix that lets me use headless: true.

Here's my code:

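const express = require("express");
const puppet = require("puppeteer");

const PORT = process.env.PORT || 3000; // assumed default; the original post does not show how PORT is defined

express()
  .get("/*", async (req, res) => {
    global.notBaseURL = req.params[0];
    let browser;
    try {
      browser = await puppet.launch({ headless: false }); // Line of Interest
      const page = await browser.newPage();
      console.log(req.params[0]);
      await page.goto(req.params[0], { waitUntil: "networkidle2" }); // this is the url
      const title = await page.$eval("title", (el) => el.innerText);
      res.send({ title: title });
    } catch (err) {
      res.status(500).send({ error: err.message }); // report failures instead of leaving the request hanging
    } finally {
      if (browser) await browser.close(); // always release the browser, even after an error
    }
  })
  .listen(PORT, () => console.log(`Listening on ${PORT}`));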

This is the page I'm trying to scrape: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106?origin=coordinating-5460106-0-1-FTR-recbot-recently_viewed_snowplow_mvp&recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK

Resinate answered 9/9, 2020 at 20:8 Comment(0)
10

The reason it might work in UI mode but not in headless mode is that sites that aggressively fight scraping can detect that you are running a headless browser.
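One signal such a site can read is the user agent: headless Chrome has typically reported a token like "HeadlessChrome" there unless you override it. The sketch below is only an illustration of that single signal (real detection uses many more). The helper names looksHeadless and reportUserAgent are made up for this example; browser.userAgent() is the Puppeteer call that returns the browser's default user agent.

```javascript
// Hypothetical helper: true if the user agent string advertises headless Chrome.
function looksHeadless(userAgent) {
  return /HeadlessChrome/i.test(userAgent);
}

// Hypothetical helper: launches a headless browser and reports the user agent it
// would send. Puppeteer is required lazily so the pure helper above can be used
// without it installed.
async function reportUserAgent(puppeteer = require("puppeteer")) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const ua = await browser.userAgent();
    return { ua, flagged: looksHeadless(ua) };
  } finally {
    await browser.close();
  }
}
```

Calling reportUserAgent() and logging the result is a quick way to see what your own setup reports before trying any of the workarounds.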

Some possible workarounds:

Use puppeteer-extra

Found here: https://github.com/berstend/puppeteer-extra Check out their docs for how to use it. It has a couple plugins that might help in getting past headless-mode detection:

  1. puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.
  2. puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
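A minimal sketch of the stealth plugin in use, assuming puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth are installed from npm. The helper names createStealthPuppeteer and scrapeTitle are hypothetical, and there is no guarantee this gets past any particular site's detection.

```javascript
// Hypothetical helper: wraps Puppeteer with puppeteer-extra and registers the
// stealth plugin. The packages are required lazily, only when this is called.
function createStealthPuppeteer() {
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");
  puppeteer.use(StealthPlugin());
  return puppeteer;
}

// Hypothetical helper: loads a page headlessly and returns its title.
async function scrapeTitle(url, puppeteer = createStealthPuppeteer()) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    await browser.close(); // close the browser even if navigation fails
  }
}
```

Usage would look like scrapeTitle("https://example.com").then(console.log). The second parameter exists only so a different Puppeteer-compatible object can be swapped in.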

Run a "real" Chromium instance/UI

It's possible to run a single browser UI in a manner that lets you attach puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0

Essentially, you start Chrome or Chromium (possibly Edge as well) from the command line with --remote-debugging-port=9222 (any free port should do), plus whatever other command line switches your environment requires. Then you use puppeteer to connect to that running instance instead of letting it launch its own Chromium instance, which is headless by default: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });. Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions

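ENDPOINT_URL is the address of the debugging port, for example http://127.0.0.1:9222 when the browser was started with --remote-debugging-port=9222. Note that the terminal prints a WebSocket URL when the browser starts (a line beginning "DevTools listening on ws://"); that value belongs in the browserWSEndpoint option rather than in browserURL.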

This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)
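As a rough sketch of the connect approach, assuming Chrome is already running with --remote-debugging-port=9222 on the same machine: puppeteer.connect accepts either browserURL (the HTTP address of the debugging port) or browserWSEndpoint (a ws:// URL). The helper names buildConnectOptions and withRunningBrowser are hypothetical.

```javascript
// Hypothetical helper: picks the matching puppeteer.connect option for an endpoint.
// ws:// and wss:// URLs go to browserWSEndpoint; anything else is treated as browserURL.
function buildConnectOptions(endpoint) {
  return endpoint.startsWith("ws://") || endpoint.startsWith("wss://")
    ? { browserWSEndpoint: endpoint }
    : { browserURL: endpoint };
}

// Hypothetical helper: attaches to the running browser, runs a task, then detaches.
async function withRunningBrowser(endpoint, task, puppeteer = require("puppeteer")) {
  const browser = await puppeteer.connect(buildConnectOptions(endpoint));
  try {
    return await task(browser);
  } finally {
    await browser.disconnect(); // detach only; the shared Chrome instance keeps running
  }
}
```

For example, withRunningBrowser("http://127.0.0.1:9222", async (browser) => (await browser.newPage()).goto(someUrl)), where someUrl is whatever page you want to load. Using disconnect() rather than close() matters here, since close() would shut down the browser that other requests may still be using.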

There are other strategies I'm sure but those are the two I'm most familiar with. Good luck!

Outermost answered 9/9, 2020 at 22:45 Comment(0)
5

Todd's answer is thorough, but before resorting to some of the recommendations there, it's worth trying to set a human-looking user agent manually, based on the Puppeteer GitHub issue Different behavior between { headless: false } and { headless: true }:

const ua =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
await page.setUserAgent(ua);
await page.goto(yourURL);

Now, the Nordstrom site provided by OP seems to be able to detect robots even with headless: false, at least at the present moment. But other sites are less strict, and I've found the above line to be useful on some of them, as shown in Puppeteer can't find elements when Headless TRUE and Puppeteer , bringing back blank array, among many other cases.

Visit the GH issue thread above for other ideas and see useragents.me and the user-agents npm package for a rotating list of current user agents. The one provided here may not work.
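If you want to rotate between several user agents yourself, one simple approach is a round-robin picker. This is only a sketch: the list holds the two strings that appear in this thread purely as placeholders (they may be outdated), and the names USER_AGENTS, createUserAgentRotator and openWithRotatedUserAgent are made up for the example.

```javascript
// Placeholder list; in practice, fill it from a maintained source of current user agents.
const USER_AGENTS = [
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
];

// Hypothetical helper: returns a function that cycles through the list in order.
function createUserAgentRotator(agents) {
  let index = 0;
  return () => agents[index++ % agents.length];
}

const nextUserAgent = createUserAgentRotator(USER_AGENTS);

// Hypothetical helper: opens a new page with the next user agent applied before navigating.
async function openWithRotatedUserAgent(browser, url) {
  const page = await browser.newPage();
  await page.setUserAgent(nextUserAgent());
  await page.goto(url);
  return page;
}
```

The user agent is set before page.goto so that the very first request already carries it.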

https://bot.sannysoft.com/ is a useful tool for checking to what extent your script may be seen as a bot.

Laspisa answered 1/2, 2022 at 6:50 Comment(0)
0

On top of Todd's answer, I would recommend passing args: ['--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'] when launching puppeteer. This sets the user agent at the browser level, so it does not change even if you open new tabs.
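In context, that could look roughly like the following. The helper names buildLaunchOptions and launchWithUserAgent are hypothetical; --user-agent is a Chromium command line switch that is passed through the args launch option.

```javascript
// Hypothetical helper: builds launch options that set the user agent for the whole browser.
function buildLaunchOptions(userAgent, headless = true) {
  return {
    headless,
    args: [`--user-agent=${userAgent}`],
  };
}

// Hypothetical helper: launches Puppeteer with the browser-level user agent applied.
// Puppeteer is required lazily, only when no instance is passed in.
async function launchWithUserAgent(userAgent, puppeteer = require("puppeteer")) {
  return puppeteer.launch(buildLaunchOptions(userAgent));
}
```

Every page opened from the returned browser should then report the same user agent without a per-page setUserAgent call.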

Helladic answered 17/7, 2024 at 11:37 Comment(0)
