Puppeteer on Heroku Error R10 (Boot timeout) Node (webscraping app)

I created a web scraping app, which checks for a certain problem on an ecommerce website.

What it does:

  • Loops through an array of pages
  • checks for a condition on every page
  • if condition is met - pushes page to temparray
  • sends an email with temparray as body

I wrapped that function in a cronjob function. On my local machine it runs fine.

Deployed like this:

  • headless: true
  • '--no-sandbox',
  • '--disable-setuid-sandbox'
  • Added the pptr buildpack link to settings in heroku
  • slugsize is 259.6 MiB of 500 MiB

It didn't work.

  • set boot timeout to 120s (instead of 60s)

It worked. But only ran once.

Since I want to run that function several times per day, I need to fix the issue.

I have another app running which uses the same cronjob and notification function and it works on heroku.

Here's my code, if anyone is interested.

const puppeteer = require('puppeteer');
const nodemailer = require("nodemailer");
const CronJob = require('cron').CronJob;
let articleInfo ='';
const mailArr = [];
let body = '';

const testArr = [
    'https://bxxxx..', 'https://b.xxx..', 'https://b.xxxx..',
];

async function sendNotification() {

    let transporter = nodemailer.createTransport({
      host: 'mail.brxxxxx.dxx',
      port: 587,
      secure: false,
      auth: {
        user: '[email protected]',
        pass: process.env.heyBfPW2
      }
    });
  
    let textToSend = 'This is the heading';
    let htmlText = body;
  
    let info = await transporter.sendMail({
      from: '"BB Checker" <hey@baxxxxx>',
      to: "[email protected]",
      subject: 'Hi there',
      text: textToSend,
      html: htmlText
    });
    console.log("Message sent: %s", info.messageId);
  }

async function boxLookUp (item) {
    const browser = await puppeteer.launch({
        headless: true,
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
          ],
    });
    const page = await browser.newPage();
    await page.goto(item);
    const content = await page.$eval('.set-article-info', div => div.textContent);
    const title = await page.$eval('.product--title', div => div.textContent);
    const orderNumber = await page.$eval('.entry--content', div => div.textContent);
    
    // Check if deliveryTime is already updated
    try {
        await page.waitForSelector('.delivery--text-more-is-coming');
    // if not
      } catch (e) {
        if (e instanceof puppeteer.errors.TimeoutError) {
          // if not updated check if all parts of set are available 
          if (content != '3 von 3 Artikeln ausgewählt' && content != '4 von 4 Artikeln ausgewählt' && content != '5 von 5 Artikeln ausgewählt'){
            articleInfo = `${title} ${orderNumber} ${item}`;
            mailArr.push(articleInfo) 
            }
        }
      }
    await browser.close();
};  

    const checkBoxes = async (arr) => {
    
    for (const i of arr) {
        await boxLookUp(i);
   }
   
   console.log(mailArr)
   body = mailArr.toString();
   sendNotification();
}

async function startCron() {
   
    let job = new CronJob('0 */10 8-23 * * *', function() {  // runs every 10 minutes from 08:00 through 23:50
        checkBoxes(testArr);
    }, null, true, null, null, true);
    job.start();
}

startCron();
Buckjumper answered 12/5, 2021 at 6:11 Comment(2)
Maybe this helps: https://mcmap.net/q/245628/-heroku-error-error-r10-boot-timeout-gt-web-process-failed-to-bind-to-port-within-60-seconds-of-launch If you only scrape, you can use a worker dyno (and avoid the issue of not binding to the port). - Treillage
Thanks. I added a Procfile like so: "worker: node nodeMailerCheck.js", but the same error was thrown. - Buckjumper

Had the same issue for 3 days now. Here's something that might help: https://mcmap.net/q/245629/-heroku-boot-timeout-error-r10

Has to be done alongside the Procfile thing.

Skerl answered 15/5, 2021 at 13:53 Comment(0)

Assuming the rest of the code works (nodemailer, etc), I'll simplify the problem to focus purely on running a scheduled Node Puppeteer task in Heroku. You can re-add your mailing logic once you have a simple example running.

Heroku runs scheduled tasks using simple job scheduling or a custom clock process.

Simple job scheduling doesn't give you much control, but is easier to set up and potentially less expensive in terms of billable hours if you're running it infrequently. The custom clock, on the other hand, will be a continuously-running process and therefore chew up hours.

A custom clock process can do your cron task exactly, so that's probably the natural fit for this case.

For certain scenarios, you can work around the simple scheduler's limited options and get more complicated schedules by having the task exit early or by deploying multiple apps.

For example, if you want a twice-daily schedule, you could have two apps that run the same task scheduled at different hours of the day. Or, if you wanted to run a task twice weekly, schedule it to run daily using the simple scheduler, then have it check its own time and exit immediately if the current day isn't one of the two desired days.
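
The "check its own time and exit immediately" idea could look roughly like the sketch below. It is only an illustration: `shouldRunToday`, `main` and the `RUN_DAYS` values (Monday and Thursday) are hypothetical, and the check uses the UTC weekday rather than a local time zone.

```javascript
// Hypothetical guard for a task that the simple scheduler starts every day
// but that should only do real work on selected weekdays.
// Day numbers follow Date.prototype.getUTCDay(): 0 = Sunday ... 6 = Saturday.
const RUN_DAYS = [1, 4]; // assumption: Monday and Thursday

function shouldRunToday(now = new Date(), runDays = RUN_DAYS) {
  return runDays.includes(now.getUTCDay());
}

function main(now = new Date()) {
  if (!shouldRunToday(now)) {
    console.log("Not a scheduled day, exiting early");
    return false;
  }
  // ... the actual scraping task would run here ...
  return true;
}
```

Calling `main()` at the bottom of the scheduled script would then make the daily job a no-op on the other five days.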

Regardless of whether you use a custom clock or simple scheduled task, note that long-running tasks really should be handled by a background task, so the examples below aren't production-ready. That's left as an exercise for the reader and isn't Puppeteer-specific.


Custom clock process

package.json:

{
  "name": "test-puppeteer",
  "version": "1.0.0",
  "description": "",
  "scripts": {
    "start": "echo 'running'"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cron": "^1.8.2",
    "puppeteer": "^9.1.1"
  }
}

Procfile

clock:  node clock.js

clock.js:

const {CronJob} = require("cron");
const puppeteer = require("puppeteer");

// FIXME move to a worker task; see https://devcenter.heroku.com/articles/node-redis-workers
const scrape = async () => {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox"]
  });
  const [page] = await browser.pages();
  await page.setContent(`<p>clock running at ${Date()}</p>`);
  console.log(await page.content());
  await browser.close();
};

new CronJob({
  cronTime: "*/30 * * * * *", // run every 30 seconds for demonstration purposes
  onTick: scrape,
  start: true,
});
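
When re-adding the original checking logic to `onTick`, one thing to watch is that a module-level array such as `mailArr` is never cleared, so in a long-running clock process it keeps growing across ticks and later emails repeat earlier findings. A minimal sketch of a per-run collector follows. `collectIncompleteSets` and the injected `lookUp` function are hypothetical names; `lookUp` stands in for a Puppeteer routine that is assumed to resolve to a description string when the page has the problem and to `null` when it does not.

```javascript
// Hypothetical helper: gathers the results of ONE run and returns them,
// instead of pushing into a module-level array shared by all cron ticks.
// `lookUp(url)` is assumed to resolve to a string for a problem page
// and to null for a page that is fine.
async function collectIncompleteSets(urls, lookUp) {
  const found = [];
  for (const url of urls) {
    try {
      const result = await lookUp(url); // pages are checked one at a time
      if (result !== null) found.push(result);
    } catch (err) {
      // One failing page should not abort the whole run.
      console.error(`Lookup failed for ${url}: ${err.message}`);
    }
  }
  return found;
}
```

The `onTick` handler could await this function and only send an email when the returned array is non-empty, assuming the lookup routine is changed to return its result and the mail function to take the body as an argument.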

Set up

  1. Install Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):

    heroku create
    heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
    heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835
    

    (replace cryptic-dawn-48835 with your app name)

  2. Deploy:

    git init
    git add .
    git commit -m "initial commit"
    heroku git:remote -a cryptic-dawn-48835
    git push heroku master
    
  3. Add a clock process:

    heroku ps:scale clock=1
    
  4. Verify that it's running with heroku logs --tail. heroku ps:scale clock=0 turns off the clock.


Simple scheduler

package.json:

Same as above, but no need for cron. No need for a Procfile either.

task.js:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox"]
  });
  const [page] = await browser.pages();
  await page.setContent(`<p>scheduled job running at ${Date()}</p>`);
  console.log(await page.content());
  await browser.close();
})();
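
In a scheduled one-off task it helps to make sure the browser is closed even when the scrape throws (for example, when `page.$eval` finds no matching element), since an open browser can keep the process from exiting. A minimal sketch, where `withBrowser` is a hypothetical helper name and `launch` is any function that resolves to a browser-like object with a `close()` method:

```javascript
// Hypothetical helper: runs `fn(browser)` and always closes the browser,
// whether `fn` resolves or throws. `launch` is injected so the helper does
// not depend on Puppeteer directly; in task.js it could be
// () => puppeteer.launch({args: ["--no-sandbox", "--disable-setuid-sandbox"]}).
async function withBrowser(launch, fn) {
  const browser = await launch();
  try {
    return await fn(browser);
  } finally {
    await browser.close();
  }
}
```

The body of the task would then move into the callback passed as `fn`, and the explicit `browser.close()` call at the end is no longer needed.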

Set up

  1. Install Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):

    heroku create
    heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
    heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835
    

    (replace cryptic-dawn-48835 with your app name)

  2. Deploy:

    git init
    git add .
    git commit -m "initial commit"
    heroku git:remote -a cryptic-dawn-48835
    git push heroku master
    
  3. Add a scheduler:

    heroku addons:create scheduler:standard -a cryptic-dawn-48835
    

    Configure the scheduler by running:

    heroku addons:open scheduler -a cryptic-dawn-48835
    

    This opens a browser and you can add a command node task.js to run every 10 minutes.

  4. Verify that it worked after 10 minutes with heroku logs --tail. The online scheduler will show the time of next/previous execution.


See this answer for creating an Express-based web app on Heroku with Puppeteer.

Tatianna answered 19/5, 2021 at 1:58 Comment(0)
