How to Build a Web Scraper with Puppeteer and Firebase Functions


Using Puppeteer, you can build a web scraper and deliver the scraped content to the web with Firebase Functions.

Website scraping is the process of downloading and extracting the content of a web page. Here, we’ll use the New York Times website as our source of information. The scraper will grab the top ten news headlines from the page and display them on a web page. The scraping is performed with the Puppeteer headless browser, and the web application is hosted on Firebase.

1. Initialize a Firebase Function

Assuming that you have already created a Firebase project, you can initialize the Firebase functions in a local environment by running the following command:

mkdir scraper
cd scraper
npx firebase init functions
cd functions
npm install puppeteer

Follow the prompts to initialize the project. We also install the Puppeteer package from npm so the scraper can use the Puppeteer headless browser.
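
For reference, firebase init generates a functions/package.json whose npm scripts wrap the Firebase CLI. The exact contents depend on your firebase-tools version, but the scripts section typically looks like this:

"scripts": {
  "serve": "firebase emulators:start --only functions",
  "shell": "firebase functions:shell",
  "start": "npm run shell",
  "deploy": "firebase deploy --only functions",
  "logs": "firebase functions:log"
}

These are the npm run serve, npm run shell, and npm run deploy commands used later in this tutorial.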

2. Create a Node.js Application

Create a new pptr.js file in the functions folder to hold the scraping code. The script downloads only the HTML content of the page and blocks scripts, images, stylesheets, videos, and fonts to reduce the time it takes to load the page.

We use a CSS selector to pick the headlines, which are wrapped in h3 tags inside section.story-wrapper elements. You can use Chrome DevTools to inspect the page and find a suitable selector.
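
Before writing any Puppeteer code, you can sanity-check the selector by pasting a quick snippet into the DevTools console while nytimes.com is open (this is only a manual check, not part of the scraper):

// Run in the DevTools console on nytimes.com to preview what the scraper will match
Array.from(document.querySelectorAll('section.story-wrapper h3'))
  .slice(0, 10)
  .map((h3) => h3.innerText);

With the selector verified, the complete pptr.js script looks like this: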

const puppeteer = require('puppeteer');

const scrapeWebsite = async () => {
  let stories = [];
  const browser = await puppeteer.launch({
    headless: true,
    timeout: 20000,
    ignoreHTTPSErrors: true,
    slowMo: 0,
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--window-size=1280,720',
    ],
  });

  try {
    const page = await browser.newPage();

    await page.setViewport({ width: 1280, height: 720 });

    // Block images, videos, fonts from downloading
    await page.setRequestInterception(true);

    page.on('request', (interceptedRequest) => {
      const blockResources = ['script', 'stylesheet', 'image', 'media', 'font'];
      if (blockResources.includes(interceptedRequest.resourceType())) {
        interceptedRequest.abort();
      } else {
        interceptedRequest.continue();
      }
    });

    // Change the user agent of the scraper
    await page.setUserAgent(
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
    );

    await page.goto('https://www.nytimes.com/', {
      waitUntil: 'domcontentloaded',
    });

    const storySelector = 'section.story-wrapper h3';

    // Only get the top 10 headlines
    stories = await page.$$eval(storySelector, (divs) =>
      divs.slice(0, 10).map((div, index) => `${index + 1}. ${div.innerText}`)
    );
  } catch (error) {
    console.log(error);
  } finally {
    if (browser) {
      await browser.close();
    }
  }
  return stories;
};

module.exports = scrapeWebsite;
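
If you want to check the scraper on its own before wiring it into a Cloud Function, a minimal test script can call it directly. This test.js file is not part of the tutorial, just a quick local check you can run with node test.js from the functions folder:

// test.js — runs the scraper locally and prints the headlines
const scrapeWebsite = require('./pptr');

scrapeWebsite().then((stories) => {
  console.log(stories.join('\n'));
});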

3. Write the Firebase Function

Inside the index.js file, import the scraper function and export it as a Firebase function. We also write a scheduled function that runs every day and calls the scraper.

It is important to increase the function’s memory and timeout limits, as Chrome with Puppeteer is resource-heavy.

// index.js
const functions = require('firebase-functions');
const scrapeWebsite = require('./pptr');

exports.scrape = functions
  .runWith({
    timeoutSeconds: 120,
    memory: '1GB', // default is 256MB; raise to '2GB' if Chrome runs out of memory
  })
  .region('us-central1')
  .https.onRequest(async (req, res) => {
    const stories = await scrapeWebsite();
    res.type('html').send(stories.join('<br>'));
  });

exports.scrapingSchedule = functions.pubsub
  .schedule('every day 09:00') // App Engine cron syntax: run daily at 9:00
  .timeZone('America/New_York')
  .onRun(async (context) => {
    const stories = await scrapeWebsite();
    console.log('The NYT headlines are scraped every day at 9 AM EST', stories);
    return null;
  });
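
The schedule string accepts either App Engine syntax, as used above, or a standard unix-cron expression. If you prefer cron, the scheduled export could be written like this instead (a drop-in replacement for the version above, not an additional export):

// Same schedule in unix-cron syntax: minute 0, hour 9, every day
exports.scrapingSchedule = functions.pubsub
  .schedule('0 9 * * *')
  .timeZone('America/New_York')
  .onRun(async (context) => {
    const stories = await scrapeWebsite();
    console.log('The NYT headlines are scraped every day at 9 AM Eastern', stories);
    return null;
  });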

4. Deploy the Function

You can use the npm run serve command to run the function locally in the Firebase emulator and then visit the endpoint on localhost to see how it works. When you’re ready to move the function to the cloud, run npm run deploy.
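
Under the hood, the two npm scripts simply call the Firebase CLI. Roughly, the flow looks like this (the local URL shown is an example printed by the emulator; your project ID will differ):

npm run serve    # firebase emulators:start --only functions
                 # emulator prints something like http://localhost:5001/<project-id>/us-central1/scrape
npm run deploy   # firebase deploy --only functions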

[Screenshot: Puppeteer Firebase Function]

5. Test the Scheduled Function

Run npm run shell (a wrapper for firebase functions:shell) to open an interactive shell where you can manually invoke functions with test data. Type the function name, scrapingSchedule(), and hit Enter to run it.
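
A session might look like the sketch below; the exact prompt and output vary with your firebase-tools version:

npm run shell
firebase > scrapingSchedule()
# the scraped headlines appear in the function's log output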

[Screenshot: Firebase Functions Shell]

