Web scraping is the process of downloading a web page and extracting content from it. Here, we’ll use the New York Times website as our source of information: a scraper takes the top 10 headlines from the home page and displays them. The scraping is performed with Puppeteer, which drives a headless Chrome browser, and the web application is hosted on Firebase.
Initialize a Firebase Function
If you have already set up a Firebase project, you can use the following commands to initialize Cloud Functions in a local environment:
mkdir scraper
cd scraper
npx firebase init functions
cd functions
npm install puppeteer
Follow the onscreen instructions to get the project up and running. The last command installs the puppeteer NPM package, which is required to use the Puppeteer headless browser.
Create a Node.js Application
Create a new pptr.js file in the functions folder to hold the application code for scraping the page. To speed up downloads, the script downloads only the page’s HTML and blocks other resources such as scripts, stylesheets, images, videos, and fonts.
Headlines on the page that are contained within an h3 element are selected using a CSS selector. You can find a suitable selector for the headlines using Chrome Dev Tools.
const puppeteer = require('puppeteer');

const scrapeWebsite = async () => {
  let stories = [];
  const browser = await puppeteer.launch({
    headless: true,
    timeout: 20000,
    ignoreHTTPSErrors: true,
    slowMo: 0,
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--window-size=1280,720',
    ],
  });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });
    // Block scripts, stylesheets, images, media, and fonts from downloading
    const blockResources = ['script', 'stylesheet', 'image', 'media', 'font'];
    await page.setRequestInterception(true);
    page.on('request', (interceptedRequest) => {
      if (blockResources.includes(interceptedRequest.resourceType())) {
        interceptedRequest.abort();
      } else {
        interceptedRequest.continue();
      }
    });
    // Change the user agent of the scraper
    await page.setUserAgent(
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
    );
    await page.goto('https://www.nytimes.com/', {
      waitUntil: 'domcontentloaded',
    });
    // CSS selector that matches the h3 headline elements
    const storySelector = 'section.story-wrapper h3';
    // Only get the top 10 headlines
    stories = await page.$$eval(storySelector, (headings) =>
      headings.slice(0, 10).map((heading, index) => `${index + 1}. ${heading.innerText}`)
    );
  } catch (error) {
    // On failure, log the error and fall through to return an empty list
    console.error(error);
  } finally {
    if (browser) {
      await browser.close();
    }
  }
  return stories;
};

module.exports = scrapeWebsite;
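The two decisions the scraper makes, which requests to abort and how to number the headlines, do not depend on a browser, so they can be tried out on their own. The following is a minimal sketch of that logic; the names BLOCKED_RESOURCES, shouldBlock, and formatHeadlines are hypothetical helpers introduced for illustration and are not part of pptr.js or of Puppeteer.

```javascript
// Resource types the scraper refuses to download (same list as in pptr.js).
const BLOCKED_RESOURCES = ['script', 'stylesheet', 'image', 'media', 'font'];

// Hypothetical helper: true when a request of this type should be aborted.
const shouldBlock = (resourceType) => BLOCKED_RESOURCES.includes(resourceType);

// Hypothetical helper: keep the first `limit` headlines and number them from 1.
const formatHeadlines = (texts, limit = 10) =>
  texts.slice(0, limit).map((text, index) => `${index + 1}. ${text}`);

console.log(shouldBlock('image')); // true
console.log(shouldBlock('document')); // false
console.log(formatHeadlines(['First story', 'Second story', 'Third story'], 2));
// [ '1. First story', '2. Second story' ]
```

Keeping these rules in plain functions makes it possible to check them without launching Chrome; the page HTML itself has the resource type document, so it is never blocked.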
Write the Firebase Function
Import the scraper function into the index.js file and export it as a Firebase function. In addition, we define a scheduled function that runs daily and calls the scraper.
Because Chrome with Puppeteer is resource-intensive, it is important to raise the function’s memory and timeout limits.
// index.js
const functions = require('firebase-functions');
const scrapeWebsite = require('./pptr');

// HTTP endpoint that runs the scraper and returns the headlines as HTML
exports.scrape = functions
  .runWith({
    timeoutSeconds: 120,
    memory: '2GB', // use '512MB' for a smaller allocation
  })
  .region('us-central1')
  .https.onRequest(async (req, res) => {
    const stories = await scrapeWebsite();
    res.type('html').send(stories.join('<br>'));
  });

// Scheduled function that runs the scraper once a day
exports.scrapingSchedule = functions.pubsub
  .schedule('every day 09:00')
  .timeZone('America/New_York')
  .onRun(async (context) => {
    const stories = await scrapeWebsite();
    console.log('The NYT headlines are scraped every day at 9 AM Eastern Time', stories);
    return null;
  });
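Since the function is cut off once timeoutSeconds elapses, it can help to give up on a slow scrape slightly earlier and respond with an error instead. The following is a minimal sketch of one way to do that with Promise.race; withTimeout is a hypothetical helper, not part of firebase-functions or Puppeteer, and the demo uses a stand-in task rather than the real scraper.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms` milliseconds.
const withTimeout = (promise, ms) => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; clear the timer afterwards so it cannot linger.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};

// Demo with a stand-in task (assumption: it plays the role of scrapeWebsite()).
const slowTask = new Promise((resolve) => setTimeout(() => resolve(['1. Example']), 50));
withTimeout(slowTask, 1000)
  .then((stories) => console.log(stories))
  .catch((error) => console.error(error.message));
```

Inside the HTTP handler this might be called as withTimeout(scrapeWebsite(), 100000), keeping the limit below timeoutSeconds. Note that the helper only stops waiting; it does not cancel the underlying scrape, so the browser may keep running until it closes on its own.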
Deploy the Function
To test the function locally, run the npm run serve command from the functions folder and navigate to the function endpoint on localhost. When you are ready to deploy the function to the cloud, run npm run deploy.
Test the Scheduled Function
To test the scheduled function locally, run npm run shell to open an interactive shell for invoking functions manually with test data. At the prompt, type scrapingSchedule() and press Enter to see the function output.