
How to scrape Amazon products using Node.js and Puppeteer


This article is a step-by-step guide on how to scrape Amazon products using Node.js and Puppeteer. It starts by setting up a Node.js project and installing Puppeteer, a library that controls headless Chrome or Chromium browsers. The tutorial then walks through creating a script to search for products on Amazon and extract information such as title, price, and image URL.


Have you ever wanted to extract data from Amazon's vast database of products for market research, price monitoring, or competitive analysis?

In this tutorial, we'll explore how to use Node.js and Puppeteer, a popular library for controlling headless Chrome or Chromium, to scrape some random Amazon products.

Before going any further, it is important to note that Amazon may use various techniques to prevent or detect scraping, such as captchas or IP blocking.

As such, it is important to use scraping responsibly and ethically to avoid being blocked :)

With that in mind, scraping Amazon, or any other website, can be a useful way to gather data on products, prices, and customer reviews. This information can be used for market research, competitive analysis, or price monitoring. It can also be used to automate certain actions, such as buying a product when its price drops below a given threshold.

1. Set up a Node.js project

First, you need to install Node.js on your computer. You can download it from the official website: https://nodejs.org/en/download/

Once Node.js is installed, open a terminal or command prompt window and navigate to the directory where you want to create your project.

```bash
mkdir amazon-puppeteer && cd amazon-puppeteer
```

Inside that directory, run the following command to create a new Node.js project:

```bash
npm init
```

Follow the prompts to set up your project (or run npm init -y to accept the defaults). This will create a package.json file in your project directory.

That's it! Your Node.js project is ready!

2. Install Puppeteer

Puppeteer is a Node.js library for controlling headless Chrome or Chromium. It will allow you to simulate a web browser and interact with web pages programmatically.

To install Puppeteer, run the following command in your project directory:

```bash
npm install puppeteer
```

3. Write the script

Now, let the serious stuff begin. Create a new file in your project directory called scrape.js. This will contain the script for scraping Amazon products. In this new file, import Puppeteer and create an async function called scrapeProducts:

```javascript
const puppeteer = require('puppeteer');

const scrapeProducts = async () => {
  // your code goes here
};
```

Within the scrapeProducts function, launch a new instance of Puppeteer and create a new page:

```javascript
const browser = await puppeteer.launch();
const page = await browser.newPage();
```
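By default, puppeteer.launch() runs Chromium in headless mode. While developing, it can help to actually watch what the script does. Here is a minimal sketch using two real launch options, headless and slowMo (the values are just examples):

```javascript
// Optional, for debugging: open a visible browser window and slow
// every Puppeteer action down by 50 ms so you can follow along
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 50,
});
```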

Use the page.goto method to navigate to the Amazon website and search for a product. For example, to search for "JavaScript book", you can use the following code:

```javascript
await page.goto('https://www.amazon.com/');
await page.type('#twotabsearchtextbox', 'JavaScript book');
// Click the search button and wait for the results page together,
// otherwise the navigation can finish before waitForNavigation starts listening
await Promise.all([
  page.waitForNavigation(),
  page.click('#nav-search-submit-text'),
]);
```

Next, use the page.evaluate method to extract data from the search results. For example, to extract the title, price, and image URL of the products, you can use the following code:

```javascript
const products = await page.evaluate(() => {
  const results = [];
  const items = document.querySelectorAll(".s-result-item .sg-col-inner");
  for (const item of items) {
    const title = item.querySelector("h2 > a > span");
    const price = item.querySelector(".a-price-whole");
    const cents = item.querySelector(".a-price-fraction");
    const image = item.querySelector("img");
    // Skip incomplete entries (sponsored slots, widgets, etc.)
    if (!title || !price || !cents || !image) continue;
    results.push({
      title: title.innerText,
      // Keep the cents string intact so "05" doesn't become "5"
      price: parseFloat(`${parseInt(price.innerText)}.${cents.innerText}`),
      image: image.getAttribute("src"),
    });
  }
  return results;
});

console.log(products);
```

Finally, close the browser:

```javascript
await browser.close();
```
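Note that if anything throws before this line (a missing selector, a navigation timeout), the browser process will stay alive. A minimal sketch of a more defensive structure, wrapping the scraping logic in try/finally:

```javascript
const scrapeProducts = async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // ... navigation and extraction logic from the previous steps ...
  } finally {
    // Runs even if scraping throws, so no Chromium process is left behind
    await browser.close();
  }
};
```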

The entire script should look like this:

```javascript
const puppeteer = require("puppeteer");

const scrapeProducts = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.amazon.com/");
  await page.type("#twotabsearchtextbox", "JavaScript book");
  // Click and wait for the results page together to avoid a race
  await Promise.all([
    page.waitForNavigation(),
    page.click("#nav-search-submit-text"),
  ]);
  const products = await page.evaluate(() => {
    const results = [];
    const items = document.querySelectorAll(".s-result-item .sg-col-inner");
    for (const item of items) {
      const title = item.querySelector("h2 > a > span");
      const price = item.querySelector(".a-price-whole");
      const cents = item.querySelector(".a-price-fraction");
      const image = item.querySelector("img");
      // Skip incomplete entries (sponsored slots, widgets, etc.)
      if (!title || !price || !cents || !image) continue;
      results.push({
        title: title.innerText,
        // Keep the cents string intact so "05" doesn't become "5"
        price: parseFloat(`${parseInt(price.innerText)}.${cents.innerText}`),
        image: image.getAttribute("src"),
      });
    }
    return results;
  });
  console.log(products);
  await browser.close();
};

scrapeProducts();
```

Keep in mind that, in our example, we use the DOM to scrape products. It is often possible to scrape a website using an API instead of scraping the DOM directly. Many websites offer APIs (Application Programming Interfaces) that allow developers to access data in a more structured and organized way.

Using an API to access data can be more reliable and efficient than scraping the DOM directly, as it provides a more stable and predictable way to access the data. APIs typically provide access to data in a machine-readable format, such as JSON or XML, which can be easily parsed and manipulated by software.

And if, for some reason, you can't call an API directly (with Axios or Fetch, for example), you can always intercept the data. Puppeteer provides the page.setRequestInterception method, which allows you to intercept and modify requests made by the page. When you set up request interception, Puppeteer calls a callback function for each request made by the page, and you can modify the request or block it entirely. This can be useful, for example, if you want to block all images to make your scraper more efficient.
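As a minimal sketch of that last idea, here is what blocking all image requests could look like; setRequestInterception, resourceType, abort, and continue are the actual Puppeteer APIs, and interception must be enabled before calling page.goto:

```javascript
// Enable interception before any navigation so every request is caught
await page.setRequestInterception(true);
page.on("request", (request) => {
  if (request.resourceType() === "image") {
    request.abort(); // skip images entirely
  } else {
    request.continue(); // let everything else through
  }
});
```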

Well! Anyway! We have done the hardest part now! Let's run our script!

4. Run and Scrape :)

Run the following command, still inside your project directory, to execute the scrape.js file:

```bash
node scrape.js
```

The output should be an array of objects containing the title, price, and image URL of the products that match the search query:

```javascript
[
  {
    title: 'Data Structures and Algorithms with JavaScript: Bringing classic computing approaches to the Web',
    price: 32.57,
    image: 'https://m.media-amazon.com/images/I/71wQtUgJMDL._AC_UL320_.jpg'
  },
  ...
]
```

That's it! You now know how to scrape Amazon products using Node.js and Puppeteer.

5. Don't be too rough with scraping

There are several reasons why web scraping can be a risky endeavor:

  1. Risk of getting blocked by the website: Many websites have measures in place to prevent web scraping, such as rate limiting, IP blocking, or captchas. If a website detects a high volume of requests coming from a single IP address or user agent, it may assume the traffic is automated and block it. This can result in your scraping attempts being thwarted, and potentially even legal action if done without permission. Throttling your requests helps here (see the sketch after this list).
  2. Impact on website performance: Web scraping can put a strain on the resources of the website being scraped. If a scraper is sending too many requests or scraping too frequently, it can slow down the website for other users or even cause it to crash. This can result in negative consequences for both the scraper and the website.
  3. Website DOM changes: Websites are constantly evolving and changing, and as such, the HTML and CSS code that makes up a website's DOM can change over time. This can result in the scraper not being able to locate or extract the desired data, which can lead to errors and wasted time.
  4. Inaccurate data: Even if a scraper is able to extract data successfully, the data may not always be accurate or up-to-date. This can be due to a variety of reasons, such as errors in the scraper code, differences in data formatting, or changes to the website's data over time. As such, it's important to be mindful of the potential limitations and caveats of the data being scraped, and to use it in conjunction with other sources of information for maximum accuracy.
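To address the first point, the simplest mitigation is to slow down and space out your requests. Here is an illustrative sketch; the sleep helper, the query list, and the delay values are assumptions rather than part of Puppeteer, and the /s?k= URL is how Amazon search queries are commonly addressed:

```javascript
// Illustrative helper, not part of Puppeteer: pause for a given time
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Scrape several search queries sequentially, with a polite, randomized delay
const scrapeManyQueries = async (page, queries) => {
  for (const query of queries) {
    await page.goto(`https://www.amazon.com/s?k=${encodeURIComponent(query)}`);
    // ... extract products with page.evaluate as shown earlier ...
    await sleep(2000 + Math.random() * 3000); // wait 2 to 5 seconds
  }
};
```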

Puppeteer Documentation: https://pptr.dev/

PS: This article was written in March 2023 (edited November 2023). Depending on how Amazon's DOM evolves, the script may no longer work by the time you read this. If so, please let me know.