Scraping Amazon.in

Roshan Khandelwal
17 min read · Dec 25, 2020

The entire site and not just parts of it.

TL;DR

For those who just want the final result

Product Pages — 224,671

Products — 6.45 million

Database size — 3.6 GB

Network Response — 22.15 GB

Total Time taken — 300 hrs

Average Scraping rate — 12.5 links / min

The scraping rate is awfully slow, as is the total time taken. But that is okay for an L&D project.

Where it all started

I wanted to know, just like that —

  1. How many unique products does Amazon ( India ) sell, across all of its categories?
  2. What would be the total Data consumed, if I were to access all product list pages?

There are no right answers here, and it's really just a snapshot. Maybe by the time the crawling program completes, a few more products will have been added.

I am not looking to compare prices across e-commerce sites, or even to compare prices on Amazon across days. It just started as an idea, and I chose to pursue it.

This is not really a post, but a narrative. I am writing it as I am exploring the options and adjusting my strategies on the go.
So, you might find contradictions. Things written towards the end might differ from those at the beginning. Some might feel out of place, out of order.
But I like this style. It allows for understanding the evolution of a project, instead of just dumping the final solution.

I will choose to be a Responsible Scraper

  1. I would first check for the existence of an API.
  2. Let me find robots.txt for amazon.in. That might explicitly indicate that crawling is not allowed (or allowed only partially, or the crawl rate is limited).
  3. I need to set an adequate crawl rate — one request every second. Amazon can probably handle way more than this, but I will stick to it.

and finally

I will identify my scraper bot via a legitimate user agent string.

This indicates that my intentions are, in fact, not malicious. In the user agent string, I would also link to this article, which explains why I need the data.
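
For Scrapy, that just means overriding the default USER_AGENT setting in settings.py. A minimal sketch, with an illustrative bot name and a placeholder URL of my own:

# settings.py
# Identify the bot and point to an explanation of the project.
# The string and URL below are placeholders, not the ones actually used.
USER_AGENT = (
    "amazon-catalog-study-bot/0.1 "
    "(+https://link-to-this-article.example; a personal L&D scraping project)"
)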

What am I going to do with the data?

What am I going to pick?
Just the title and description should be good enough, along with some id which uniquely identifies each product, the price and the rating.
I would be able to use these for a few charts I have thought of creating with the data.

Store it.

I could store it in Mongo — via their generous Atlas offering of 512 MB.
Or use Firebase — 1GB of data in their Free plan.

Eventually I might put the data into Elasticsearch — Bonsai's free sandbox cluster ( 125 MB, 10K documents ) — which creates a cluster in AWS US East. Take Amazon data and put it in AWS. No love lost.

A document is a JSON document which is stored in Elasticsearch. It is like a row in a table in a relational database. Each document is stored in an index and has a type and an id.

Well, with Elasticsearch, I could further create a search application — full-text search on title and description.

But the Free tier is too limited. I think Amazon has more than 10K products :)

I will pick Firebase for now.

What should I use for Scraping?

I searched. There are a lot of libraries.
Python — Scrapy ( technically not even a library… it's a complete web scraping framework ), Beautiful Soup
NodeJs — jsdom, Cheerio, Puppeteer, Playwright

and then there's UiPath. This might be overkill. UiPath is for automation, and I am not automating anything, just scraping.

UiPath uses a proprietary visual language. In addition, it supports VB.NET, a language that is rarely used otherwise.

What skills do I have?

I am a JS and Python developer. ( Sorry, I know Python; I have used it for a couple of personal projects. )

There are also companies out there who would do the data scraping for me — Luminati, and Outwit, with its somewhat creepy tagline

and there are thousands of worthy articles on how to use Scrapy for Web crawling.

I will take the path of least resistance — I will use Scrapy + Selenium. ( Links to all the articles that helped me are at the end — Credits )

Where would I be scraping from?

Local machine on Windows. Naa!!

Scrapy & Python, as such, seem to work best with Linux-based systems, and I am also inclined towards IDE-based development.
So I would be doing this from a VM on my local machine ( via Oracle VirtualBox ), running Ubuntu 19, and using VS Code as the IDE.

Final Stack

Subject — Amazon.in

Storage — Firebase

Crawler — Scrapy ( Python ) + Selenium

IDE — Visual Studio Code

Environment — Ubuntu 19 within Oracle VirtualBox

Step 1 — Check for existence of APIs (24th Nov, 2020)

There is an API — https://developer.amazonservices.com/ — Selling Partner API (SP-API)

  1. Private Developers — You can build applications for use with your own Amazon seller account. This requires you to register as a developer.
  2. Public Developers — You can build applications that sellers authorize and use to help manage their Amazon business. This requires you to register as a developer.

I am not looking to do any of this.

Amazon APIs collection on RapidAPI — there are two existing APIs which return product data ( and are pretty well updated ), but they are private APIs, nothing that is officially supported.

I will continue to look, but for now I will proceed with scraping the website.

Step 2 — Check for existence of robots.txt

The robots file is located at http://www.website.com/robots.txt. It lets search engine crawlers know what parts of your website you do not want them to crawl. It is the very first location of your website that a search engine will visit.

From — https://www.boostability.com/how-to-verify-that-you-have-the-proper-robots-txt-file/

You can read more about the specific sections of robots.txt in the Link above.

I read up on https://www.amazon.in/robots.txt — and

  1. User-agent: * — good, it means I am allowed.
  2. 147 lines of Disallow — /exec, /dp, /gp, etc.
    One of the confusing ones ☹ is */s?k=*&rh=n*p_*p_*p_

But since I will not be doing mindless scraping, and would just be looking at product list pages, I don't think I would be hitting any of these.
This should keep me in the good books of Amazon & hopefully still get all the products.
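
Python's built-in robotparser is a handy way to sanity-check individual URLs against those Disallow rules before wiring anything into the crawler. A small sketch, with example URLs of my own choosing:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.amazon.in/robots.txt")
rp.read()  # fetch and parse the rules

# A plain search/list page vs. a /gp path from the Disallow list.
print(rp.can_fetch("*", "https://www.amazon.in/s?k=books"))
print(rp.can_fetch("*", "https://www.amazon.in/gp/goldbox"))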

Step 3 — Adequate Crawl Rate

From https://sitebulb.com/resources/guides/how-to-crawl-responsibly-the-need-for-less-speed/

I came to know about CDOS, and I sure don't want to be responsible for CDOSing Amazon ( as if I could!! ). But this site mentions 5 URLs/sec.

Crawl the site slowly. By sticking to something like a 5 URL/s limit you are automatically throttling based on TTFB, view this as a positive.

I would probably do 2 URLs/sec

If I went at 2 URLs/sec, that would be 120 URLs a minute, 7,200 URLs an hour and 72,000 URLs in 10 hrs. That should be a lot of products.

Let’s see, provided I do not get blacklisted, before this :)
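
In Scrapy terms, that roughly maps to the settings below. A sketch; AutoThrottle is optional, but it automatically backs off when response times climb:

# settings.py
DOWNLOAD_DELAY = 0.5                 # at most ~2 requests/sec to the domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # no parallel hammering of amazon.in

AUTOTHROTTLE_ENABLED = True          # slow down further if responses get slow
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0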

Crawling Strategy — Understanding the Subject

Amazon.in is HUGE. It has lots of links, and surely needs a strategy, otherwise I might end up looking at the same links again & again.
I am also not looking to do mindless scraping, i.e. follow all the links on a page and read everything.

I would need to look past and ignore the UX, the Information Architecture, Taxonomy, Sales pitches, Machine Learning induced recommendations, Sponsored ads etc., to get to just the raw data.

🗹 LVL 1. Top Search Bar

has a dropdown, which lists all categories.

Of the 43 categories, I am picking just 27,
and am ignoring Deals, Alexa Skills, Under ₹500, Amazon Devices, Pantry, Fresh, Fashion, Apps, Movies, Music, Video games, Prime etc.
1. Under ₹500, Deals, Devices, Pantry, Fresh, Fashion, More store
— are simply ways of organizing information and presenting it as separate offerings. ( Information architecture )
2. Alexa Skills, Mobile Apps, Music, Prime, TV Shows — are software and media, and there is no limit to what can be offered.

Since these are the high level links, I will manually find these links and begin from here.

🗹 LVL 2.

All of these pages have a 'Show results for' section, which has 2nd-level nav links within the category. So I would create an array to store each of these links; this would be the array of all links at Lvl 2.

🗹 LVL 3.

Clicking on a 2nd-level link takes you to a third page, which, after all of the fanfare & recommendations — e.g. for Books: Most gifted, Hot new releases, Most wished for, Best sellers, Top rated etc. — might have a pagination bar that will allow me to view the subsequent sets of products.

This page also possibly has a list of level 3 links ( indented ), which further break up the offering into more categories.
These would be all the links to all the pages containing products that Amazon sells.

Things I am going to skip

✘ There is also a Refine By section, but I am not interested in that.

The category Under ₹500 is particularly interesting, since the value for this option is search-alias=under-ten-dollars, which, if converted to Indian rupees, would be ~₹750, yet it is labelled Under ₹500.

With the complexity involved, I decided to make a simplified Flowchart. Flowcharts are good. They help you to visualize the process.

Amazon has different HTML for its main pages and the subsequent result pages. Page 1 results are marked up with ul and li, whereas inside pages use div and div for lists ( semantic, anyone!! ).
The inside pages have probably not been touched in a long time.

Based on this research, I found a couple of inconsistencies in the way these pages are organized. One example: a probable mismatch between the pagination bar and the total results count.

Shouldn’t comment on this.

But, as a developer, I try to maintain consistency across pages, CSS classes etc. This pagination thing would surely have been filed as a bug, probably a critical one.
And here is a company that is okay with these inconsistencies.

Maybe it's just that, as a business, you need to focus on what you do best, use other tools and technologies for enablement, and not get stuck on them. Or maybe they are just working on it.

Lots of teams working on lots of pieces separately might sometimes just result in this.

Software Set Up

  1. Installed Oracle VirtualBox. Created a new VM ( 4 GB RAM, 21 GB VHD ) running Ubuntu 19, and installed Guest Additions to allow it to go full screen.
  2. Ubuntu 19 ships with both Python 3 and Python 2 pre-installed. To make sure that our versions are up-to-date, update and upgrade the system with apt-get
sudo apt-get update
sudo apt-get -y upgrade ( still goes only till Python 3.7.5, but is good enough for me. )
sudo apt-get install -y python3-pip ( Installed pip )
sudo apt-get install build-essential libssl-dev libffi-dev python-dev
sudo pip install scrapy

I didn’t bother creating a virtual env, since this VM exists only for scrapy.

scrapy shell amazon.in
view(response) - this opens the response in a browser.

3. Install Visual Studio Code ( Python tutorials )

4. Create a new file main.py

import scrapy
print(scrapy.__version__)  # => 2.4.1

Firebase

ProjectName — amazonScraper
Google Analytics setup — yes ( it’s free )

Step 1 — sudo pip3 install firebase-admin

Step 2 — To authenticate a service account and authorize it to access Firebase services, you must generate a private key file in JSON format.

Step 3 — Created a new Cloud Firestore, in Test mode ( to begin with )

and then added this in main.py

import firebase_admin
from firebase_admin import credentials, firestore
cred = credentials.Certificate("./scraper-4a069-firebase-adminsdk-p8pau-b79e08a748.json")
amazon_scraper_app = firebase_admin.initialize_app(cred)
db = firestore.client()
db_collection = db.collection('products')
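
Once the client is up, persisting a scraped product is a one-line call. A sketch of what a write might look like; the field names are my own choice, not a fixed schema:

# Hypothetical product payload, mirroring the fields I plan to scrape.
product = {
    "asin": "B0EXAMPLE99",
    "title": "Sample product title",
    "description": "Sample description",
    "price": 499.0,
    "rating": 4.2,
}

# add() auto-generates a document id; document(asin).set(product) would pin it to the ASIN instead.
db_collection.add(product)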

Selenium

https://selenium-python.readthedocs.io/getting-started.html

Step 1 — pip3 install selenium ( installs selenium-3.141.0 )

Step 2 — Install geckodriver for Firefox ( my browser of choice ).

wget https://github.com/mozilla/geckodriver/releases/download/v0.28.0/geckodriver-v0.28.0-linux64.tar.gz

tar -xvzf geckodriver*
chmod +x geckodriver
sudo mv geckodriver /usr/local/bin/

FROM — https://askubuntu.com/questions/870530/how-to-install-geckodriver-in-ubuntu

and added this in main.py — just a smoke test

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.amazon.in")
assert "Amazon" in driver.title
driver.close()

That wraps up the setup. I will now move on to the actual scraping code.

Coding Time

Phase 1 — Links Collection

scrapy startproject amazon_scraper

Spider 1 — get all lvl2 links

For each of the Lvl1 links, get links one level deep ( Lvl2 links ):
response.xpath("//*[@id='leftNav']//*[contains(@class, 's-ref-indent-one')]//li//a/@href").getall()
or
response.css("li.apb-browse-refinements-indent-2 span.a-list-item a").xpath("@href").getall()

27 Lvl1 links, yield 323 Lvl2 links

These are written to a file — lvl2_links.json
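
A trimmed-down sketch of what that spider looks like; the Lvl1 URL list is elided, and in the real run the items are exported with scrapy crawl lvl2_links -o lvl2_links.json:

import scrapy

class Lvl2LinksSpider(scrapy.Spider):
    name = "lvl2_links"
    start_urls = []  # the 27 hand-picked Lvl1 category URLs go here

    def parse(self, response):
        # 2nd-level nav links from the left-hand 'Show results for' section.
        links = response.xpath(
            "//*[@id='leftNav']//*[contains(@class, 's-ref-indent-one')]//li//a/@href"
        ).getall()
        for link in links:
            yield {"lvl1": response.url, "lvl2": response.urljoin(link)}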

Spider 2 — get all lvl3 links

Follow each Lvl2 link and check if it has Lvl3 links; get those, else assign the Lvl2 link as Lvl3 ( that category probably doesn't break up further ).
response.xpath("//*[@id='leftNav']//*[contains(@class, 's-ref-indent-two')]//li//a/@href").getall()
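
The only real wrinkle here is that fallback when a Lvl2 page has no deeper links. A sketch, assuming lvl2_links.json is the JSON array of items with a "lvl2" key produced by the previous spider:

import json
import scrapy

class Lvl3LinksSpider(scrapy.Spider):
    name = "lvl3_links"

    def start_requests(self):
        # Feed in the 323 Lvl2 links collected by the previous spider.
        with open("lvl2_links.json") as f:
            for item in json.load(f):
                yield scrapy.Request(item["lvl2"], callback=self.parse)

    def parse(self, response):
        lvl3_links = response.xpath(
            "//*[@id='leftNav']//*[contains(@class, 's-ref-indent-two')]//li//a/@href"
        ).getall()
        if lvl3_links:
            for link in lvl3_links:
                yield {"lvl2": response.url, "lvl3": response.urljoin(link)}
        else:
            # No further breakdown: the Lvl2 page itself becomes the scrapable Lvl3 link.
            yield {"lvl2": response.url, "lvl3": response.url}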

323 lvl2 links, yield 1959 lvl3 links, taking the total scrapable links to 2282.

These are written to a file — lvl3_links.json.
Overall, this entire process takes around 42 mins, mostly because of all the wait time added between requests.

Each of these 2282 links could potentially have hundreds of pages, with 20+ products listed on each page. ( Some of them might have no pages linked. )

Also, there might be a lot of repetition ( sales, sponsored, category distribution etc. ), but I will keep all repetitions. That gives a different kind of metric.

Given the scraping rate, and the fact that each of these links points to hundreds of other pages, I would not be running the entire operation in one shot. So I need some mechanism to keep track of what I have already covered.

Need a distributed scraping and storage strategy.

Phase 2 — Strategy Revision. ( Dec 6th )

I have been thinking about the options for data storage: Firebase, Mongo, S3. Given the time it will take to scrape and the sheer number of records,

I want to keep the storage simple and generic, and not commit to a specific provider at this point.

But I don't want to dump everything into a single file either.

  1. Pick the first link. Does it have results? Scrape them.
  2. Does it have a pagination bar? Yes — set this link to processing ( along with some way to identify which page of this link I am on ). No — mark it as complete.
  3. Follow the next page.
  4. Check if this was the last page of the originating link. Yes? Mark the originating link as complete.
  5. Continue — pick the next link.

The next time I start the process, it will look up the link. If complete, it moves on to the next in sequence; else it picks up the last processed page for that link and continues from there.

What should I use for maintaining this status?

I think a lightweight embedded DB — SQLite.

import sqlite3

conn = sqlite3.connect('products.db')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS LINK_STATUS(
    link TEXT NOT NULL,
    link_status CHAR(50),
    current_page_num INT,
    current_page_link TEXT)''')
conn.commit()
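
A sketch of how the bookkeeping could work against that table, using the conn and cur from above; the 'processing' / 'complete' status values are just my own convention:

def get_resume_point(link):
    """Return None if the link is already complete, else the page URL to resume from."""
    cur.execute("SELECT link_status, current_page_link FROM LINK_STATUS WHERE link = ?", (link,))
    row = cur.fetchone()
    if row is None:
        return link                      # never seen before, start from page 1
    status, current_page = row
    return None if status == 'complete' else current_page

def mark_progress(link, page_num, page_link, done=False):
    # Replace any previous status row for this link with the latest one.
    cur.execute("DELETE FROM LINK_STATUS WHERE link = ?", (link,))
    cur.execute(
        "INSERT INTO LINK_STATUS (link, link_status, current_page_num, current_page_link) "
        "VALUES (?, ?, ?, ?)",
        (link, 'complete' if done else 'processing', page_num, page_link),
    )
    conn.commit()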

Where should i store the products?

I did a bit of research.

Excel could work. Even though it allows for a million rows, there are other ways of handling much more data than this. ( Data Model )

Because Data Model is held in your computer memory rather than spreadsheet cells, it doesn’t have one million row limitation. You can store any volume of data in the model. The speed and performance of this just depends on your computer processor and memory.
https://chandoo.org/wp/more-than-million-rows-in-excel/

CSV, JSON, Parquet

Parquet is a columnar file format whereas CSV is row based. Columnar file formats are more efficient for most analytical queries.

But again, I will be using SQLite. It is local, free and still gives me all the capabilities of viewing and querying. ( Later on, I will decide whether to move the data to some other storage or managed service. )

Single table or multiple tables for products?

With 2282 links, I would distribute the products across 10 tables ( based on modulus 10 of the originating link's index ). This might not be the best way to break things up; I could end up with skewed tables.

But it would still be better than dumping everything into one table. ( Based on the results, maybe I would round-robin next time. )

for i in range(1, 11):
    table_name = 'PRODUCTS_{}_DATA'.format(i)
    cur.execute("CREATE TABLE IF NOT EXISTS " + table_name +
                " (link_url TEXT NOT NULL, data json, origin_url TEXT NOT NULL)")
conn.commit()
print("PRODUCTS tables created successfully")

I also came across another project, Scrapyd, which I could use to run my spiders: multiple instances of the spider, each passed a separate chunk of links.
And if I do get banned, I will look at one of the alternatives:

Use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh. An open source alternative is scrapoxy, a super proxy that you can attach your own proxies to.
https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

Phase 3 — Product collection and Saving to file

Pilot Run — 1

I started with 10 links, since that would put products into every one of the individual tables.
It's small enough, and yet big enough, to test all combinations.

Captcha — Challenge

I did run into a lot of issues, including the Captcha. Fetching via Scrapy gives out responses like this.

There are solutions to this. Here and Here.

I chose a different approach, however: Scrapy first; if no products, try Selenium, and pass the response back to Scrapy.
It worked out well.
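
A sketch of that fallback: if the Scrapy response has no product nodes ( usually the Captcha page ), fetch the same URL through Selenium and wrap the rendered HTML in an HtmlResponse so the normal selectors keep working. The product selector here is simplified:

from scrapy.http import HtmlResponse
from selenium import webdriver

driver = webdriver.Firefox()

def ensure_products(response):
    """Return a response that contains product nodes, falling back to Selenium."""
    if response.css("div.s-result-item"):
        return response
    # Likely a Captcha challenge: load the same URL in a real browser
    # and hand the rendered HTML back as a Scrapy response.
    driver.get(response.url)
    return HtmlResponse(
        url=driver.current_url,
        body=driver.page_source,
        encoding="utf-8",
    )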

Stats — Pilot

  1. These 10 Lvl3 links pointed to a total of 2029 product pages, with an average of 20 products per page, giving out ~40,500 products.
  2. For these 2029 pages, the code encountered 29 Captcha challenges.
  3. The code ran for around 2hrs.

4. Data Table distribution — This totals to 2072. ( which is 43 more than 2029).

29 of these 43 are Captcha responses; even though they resulted in zero products, they were still persisted. The rest ( 16 ) come from repeated runs.

Extrapolating the time taken for 10 links ( ~120 mins ) to 2282 links, the entire scraping process should take around 28,000 mins ~ 466 hrs.

Pilot Run — 2

Increasing the total to 100 links.

Stats

  1. These 100 Lvl3 links pointed to a total of 19,675 product pages, with an average of 20 products per page, giving out 3,93,500 products.
  2. For these 19,675 pages, the code encountered 357 Captcha challenges.
  3. The code ran for around 14 hrs 30 mins.
  4. Data Table distribution
Created with https://spark.adobe.com/express-apps/chart/

Extrapolating the time taken for 100 links ( ~16 hrs ) to 2282 links, the entire scraping process should take around 22,000 mins ~ 365 hrs.

There is definitely a reduction of almost 100 hrs, and the distribution isn't skewed.

Final Run

Time to run it for the entire lot of 2282.

  1. Knowing that it would run through a couple of days, I decided to run it through the day [ ~9 am to ~12 midnight ], and let the system sleep over the night.
  2. I did kind of Ctrl + C the scraper a few times, so the entire time is broken into a few runs.
  3. Lost the time stats for the 10th to the 15th, since one of my kids just switched off the system. It's all in the log files, just not the start and end times. The system, however, ran through the day & slept through the night ( probably ~15 hrs every day ).
  4. Some of these links had pagination bars with a 'See More Results' link. This is something that I hadn't accounted for.
  5. There are 58 links that haven't moved to completion, even after multiple retries. I would need to look at them separately.

So, taking all of this into account, the entire process took 10,18,320 secs.

Stats

  1. The 2282 Lvl3 links pointed to a total of 2,24,671 unique product pages, which have a total of 73,53,376 products listed ( ~7.3 million ), with an average of 32 products per page.
  2. The entire database is 3.6 GB, so anything free ( Mongo, Firebase etc. ) would not work.
  3. The code ran for around 282 hrs.
  4. All of these product pages also have sponsored products, which are repeated across all of the pages for a particular Lvl3 link. So at ~4 products per page ( a conservative estimate ), around 900,000 products would be repeated.

Product Pages — 224,671

Products — 6.45 million

Database size — 3.6 GB

Network Response — 22.15 GB ( doesn’t include the image sizes )

Total Time taken — 300 hrs

Average Scraping rate — 12.5 links / min

These are approximate figures, and the products might not be unique yet; they might be repeated across links.

What's Next

Now, equipped with the knowledge that there are 2.24 lakh product links, and that my scraping rate has been 12.5 links / min, I realize 300 hrs (12.5 days) is a huge amount of time.

I could certainly remove a few sleep statements. If I were to achieve the intended scraping rate of 2 links / sec, the entire thing would complete in 31 hrs.

Log Files ( 298 MB )

Looking at these would help me uncover new things. Failed requests, retries, captcha challenges.

Enhancements

Move this data from the current SQLite across to probably an Elastic stack, run a couple of visualizations ( Kibana ), understand search, and power a text-only version of the data ( for dev & learning purposes only ). No images.

If I were to run the scraping again, I would probably split these links across multiple spiders, and maybe take a couple of machines on AWS.

22 On-Demand machines ( t4g.small ), each scraping 10,000 links at the current dismal rate of 12.5 links/min and kept running for 15 hrs @ $0.0168/hr, would cost around $5.54 ~ ₹408 / run.

It could be a lot cheaper if it were running at a link per second, which would be just 3 hrs per machine, totalling $1.1 ~ ₹81 / run.
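
For anyone checking the arithmetic behind those two figures ( rounded the same way as above ):

links_per_machine = 224_671 / 22        # ~10,212, i.e. roughly 10,000 links each
slow_hours = 10_000 / 12.5 / 60         # ~13.3 hrs at 12.5 links/min, padded to ~15 hrs
print(22 * 15 * 0.0168)                 # ~5.54 dollars per run

fast_hours = 10_000 / 60 / 60           # ~2.8 hrs at 1 link/sec, padded to ~3 hrs
print(22 * 3 * 0.0168)                  # ~1.11 dollars per run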

It does give me an insight into how comparison sites might work. Scan once, scan often, keep updating your local copy, enable a quick search, and you might just make a product that can sell.

Finally, if you have reached here and are thinking these could just be made-up numbers, well, you will have to wait some time until I work on the enhancements and maybe publish a public link.

For now, this video of the log file should help. https://www.loom.com/share/e5f3cef9a8224da7af112df7334b9734

This has all of the 224,671 links, scrapy runs, errors, retries everything.

That would be all for the moment. I will work on the data and publish something more in a separate post.

Here is an AIRTABLE for all the links that helped me. https://airtable.com/shrd1QW0hPxm0NtdZ
