Web Scraping with the Right Methods and Tools

Shubham Mohape
Python in Plain English
9 min read · Oct 10, 2023


(Web Scraping text art from Adobe Firefly)

Web scraping refers to the automated gathering of data from websites, like a bot that navigates the internet and records the desired information. Suppose you want to monitor your competitors' product prices to adjust your own pricing strategy. Web scraping lets you automate the collection of pricing information from their websites so you can make an informed decision about your own product pricing.

Before taking up any web scraping project, it's crucial for programmers to acquaint themselves with the best practices currently followed in the industry. This blog will therefore explore how to assess the feasibility of a web scraping task and provide insights into how programmers can work around the anti-scraping measures implemented by websites.

By adhering to these best practices, programmers can make informed decisions about the feasibility of the task and move forward with their web scraping projects efficiently.

Section A: Assessing the Feasibility of Web Scraping

I follow specific technical steps to assess the feasibility of scraping data from a website.

Step 1: Familiarity with Web Inspection Tools

Familiarity with the Inspect and View Page Source options is crucial when assessing the feasibility of web scraping from a website.

(Web Inspection Tools)

View Page Source shows the static code of a web page as it was initially loaded.

Inspect lets you interactively examine and modify the live, rendered version of a web page, including its HTML, CSS, and JavaScript.

Step 2: Using BeautifulSoup to Parse Web Data

If you've located the data you need in the 'View Page Source' of a website, you've already completed half of your task. Simply perform a standard GET request to the website and then use BeautifulSoup (bs4) to transform the response text into a structured HTML parse tree.

import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.in/s?k=laptop"

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
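
As a quick feasibility check, you can run a CSS selector over the soup to confirm that the data you saw in View Page Source actually arrived in the response. This is a minimal sketch; the 'a-text-normal' class used below is only an illustrative selector and may differ on the live page:

# Count how many candidate title elements the selector finds
titles = soup.select("span.a-text-normal")
print(f"Found {len(titles)} candidate title elements")
for t in titles[:5]:
    print(t.get_text(strip=True))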

Step 3: Exploring the Inspect Page

In certain situations, the data you need for web scraping might not be readily available in the View Page Source section of a webpage. In such cases, you'll need to switch to the Inspect page, where you'll find various tabs, including Elements, Performance, Network and others. However, we only need the Elements and Network tabs for our scraping purposes.

Start by confirming that the information you couldn't see in View Page Source is now visible in the Elements tab of the Inspect page.

Let's now look at how to retrieve such hidden information.

Step 4: Retrieving Data from the Network Tab

To begin, navigate to the Network tab and clear the network log. Afterward, refresh the page. In some instances, you may need to scroll through the log to capture the entire dataset.

The approach I follow is straightforward: I search for the specific data keywords in the Search tab. In my case, I’ve found the desired results in the general All tab. Occasionally, the data may be located in the Fetch/XHR tab, but we’ll discuss that aspect later. For now, let’s concentrate on the extracted content.

Now use Copy > Copy as cURL (bash) on that specific request to extract its cURL command in bash format.

(Locating Data in Network tab)

Now that you have the cURL content for the necessary request containing the desired data, you can proceed with further processing. For this purpose, you can opt for either Postman or curlconverter.

Step 5: Further Processing in Postman

I particularly favor Postman because it offers a wider array of options for conducting in-depth analysis on the request.

Next, paste the cURL request into the Raw Text tab within Postman. Once you’ve done that, click Continue, and you’ll be presented with both the Request and Response tabs specifically for that particular cURL request.

(Postman Interface for the cURL request)

Now, onto the most intricate part. If you receive a 200 response at this stage, you can transform the request components into code using the code-generation features available in Postman.
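
Whichever route you take, Postman's snippet generator and curlconverter both emit a plain requests call. A rough, hedged sketch of what that output tends to look like; the header and cookie values below are placeholders, not values from a real request:

import requests

# Placeholder values copied from the cURL command; yours will differ
cookies = {"session-id": "XXXXXXXXXX"}
headers = {
    "User-Agent": "Mozilla/5.0 (...)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.amazon.in/",
}

response = requests.get("https://www.amazon.in/s?k=laptop",
                        headers=headers, cookies=cookies)
print(response.status_code)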

Step 6: Cleaning Up HTTP Request Headers

However, before proceeding, I recommend performing some cleanup, particularly in the Headers section. There might be extraneous header keys that are unnecessary for the request or could potentially hinder us from sending multiple requests efficiently.

In the HEADERS tab of an HTTP request, you'll encounter a variety of headers, with some of the most common ones including Cookie, User-Agent, Accept, and Accept-Language. These headers play crucial roles in specifying preferences, identifying the client, and providing information about the requested content and its language.

For more detail, refer to the following link: HTTP headers

You can streamline your HTTP request headers by initially including all of them and gradually trimming them down through trial and error. Over time, you'll gain insights into which headers are essential and which can be omitted. As you remove them, always verify that the returned data remains correct to ensure your requests keep working.
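
In practice the trimmed-down set often ends up being only a handful of headers. A hedged sketch of what that might look like after trial and error (the exact minimal set varies from site to site):

# A possible minimal header set after trimming; re-check that the response
# still contains the expected data each time you remove a header
headers = {
    "User-Agent": "Mozilla/5.0 (...)",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://www.amazon.in/s?k=laptop", headers=headers)
print(response.status_code)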

Step 7: Processing Response Data

To process the response data, you have a couple of options. For HTML content, you can use BeautifulSoup (often referred to as bs4) to create a parse tree and employ CSS selectors to extract the desired data. Alternatively, you can use lxml to construct a tree and then use XPath to perform the same extraction; lxml handles both HTML and XML. These techniques offer powerful ways to navigate and extract information from the response data, whatever its format. The code below parses HTML content with XPath:

import requests
from lxml import html

url = "https://www.amazon.in/s?k=laptop"

response = requests.get(url)

if response.status_code == 200:
    # Build an lxml tree from the HTML and query it with XPath
    page_content = html.fromstring(response.text)

    xpath_expression = "//span[@class='a-text-normal']"

    product_titles = page_content.xpath(xpath_expression)

    for title in product_titles:
        print(title.text_content())
else:
    print("Failed to retrieve the web page.")
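
For comparison, here is the BeautifulSoup/CSS-selector route for the same page, sketched under the assumption that the same 'a-text-normal' class marks the titles:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.amazon.in/s?k=laptop")

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    # CSS selector equivalent of the XPath expression above
    for title in soup.select("span.a-text-normal"):
        print(title.get_text(strip=True))
else:
    print("Failed to retrieve the web page.")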

Section B: Scraping via API

In some cases, websites may not provide their data directly in HTML format. Instead, they prefer to deliver their data through an API, which the web page utilizes to fetch and display information. When you inspect the network activity of such websites, you can often find the data exchange in the Fetch/XHR tab, where the responses are typically in JSON format.

Whatever the case, the subsequent process remains the same: copy the cURL (bash) content, paste it into Postman, and continue as before.
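
The only real difference is that the response body is JSON, so there is no HTML to parse. A minimal sketch, assuming a hypothetical JSON endpoint found in the Fetch/XHR tab (the URL and field names below are illustrative, not real):

import requests

# Hypothetical endpoint discovered in the Fetch/XHR tab
api_url = "https://www.example.com/api/products?query=laptop"
headers = {"User-Agent": "Mozilla/5.0 (...)", "Accept": "application/json"}

response = requests.get(api_url, headers=headers)

if response.status_code == 200:
    data = response.json()  # parse the JSON body into Python objects
    for item in data.get("products", []):
        print(item.get("title"), item.get("price"))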

Section C: Tools - Selenium, Playwright, and Scrapy

When it comes to web scraping and automation, choosing between tools like Selenium, Playwright, and Scrapy depends on your specific needs and the nature of the task at hand.

Selenium and Playwright are invaluable tools when dealing with scenarios that go beyond static HTML or JSON data retrieval. Here are some reasons to consider using them:

  1. Dynamic Websites: Selenium and Playwright excel at handling dynamic websites where content changes based on user interactions.
  2. Interacting with Web Page Elements: These tools provide a user-friendly way to interact with web page elements, such as clicking buttons, filling out forms, and extracting data.
  3. Cross-Browser Compatibility: Selenium is particularly known for its cross-browser compatibility, allowing you to automate tasks across different web browsers.
  4. Background Automation: You can use Selenium and Playwright to scrape content in the background without displaying a visible browser window, which is crucial for unobtrusive automation.

Here's a simple Python code snippet using Selenium to open an Amazon search page and print the product titles and prices:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4.6+ downloads and manages chromedriver automatically; on older
# versions, pass a Service object pointing at your chromedriver executable
driver = webdriver.Chrome()

driver.get("https://www.amazon.in/s?k=Mobiles")


wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "a-size-medium.a-color-base.a-text-normal")))

mobile_titles = driver.find_elements(By.CLASS_NAME, "a-size-medium.a-color-base.a-text-normal")
mobile_prices = driver.find_elements(By.CLASS_NAME, "a-price-whole")

for title, price in zip(mobile_titles, mobile_prices):
    print("Mobile Title:", title.text)
    print("Price (INR):", price.text)

# Close the browser
driver.quit()

To get started with Selenium in Python, refer to the Selenium Python Documentation, and for Playwright, check the Playwright Python Documentation.
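
Since Playwright is mentioned alongside Selenium, here is a rough equivalent of the snippet above using Playwright's sync API in headless mode; the selectors are the same assumptions as before:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch Chromium without a visible window (headless)
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.amazon.in/s?k=Mobiles")

    titles = page.locator("span.a-size-medium.a-color-base.a-text-normal").all_text_contents()
    prices = page.locator("span.a-price-whole").all_text_contents()

    for title, price in zip(titles, prices):
        print("Mobile Title:", title)
        print("Price (INR):", price)

    browser.close()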

Now, let’s delve into why Scrapy might be your preferred choice in certain situations:

  1. Efficiency: Scrapy is a faster option for large-scale web scraping tasks because it issues requests asynchronously and handles multiple operations simultaneously, while Selenium has to drive a full web browser.
  2. Structured Data: Scrapy is well-suited for collecting structured and organized data, making it an excellent choice for tasks like extracting product information or articles.
  3. Website Interaction: If you only require data extraction and not interactions with the website (e.g., form submissions), Scrapy provides a more streamlined solution.
  4. Respecting Rules: Scrapy can respect robots.txt and throttle itself through its settings (see the sketch after this list), aiding in ethical web scraping. Selenium doesn't adhere to these rules automatically.
  5. Scalability: Scrapy is designed to handle numerous scraping tasks concurrently, making it ideal for large-scale projects where Selenium may slow down.
  6. Headless Mode: Scrapy can operate in the background without a visible web browser, making it efficient for running on servers. Selenium can achieve this as well but with less efficiency.
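
As a hedged illustration of those points, here are a few commonly used options from a Scrapy project's settings.py; the values are examples, not recommendations:

# settings.py (excerpt) - example values only
ROBOTSTXT_OBEY = True          # respect robots.txt rules
DOWNLOAD_DELAY = 0.5           # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 16       # how many requests run in parallel
AUTOTHROTTLE_ENABLED = True    # adapt the delay to server response times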

Here's a basic example of a Scrapy spider that scrapes data from the Amazon mobile category page:

import scrapy

class AmazonMobilesSpider(scrapy.Spider):
    name = "amazon_mobiles"
    start_urls = ["https://www.amazon.in/s?k=Mobiles"]

    def parse(self, response):
        for product in response.css("div.s-result-item"):
            title = product.css("span.a-size-medium.a-color-base.a-text-normal::text").get()
            price = product.css("span.a-price-whole::text").get()

            if title and price:
                yield {
                    "Title": title,
                    "Price (INR)": price,
                }

        # Follow the pagination link, if there is one
        next_page = response.css("li.s-pagination-next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
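
To run the spider outside of a full Scrapy project, one option is Scrapy's CrawlerProcess. A minimal sketch, assuming the spider class above is defined in the same file; the FEEDS setting writes the scraped items to a JSON file:

from scrapy.crawler import CrawlerProcess

# Run the spider programmatically and export items to mobiles.json
process = CrawlerProcess(settings={"FEEDS": {"mobiles.json": {"format": "json"}}})
process.crawl(AmazonMobilesSpider)
process.start()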

Section D: Rate Limiting and Proxies

Rate limiting is an essential technique in web scraping to ensure that your scraping bot or script doesn't overload a website's server with too many requests in a short period of time. Implementing it helps you avoid being blocked by the website and ensures that you scrape data in a respectful and responsible manner.

Rate limiting involves controlling the number of requests your scraper makes to a website within a specific time frame, such as requests per second (RPS) or requests per minute (RPM). It ensures that your scraper doesn’t bombard the website’s server with an excessive number of requests, which could lead to IP bans or other restrictions. Here’s how you can implement rate limiting:

  1. Set a Request Limit: Determine the maximum number of requests you want to make to the website within a given time interval. This limit will depend on the website’s terms of service, the server’s capacity, and your own scraping requirements.
  2. Calculate the Delay: Calculate the delay between each request to stay within the rate limit. The delay is the inverse of the desired RPS or RPM. For example, if you want to limit your requests to 10 RPS, the delay between each request should be 0.1 seconds (1 / 10).
  3. Introduce Sleep or Pause: Before making each request, introduce a sleep or pause in your scraping script for the calculated delay time. This ensures that you don’t make requests too quickly.

import time
import requests

request_limit = 10          # target: at most 10 requests per second
base_url = "https://amazon.com/api/data"

for i in range(request_limit):
    response = requests.get(base_url)
    # Sleep for the inverse of the rate limit (0.1 s for 10 RPS)
    time.sleep(1 / request_limit)

When you’re sending a large number of requests, especially in a short period, there’s a higher risk of getting blocked by the server. Servers can detect unusual patterns and high-frequency requests, which may trigger rate limiting, IP blocking, or other protective measures.

When you need to send a large number of requests, integrating proxies into those requests can be crucial. You can find a guide on how to do this effectively by visiting the following link: Python Request Proxies
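
For reference, the requests library accepts a proxies mapping directly. A minimal sketch with placeholder proxy addresses (in practice you would rotate through a pool from your provider):

import requests

# Placeholder proxy endpoints - substitute your own provider's addresses
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

response = requests.get("https://www.amazon.in/s?k=laptop",
                        proxies=proxies, timeout=10)
print(response.status_code)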
