Web Scraping Using Python and Selenium

A guide to web scraping using Python and Selenium.

Ritik Jain
Python in Plain English


In today’s world, we consume a huge amount of data from multiple web sources: email, news feeds, social network feeds, and so on.

Similarly, data scientists require data for analysis, understanding, and development. That data could be structured (tabular) or unstructured (text, images), depending on the use case and requirements.

The internet is a rich source of data; we can get data of virtually any type from it. But extracting data manually is time-consuming and tedious, so we automate the extraction process instead. That process is known as web scraping.

What is web scraping?

Web scraping is a technique for extracting unstructured data from web pages and converting it into structured, machine-readable data.

Web scraping is very useful and has multiple use cases, such as:

  • Data Mining
  • Product Reviews
  • Weather monitoring
  • Price Monitoring - Price Comparison, Competitor Monitoring
  • Market Research - Market Analysis, Market trend analysis
  • Content Monitoring
  • and many more

In this post, we will be discussing Python, Selenium, and related libraries.

What is Selenium?

Selenium is a portable framework for testing web applications. Selenium provides a playback tool for authoring functional tests without the need to learn a test scripting language. (Source)

Here, we are using Selenium to automate the web browser and reach the desired data.

Setting up a System

In order to start scraping, we have to install a few libraries and tools.

The following commands will install Python 3 on an Ubuntu 18.04+ system. For Windows or macOS, please download Python and install it.

sudo apt-get update
sudo apt-get install build-essential libssl-dev libffi-dev python3-dev
sudo apt-get install python3-pip

Setting up a virtual environment.

Navigate to the target folder and run the following command.

pip install virtualenv
virtualenv -p python3 .venv

# for Windows
.venv\Scripts\activate.bat

# for macOS/Ubuntu
source .venv/bin/activate

After activating the virtualenv, the environment name appears at the start of the command-line prompt.

Now start installing the libraries:

pip install selenium pandas bs4

Selenium: web automation library

pandas: one of the most popular libraries in data science. It is used for data storage and manipulation, and it can read/write data in multiple formats.

bs4: BeautifulSoup is a code-level data extraction tool. It is very useful when we extract data from markup-based documents (HTML, XML, etc.).
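As a quick toy illustration (not from the original post) of how bs4 pulls data out of markup:

from bs4 import BeautifulSoup

# A toy HTML snippet to demonstrate extraction from markup
html = '<div><h2 class="title">Sample Product</h2><span class="price">$19.99</span></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h2", class_="title").text)    # Sample Product
print(soup.find("span", class_="price").text)  # $19.99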

Once you have installed all the libraries, we are good to go with the implementation.

Code Implementation

For this blog, we are scraping Amazon products matching a particular keyword. First, the code initializes a browser instance and browses to amazon.com.
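A minimal sketch of that first attempt (the original code was shared as an image):

from selenium import webdriver

# Initialize a Firefox browser instance and open amazon.com
driver = webdriver.Firefox()
driver.get("https://www.amazon.com")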

Executable path error

Oops, we got an error. The error shows that we don’t have a driver installed.

Selenium relies on a tool called a web driver. It is the robot that carries out all the actions instructed in the code.

Without a driver, we can’t automate the process. For this post, I am using Mozilla Firefox as the browser; for Firefox, Selenium uses Geckodriver as the web driver. You can check for other browsers’ driver tools here.

After downloading the web driver, we provide its executable path in the program.
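A minimal sketch of the fix, assuming geckodriver was downloaded to /usr/local/bin (adjust the path for your system):

from selenium import webdriver

# Path to the downloaded geckodriver binary (assumed location; adjust as needed)
# Note: Selenium 4+ passes the path via a Service object instead:
#   from selenium.webdriver.firefox.service import Service
#   driver = webdriver.Firefox(service=Service("/usr/local/bin/geckodriver"))
driver = webdriver.Firefox(executable_path="/usr/local/bin/geckodriver")
driver.get("https://www.amazon.com")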

Execution Screenshot

Great… we’ve reached the web page. Let me outline the flow:

  • Find the search box, enter keywords, and search.
  • Find the product list and extract product URLs.
  • Browse to each product URL.
  • Extract and save product details in CSV/JSON.

1. Find the search box, send keys, and click the search button

In the DOM, every tag is an element. We need to find the position of an element in the HTML document. For locating elements, Selenium supports multiple strategies such as XPath, tag name, and class name. We are using XPath here.

input_box_xpath = '//*[@id="twotabsearchtextbox"]'
search_button_xpath = '//*[@id="nav-search"]/form/div[2]/div/input'

Now that we have the elements, we need to send the keywords and click the search button.
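A sketch of this step, reusing the XPaths defined above (the search keyword here is just an example):

from selenium.webdriver.common.by import By

keyword = "laptop"  # example search term

# Locate the search box, type the keyword, and click the search button
input_box = driver.find_element(By.XPATH, input_box_xpath)
input_box.send_keys(keyword)
driver.find_element(By.XPATH, search_button_xpath).click()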

After the above code execution

2. Find the product list and extract product URLs

Similar to the first step, we find the element XPath and loop through all the products.

Here, all the products are part of hierarchical ‘div' tags, and we need to extract the product links from them.

xpath = "/html/body/div[1]/div[1]/div[1]/div[{}]/div/span[3]/div[2]/div[{}]/div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[1]/h2/a"

Here we have an XPath pattern for each product in the list. In the pattern, I have placed ‘{}’ wherever the path changes from one product to the next.

3. Iterate and browse the product list

Here, this small function extracts and appends all the product URLs to a list and returns it, as sketched below.
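The original function was shared as an image; the sketch below is consistent with the described flow, with assumed bounds for the results grid:

from selenium.common.exceptions import NoSuchElementException

def get_product_urls(driver, max_rows=5, max_cols=4):
    # Loop over the two changing indices in the XPath pattern above;
    # max_rows/max_cols are assumed bounds for the results grid
    product_urls = []
    for i in range(1, max_rows + 1):
        for j in range(1, max_cols + 1):
            try:
                link = driver.find_element(By.XPATH, xpath.format(i, j))
                product_urls.append(link.get_attribute("href"))
            except NoSuchElementException:
                continue  # no product at this grid position
    return product_urls

product_urls = get_product_urls(driver)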

4. Extracting and saving the product details

For now, we are extracting the product name, price, and ratings.

product_name = "#productTitle"
product_price = '#price_inside_buybox'
product_ratings = "span.arp-rating-out-of-text"
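Continuing from the earlier snippets, a sketch that visits each product URL, scrapes the three fields with the selectors above, and saves them with pandas (the output file names are illustrative):

import pandas as pd

records = []
for url in product_urls:
    driver.get(url)
    try:
        records.append({
            "name": driver.find_element(By.CSS_SELECTOR, product_name).text.strip(),
            "price": driver.find_element(By.CSS_SELECTOR, product_price).text.strip(),
            "ratings": driver.find_element(By.CSS_SELECTOR, product_ratings).text.strip(),
        })
    except NoSuchElementException:
        continue  # skip products missing any of the fields

# Save the scraped details as CSV and JSON
df = pd.DataFrame(records)
df.to_csv("products.csv", index=False)
df.to_json("products.json", orient="records")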
Scraping results

Conclusion

In this post, I discussed how to perform web scraping on dynamic websites and covered the components Selenium uses to automate the process of reaching the desired data.

Please click the Clap button if you like the post.
