Web Scraping Using Python and Selenium

A guide to web scraping using Python and Selenium.

Ritik Jain
Python in Plain English


In today’s world, we consume a huge amount of data from multiple web sources: email, news feeds, social network feeds, and so on.

Similarly, data scientists require data for analysis, understanding, and development. That data could be structured (tabular) or unstructured (text, images), depending on the use case and requirements.

The internet is a rich source of data; we can get data of virtually any type from it. But extracting data manually is time-consuming and tedious, so we automate the extraction process instead. That process is known as web scraping.

What is web scraping?

Web scraping is a technique for extracting unstructured data from web pages and converting it into structured, machine-readable data.

Web scraping is very useful and has multiple use cases, such as:

  • Data Mining
  • Product Reviews
  • Weather monitoring
  • Price Monitoring - Price Comparison, Competitor Monitoring
  • Market Research - Market Analysis, Market trend analysis
  • Content Monitoring
  • and many more

In this post, we will be discussing Python, Selenium, and related libraries.

What is Selenium?

Selenium is a portable framework for testing web applications. Selenium provides a playback tool for authoring functional tests without the need to learn a test scripting language. (Source)

Here, we are using Selenium to automate the web browser and reach the desired data.

Setting up a System

In order to start scraping, we have to install a few libraries and tools.

The following commands will install Python 3 on an Ubuntu 18.04+ system. For Windows or macOS, please download Python and install it.

sudo apt-get update
sudo apt-get install build-essential libssl-dev libffi-dev python3-dev
sudo apt-get install python3-pip

Setting up a virtual environment.

Navigate to the target folder and run the following command.

pip install virtualenv
virtualenv -p python3 .venv

# for Windows
.venv\Scripts\activate.bat

# for macOS/Ubuntu
source .venv/bin/activate

After activating the virtualenv, the environment name appears at the start of the command-line prompt.

Now start installing the libraries:

pip install selenium pandas bs4

Selenium: web automation library

pandas: one of the most popular libraries in data science. It is used for data storage and manipulation, and it can read/write data in multiple formats.

bs4: BeautifulSoup is a code-level data extraction tool. It is very useful when we extract data from markup-based documents (HTML, XML, etc.).
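As a quick toy illustration (not from the original post) of how bs4 pulls data out of markup:

from bs4 import BeautifulSoup

# A toy HTML snippet to demonstrate extraction from markup
html = '<div><h2 class="title">Sample Product</h2><span class="price">$19.99</span></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h2", class_="title").text)    # Sample Product
print(soup.find("span", class_="price").text)  # $19.99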

Once you have installed all the libraries, we are good to go with the implementation.

Code Implementation

For this blog, we are scraping Amazon products matching a particular keyword. First, the code initializes a browser instance and browses to amazon.com.
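A minimal sketch of that first attempt (the original code was shared as an image):

from selenium import webdriver

# Initialize a Firefox browser instance and open amazon.com
driver = webdriver.Firefox()
driver.get("https://www.amazon.com")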

Executable path error

Oops, we got an error. The error shows that we don’t have a driver installed.

Selenium relies on a tool called a web driver. It is the robot that carries out all the actions instructed in the code.

Without a driver, we can’t automate the process. For this post, I am using Mozilla Firefox as the browser; for Firefox, Selenium uses Geckodriver as the web driver. You can check for other browsers’ driver tools here.

After downloading the web driver, we provide its executable path in the program.
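A minimal sketch of the fix, assuming geckodriver was downloaded to /usr/local/bin (adjust the path for your system):

from selenium import webdriver

# Path to the downloaded geckodriver binary (assumed location; adjust as needed)
# Note: Selenium 4+ passes the path via a Service object instead:
#   from selenium.webdriver.firefox.service import Service
#   driver = webdriver.Firefox(service=Service("/usr/local/bin/geckodriver"))
driver = webdriver.Firefox(executable_path="/usr/local/bin/geckodriver")
driver.get("https://www.amazon.com")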

Execution Screenshot

Great… we’ve reached the web page. Let me outline the flow:

  • Find the search box, enter keywords, and search.
  • Find the product list and extract product URLs.
  • Browse to each product URL.
  • Extract and save product details in CSV/JSON.

1. Find the search box, send keys, and click the search button

In the DOM, every tag is an element. We need to find the position of an element in the HTML document. For locating elements, Selenium supports multiple strategies such as XPath, tag name, and class name. We are using XPath here.

input_box_xpath = '//*[@id="twotabsearchtextbox"]'
search_button_xpath = '//*[@id="nav-search"]/form/div[2]/div/input'

Now that we have the elements, we need to send the keywords and click the search button.
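A sketch of this step, reusing the XPaths defined above (the search keyword here is just an example):

from selenium.webdriver.common.by import By

keyword = "laptop"  # example search term

# Locate the search box, type the keyword, and click the search button
input_box = driver.find_element(By.XPATH, input_box_xpath)
input_box.send_keys(keyword)
driver.find_element(By.XPATH, search_button_xpath).click()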

After the above code execution

2. Find the product list and extract product URLs

Similar to the first step, we find the element XPath and loop through all the products.

Here, all the products are part of hierarchical ‘div' tags, and we need to extract the product links from them.

xpath = "/html/body/div[1]/div[1]/div[1]/div[{}]/div/span[3]/div[2]/div[{}]/div/span/div/div/div[2]/div[2]/div/div[1]/div/div/div[1]/h2/a"

Here we have an XPath pattern for each product in the list. In the pattern, I have placed ‘{}’ wherever the path changes from one product to the next.

3. Iterate and browse the product list

Here, this small function extracts and appends all the product URLs to a list and returns it, as sketched below.
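The original function was shared as an image; the sketch below is consistent with the described flow, with assumed bounds for the results grid:

from selenium.common.exceptions import NoSuchElementException

def get_product_urls(driver, max_rows=5, max_cols=4):
    # Loop over the two changing indices in the XPath pattern above;
    # max_rows/max_cols are assumed bounds for the results grid
    product_urls = []
    for i in range(1, max_rows + 1):
        for j in range(1, max_cols + 1):
            try:
                link = driver.find_element(By.XPATH, xpath.format(i, j))
                product_urls.append(link.get_attribute("href"))
            except NoSuchElementException:
                continue  # no product at this grid position
    return product_urls

product_urls = get_product_urls(driver)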

4. Extracting and saving the product details

For now, we are extracting the product name, price, and ratings.

product_name = "#productTitle"
product_price = '#price_inside_buybox'
product_ratings = "span.arp-rating-out-of-text"
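Continuing from the earlier snippets, a sketch that visits each product URL, scrapes the three fields with the selectors above, and saves them with pandas (the output file names are illustrative):

import pandas as pd

records = []
for url in product_urls:
    driver.get(url)
    try:
        records.append({
            "name": driver.find_element(By.CSS_SELECTOR, product_name).text.strip(),
            "price": driver.find_element(By.CSS_SELECTOR, product_price).text.strip(),
            "ratings": driver.find_element(By.CSS_SELECTOR, product_ratings).text.strip(),
        })
    except NoSuchElementException:
        continue  # skip products missing any of the fields

# Save the scraped details as CSV and JSON
df = pd.DataFrame(records)
df.to_csv("products.csv", index=False)
df.to_json("products.json", orient="records")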
Scraping results

Conclusion

In this post, I discussed how to perform web scraping on dynamic websites and covered the components Selenium uses to automate the process of reaching the desired data.

Please click the Clap button if you like the post.
