Web Scraping JavaScript Content in Python with Selenium and BeautifulSoup

Part 2

Neeraj Khadagade
Python in Plain English

--

This is the second part of the series. Make sure you read the first part, HERE

Our task in Part 2 is to extract:

  1. Name of the company
  2. Address of the company
  3. Survey scores present in the graphs (I have discussed this below)

Before we move forward, let’s see what survey scores mean.

Survey Scores

If you hover over the graph, you will see comment boxes like these. There are 10 such scores for each company (some companies have fewer than 10), and we have to extract these scores and save them into a dataframe.

If you click Inspect Element, you will not find any data. However, if you search for “Satisfaction” in the inspector, you will find all the survey score data under a <script> tag, in a variable named chart.data.

Inspect element of the Survey Scores
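To see why the data is invisible to a plain HTML parse, here is a minimal sketch with an invented page snippet (the markup, field names, and values are all hypothetical). The scores live only as JavaScript source text inside the <script> tag, and one way to confirm this without Selenium is to pull the literal out with a regular expression:

```python
import json
import re

# Hypothetical markup resembling what the inspector shows (values invented)
html = """
<script>
chart.data = [
    {"metric": "Satisfaction", "result": "86%"},
    {"metric": "Expectations", "result": "79%"}
];
</script>
"""

# The scores exist only as JavaScript text, not as HTML elements, which is
# why Inspect Element shows no data outside the <script> tag
match = re.search(r"chart\.data\s*=\s*(\[.*?\]);", html, re.DOTALL)
chart_data = json.loads(match.group(1))
print([row["result"] for row in chart_data])  # → ['86%', '79%']
```

This only works here because the invented literal happens to be valid JSON; real pages often use JavaScript syntax that json.loads cannot parse, which is where Selenium earns its keep.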

Observations:

chart.data is enclosed in a <script> tag, i.e., a JavaScript tag. Selenium lets us execute a JavaScript query from Python to extract the data.

If you look carefully, chart.data is a list of dictionaries. If we can pull this list of dictionaries into Python, we will be able to extract the required fields.
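As a sketch of what that extraction amounts to, here is a hypothetical sample of chart.data (field names and values invented). The JavaScript query we write later does the equivalent of this Python list comprehension:

```python
# Hypothetical sample of chart.data (field names and values invented)
chart_data = [
    {"metric": "Satisfaction", "result": "86%"},
    {"metric": "Expectations", "result": "79%"},
    {"metric": "Value for money", "result": "91%"},
]

# Pull out just the "result" field of every dictionary; this mirrors the
# JavaScript chart.data.map(({result}) => result) used later in the article
results = [row["result"] for row in chart_data]
print(results)  # → ['86%', '79%', '91%']
```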

In what follows, you will see how easy and seamless Selenium is when it comes to handling JavaScript tags in an HTML file.

Let’s continue with the scraping.

In Step 4, we had stored the links of the companies in a list. We will be using these links to redirect to each company’s profile and extract information.

The list “links”

Step 5: Iterating through each company’s link in the list links.

First, we import the relevant libraries. This is where we start using Selenium.

Also, we create two items here:

  1. A dictionary “data” to save the name, address, and the survey scores
  2. An empty dataframe “scores”, where we will append and save the dictionary data
# Importing the Selenium libraries
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
# Libraries needed for the HTML parsing and the dataframe in later steps
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = {}
scores = pd.DataFrame(columns = ['Name', 'Address'])

And we loop through each link from the list, links.

for link in links:
    # Set up a headless Chrome driver and open the company's page
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    path = "C:\\Users\\lenovo\\Documents\\chromedriver.exe"
    driver = webdriver.Chrome(path, options = options)
    driver.get(link)

The last line, driver.get(link), is where the Chrome driver starts accessing each link.

NOTE: You will have to download “chromedriver.exe” from this link if you don’t have it already. Make sure its version matches your current Google Chrome version. (On Selenium 4.6 or later, Selenium Manager can download a matching driver for you automatically.)

Step 6: We are at the most interesting step, where we start fetching the scores. This is where we write our JavaScript query.

To extract the scores, we first create an empty list, metric_values; we then fetch the scores out of the list of dictionaries chart.data and append them. Since metric_values is created inside the for loop, a fresh empty list is created for each link.

    # Fetching the scores from the JavaScript tag (still inside the for loop)
    metric_values = []
    # This is the JavaScript query to extract the result field of chart.data
    results_vals = driver.execute_script("""
        let results_vals = chart.data.map(({result}) => result);
        return results_vals;
    """)
    metric_values.append(results_vals)

We will use this metric_values to fetch the survey scores.
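To make the indexing in the next step concrete, here is a hypothetical value of metric_values after the execute_script call (the scores are invented). It is a list containing one inner list, so score[0] picks out the first survey score, score[1] the second, and so on:

```python
# Hypothetical metric_values after one iteration: execute_script returned
# one list of scores, which was appended to the (fresh) outer list
metric_values = [["86% ", "79% ", "91% "]]

# The comprehension visits only the single inner list, so the join
# effectively just cleans up one string
satisfaction = ''.join([score[0].strip() for score in metric_values])
expectations = ''.join([score[1].strip() for score in metric_values])
print(satisfaction, expectations)  # → 86% 79%
```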

Step 7: Extract the name and address of each company, and save each of the survey scores.

    # Extracting info (still inside the for loop)
    response = requests.get(link)
    webpage = BeautifulSoup(response.content, 'html.parser')
    # Name of the company
    data['Name'] = webpage.find('div', class_ = 'letterhead').find('h1', class_ = 'styled').get_text()
    # Address of the company
    try:
        data['Address'] = webpage.find('div', class_ = 'contact-box').find('div', class_ = 'address contact').text.rstrip()
    except AttributeError:
        data['Address'] = "Not Found"
    # Satisfaction scores
    try:
        data['Satisfaction'] = ''.join([score[0].strip() for score in metric_values])
    except (IndexError, AttributeError):
        data['Satisfaction'] = "Not Found"
    # Expectations scores
    try:
        data['Expectations'] = ''.join([score[1].strip() for score in metric_values])
    except (IndexError, AttributeError):
        data['Expectations'] = "Not Found"
    # ... and so on for all the survey scores

    # Converting the dictionary data into a new row of the dataframe
    # (DataFrame.append was removed in pandas 2.0; use
    # pd.concat([scores, pd.DataFrame([data])], ignore_index=True) there)
    scores = scores.append(data, ignore_index=True)
    # Close the browser before moving on to the next link
    driver.quit()

At the end of each iteration, the dictionary data is appended as a new row of the dataframe scores.
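Since DataFrame.append was removed in pandas 2.0, here is a minimal, version-proof sketch of the same idea (company names and scores invented): collect a fresh dictionary per iteration in a list, then build the dataframe once at the end.

```python
import pandas as pd

# Hypothetical per-company dictionaries, one built per loop iteration
# (build a fresh dict each time rather than reusing and mutating one)
rows = [
    {"Name": "Acme Ltd", "Address": "12 Main St", "Satisfaction": "86%"},
    {"Name": "Globex", "Address": "Not Found", "Satisfaction": "79%"},
]

# Building the frame in one go replaces the per-row append/concat calls
scores = pd.DataFrame(rows)
print(scores.shape)  # → (2, 3)
```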

The field “Name” will act as the common key to merge the two dataframes: profiles (discussed in Step 4 of Part 1) and scores.

Step 8: Merge both the dataframes

Finally, it’s time to merge both the dataframes on “Name”. And there you go!

final_df = pd.merge(profiles, scores, on='Name')
final_df.head(5)
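As a small sanity check on what the merge does, here is a sketch with two tiny invented frames. pd.merge defaults to an inner join, so only companies whose “Name” matches exactly in both frames survive:

```python
import pandas as pd

# Hypothetical profiles and scores frames (all values invented)
profiles = pd.DataFrame({"Name": ["Acme Ltd", "Globex"],
                         "Description": ["Anvils", "Power"]})
scores = pd.DataFrame({"Name": ["Acme Ltd", "Globex"],
                       "Satisfaction": ["86%", "79%"]})

# Inner join on the shared "Name" key; mismatched names would be dropped
final_df = pd.merge(profiles, scores, on="Name")
print(list(final_df.columns))  # → ['Name', 'Description', 'Satisfaction']
```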

As you can see, the name, description, awards/recognitions, address, and scores have been properly extracted and saved in a dataframe.

You can find the link to Part 2 of the Python Notebook HERE

It took me about 4 days to understand, strategize, and execute the program while learning the capabilities of Selenium and other relevant Python libraries. The program can still be optimized and improved, but I believe this is a good starting point for understanding the advantage of using Selenium to scrape data inside JavaScript tags.

I’m really glad to share this blog and my knowledge. I hope my research and experience will help you. You can refer to the complete Python Notebook HERE.

Do let me know your views or any suggestions in the comments section below. Thank you!
