Web Scraping JavaScript Content in Python with Selenium and BeautifulSoup

Part 2

Neeraj Khadagade
Python in Plain English

--

This is the second part of the series. Make sure you read the first part, HERE

Our task in Part 2 is to extract:

  1. Name of the company
  2. Address of the company
  3. Survey scores present in the graphs (I have discussed this below)

Before we move forward, let’s see what survey scores mean.

Survey Scores

If you hover over the graph, you will see comment boxes like these. There are 10 such scores for each company (some companies have fewer than 10), and we have to extract these scores and save them into a dataframe.

If you click Inspect Element, you will not find any data. However, if you search for “Satisfaction” in the inspector, you will find all the survey score data under a <script> tag, in a variable named chart.data.

Inspect element of the Survey Scores
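To see why the data is invisible to a plain HTML parse, here is a minimal sketch with an invented page snippet (the markup, field names, and values are all hypothetical). The scores live only as JavaScript source text inside the <script> tag, and one way to confirm this without Selenium is to pull the literal out with a regular expression:

```python
import json
import re

# Hypothetical markup resembling what the inspector shows (values invented)
html = """
<script>
chart.data = [
    {"metric": "Satisfaction", "result": "86%"},
    {"metric": "Expectations", "result": "79%"}
];
</script>
"""

# The scores exist only as JavaScript text, not as HTML elements, which is
# why Inspect Element shows no data outside the <script> tag
match = re.search(r"chart\.data\s*=\s*(\[.*?\]);", html, re.DOTALL)
chart_data = json.loads(match.group(1))
print([row["result"] for row in chart_data])  # → ['86%', '79%']
```

This only works here because the invented literal happens to be valid JSON; real pages often use JavaScript syntax that json.loads cannot parse, which is where Selenium earns its keep.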

Observations:

chart.data is enclosed in a <script> tag, i.e., a JavaScript tag. Selenium lets us execute a JavaScript query from Python to extract the data.

If you look carefully, chart.data is a list of dictionaries. If we can pull this list of dictionaries into Python, we will be able to extract the required fields.
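As a sketch of what that extraction amounts to, here is a hypothetical sample of chart.data (field names and values invented). The JavaScript query we write later does the equivalent of this Python list comprehension:

```python
# Hypothetical sample of chart.data (field names and values invented)
chart_data = [
    {"metric": "Satisfaction", "result": "86%"},
    {"metric": "Expectations", "result": "79%"},
    {"metric": "Value for money", "result": "91%"},
]

# Pull out just the "result" field of every dictionary; this mirrors the
# JavaScript chart.data.map(({result}) => result) used later in the article
results = [row["result"] for row in chart_data]
print(results)  # → ['86%', '79%', '91%']
```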

In what follows, you will see how easy and seamless Selenium is when it comes to handling JavaScript tags in an HTML file.

Let’s continue with the scraping.

In Step 4, we had stored the links of the companies in a list. We will be using these links to redirect to each company’s profile and extract information.

The list “links”

Step 5: Iterating through each company’s link in the list links.

First, we import the relevant libraries. This is where we start using Selenium.

Also, we create two items here:

  1. A dictionary “data” to save the name, address, and the survey scores
  2. An empty dataframe “scores”, where we will append and save the dictionary data
# Importing the Selenium libraries
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
# Libraries needed for the HTML parsing and the dataframe in later steps
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = {}
scores = pd.DataFrame(columns = ['Name', 'Address'])

And we loop through each link from the list, links.

for link in links:
    # Set up a headless Chrome driver and open the company's page
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    path = "C:\\Users\\lenovo\\Documents\\chromedriver.exe"
    driver = webdriver.Chrome(path, options = options)
    driver.get(link)

The last line, driver.get(link), is where the Chrome driver starts accessing each link.

NOTE: You will have to download “chromedriver.exe” from this link if you don’t have it already. Make sure its version matches your current Google Chrome version. (On Selenium 4.6 or later, Selenium Manager can download a matching driver for you automatically.)

Step 6: We are at the most interesting step, where we start fetching the scores. This is where we write our JavaScript query.

To extract the scores, we first create an empty list, metric_values; we then fetch the scores out of the list of dictionaries chart.data and append them. Since metric_values is created inside the for loop, a fresh empty list is created for each link.

    # Fetching the scores from the JavaScript tag (still inside the for loop)
    metric_values = []
    # This is the JavaScript query to extract the result field of chart.data
    results_vals = driver.execute_script("""
        let results_vals = chart.data.map(({result}) => result);
        return results_vals;
    """)
    metric_values.append(results_vals)

We will use this metric_values to fetch the survey scores.
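To make the indexing in the next step concrete, here is a hypothetical value of metric_values after the execute_script call (the scores are invented). It is a list containing one inner list, so score[0] picks out the first survey score, score[1] the second, and so on:

```python
# Hypothetical metric_values after one iteration: execute_script returned
# one list of scores, which was appended to the (fresh) outer list
metric_values = [["86% ", "79% ", "91% "]]

# The comprehension visits only the single inner list, so the join
# effectively just cleans up one string
satisfaction = ''.join([score[0].strip() for score in metric_values])
expectations = ''.join([score[1].strip() for score in metric_values])
print(satisfaction, expectations)  # → 86% 79%
```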

Step 7: Extract the name and address of each company, and save each of the survey scores.

    # Extracting info (still inside the for loop)
    response = requests.get(link)
    webpage = BeautifulSoup(response.content, 'html.parser')
    # Name of the company
    data['Name'] = webpage.find('div', class_ = 'letterhead').find('h1', class_ = 'styled').get_text()
    # Address of the company
    try:
        data['Address'] = webpage.find('div', class_ = 'contact-box').find('div', class_ = 'address contact').text.rstrip()
    except AttributeError:
        data['Address'] = "Not Found"
    # Satisfaction scores
    try:
        data['Satisfaction'] = ''.join([score[0].strip() for score in metric_values])
    except (IndexError, AttributeError):
        data['Satisfaction'] = "Not Found"
    # Expectations scores
    try:
        data['Expectations'] = ''.join([score[1].strip() for score in metric_values])
    except (IndexError, AttributeError):
        data['Expectations'] = "Not Found"
    # ... and so on for all the survey scores

    # Converting the dictionary data into a new row of the dataframe
    # (DataFrame.append was removed in pandas 2.0; use
    # pd.concat([scores, pd.DataFrame([data])], ignore_index=True) there)
    scores = scores.append(data, ignore_index=True)
    # Close the browser before moving on to the next link
    driver.quit()

At the end of each iteration, the dictionary data is appended as a new row of the dataframe scores.
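Since DataFrame.append was removed in pandas 2.0, here is a minimal, version-proof sketch of the same idea (company names and scores invented): collect a fresh dictionary per iteration in a list, then build the dataframe once at the end.

```python
import pandas as pd

# Hypothetical per-company dictionaries, one built per loop iteration
# (build a fresh dict each time rather than reusing and mutating one)
rows = [
    {"Name": "Acme Ltd", "Address": "12 Main St", "Satisfaction": "86%"},
    {"Name": "Globex", "Address": "Not Found", "Satisfaction": "79%"},
]

# Building the frame in one go replaces the per-row append/concat calls
scores = pd.DataFrame(rows)
print(scores.shape)  # → (2, 3)
```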

The field “Name” will act as the common key to merge the two dataframes: profiles (discussed in Step 4 of Part 1) and scores.

Step 8: Merge both the dataframes

Finally, it’s time to merge both the dataframes on “Name”. And there you go!

final_df = pd.merge(profiles, scores, on='Name')
final_df.head(5)
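As a small sanity check on what the merge does, here is a sketch with two tiny invented frames. pd.merge defaults to an inner join, so only companies whose “Name” matches exactly in both frames survive:

```python
import pandas as pd

# Hypothetical profiles and scores frames (all values invented)
profiles = pd.DataFrame({"Name": ["Acme Ltd", "Globex"],
                         "Description": ["Anvils", "Power"]})
scores = pd.DataFrame({"Name": ["Acme Ltd", "Globex"],
                       "Satisfaction": ["86%", "79%"]})

# Inner join on the shared "Name" key; mismatched names would be dropped
final_df = pd.merge(profiles, scores, on="Name")
print(list(final_df.columns))  # → ['Name', 'Description', 'Satisfaction']
```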

As you can see, the name, description, awards/recognitions, address, and scores have been properly extracted and saved in a dataframe.

You can find the link to Part 2 of the Python Notebook HERE

It took me about 4 days to understand, strategize, and execute the program while learning the capabilities of Selenium and other relevant Python libraries. The program can still be optimized and improved, but I believe this is a good starting point for understanding the advantage of using Selenium to scrape data inside JavaScript tags.

I’m really glad to share this blog and my knowledge. I hope my research and experience will help you. You can refer to the complete Python Notebook HERE.

Do let me know your views or any suggestions in the comments section below. Thank you!
