Web Scraping and HTML Parsing using BeautifulSoup Python Library

Learn basic Web Scraping and HTML parsing using the BeautifulSoup (bs4) Python Library

Habib Rahman
Python in Plain English


Photo by Lee Campbell on Unsplash

Web scraping and HTML parsing are increasingly common tasks, and Python is a popular choice for them. BeautifulSoup is a Python library for parsing HTML and XML.

In this article, I will discuss how to install BeautifulSoup and parse an HTML page. I will try to collect the available jobs from the Stack Overflow job section.

First, let's look at the Stack Overflow job page. In the second column from the left, each listing shows the job title, company name, location, technology requirements, post date, and job type. We will parse the job title from this page as our learning example.

Screenshot from Stack Overflow Jobs

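Installing BeautifulSoup

For Debian or Ubuntu Linux, we can install it from the system package manager:

$ apt-get install python-bs4 #python2
$ apt-get install python3-bs4 #python3

Or, we can install it using pip. The package is published on PyPI as beautifulsoup4, while bs4 is the module name we import:

$ pip2 install beautifulsoup4 #python2
$ pip3 install beautifulsoup4 #python3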

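We also need a module to work with URLs, as we will deal with live web pages. We will use urllib for this purpose.

Setting Up urllib

In Python 3, urllib is part of the standard library, so there is nothing to install with pip. In Python 2, the equivalent urlopen function lives in the standard-library module urllib2. We can check that the import works as follows:

$ python3 -c "import urllib.request" #python3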

Loading the HTML Page

To load the page behind a URL as HTML, we use the following code snippet. Here, self.url is the attribute that holds the page address.

import urllib.request
html_file = urllib.request.urlopen(self.url).read()
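The loading step can fail on a live site (timeouts, blocked requests, a wrong address), so it may help to wrap it in a small function. The following is a minimal sketch, not the article's own code: the function name fetch_html, the timeout value, and the User-Agent header are assumptions, and a data: URL stands in for a real page so the example runs offline.

```python
import base64
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url`, or None if loading fails."""
    # Some sites reject urllib's default User-Agent, so send an explicit one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as error:
        # HTTPError is a subclass of URLError, so both are handled here.
        print(f"Could not load {url}: {error}")
        return None


# Offline stand-in for a live page: a data: URL carrying a tiny HTML document.
sample_page = b"<html><body><h2>Senior Software Engineer</h2></body></html>"
sample_url = "data:text/html;base64," + base64.b64encode(sample_page).decode("ascii")

html_file = fetch_html(sample_url)
print(html_file)
```

With a real address, the call would simply be fetch_html(url), and the returned bytes can be passed on to BeautifulSoup.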

Making the HTML Soup

After loading the HTML file, we make the soup using BeautifulSoup with the following code.

from bs4 import BeautifulSoup
html_soup = BeautifulSoup(html_file, 'html.parser')
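To see what the soup gives us, here is a small sketch that parses a hand-written page instead of the live one. The markup in html_file is a hypothetical sample, not the real Stack Overflow page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html_file = b"""
<html>
  <head><title>Jobs</title></head>
  <body>
    <h2><a href="/jobs/1" title="Senior Software Engineer">Senior Software Engineer</a></h2>
  </body>
</html>
"""

html_soup = BeautifulSoup(html_file, "html.parser")

page_title = html_soup.title.string       # text inside the <title> tag
first_link = html_soup.find("a")["href"]  # href attribute of the first <a> tag

print(page_title)  # Jobs
print(first_link)  # /jobs/1
```

The second argument, 'html.parser', selects the parser built into Python, so no extra parser package is needed.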

Now we are ready to enjoy our soup. First, though, we have to find the path that we will use to extract the job titles. Let's get our hands dirty and search for that path. To copy it, we follow these steps:

I. Inspect the HTML page code, select the data, and copy the selector path:

In the browser, hover over the data we want to extract and inspect the HTML code. First, we have to select the portion of the HTML code that highlights our data.

Inspecting the HTML Code of the Data

As the last step of the path selection, we copy the selector as follows:

Copying the Selector of the Data

II. Fix the path and the data that we need

If we copy the path correctly and assign it to a variable named selector, it will look as follows:

selector = '#content > div.js-search-container.search-container.mbn24 > div > div.grid--cell.fl1.br.bc-black-2 > div > div.listResults > div:nth-child(2) > div:nth-child(2) > div.grid--cell.fl1 > h2 > a'
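A selector copied from the browser points at exactly one element (note the nth-child parts) and tends to break when the page layout changes. As a sketch of an alternative, a shorter selector can match every job title link at once. The markup below is a hypothetical imitation of two list entries, and short_selector is an assumption based on the class names visible in the copied selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates two entries of the job list.
sample_html = """
<div class="listResults">
  <div><div class="grid--cell fl1"><h2>
    <a href="/jobs/1" title="Senior Software Engineer" class="s-link stretched-link">Senior Software Engineer</a>
  </h2></div></div>
  <div><div class="grid--cell fl1"><h2>
    <a href="/jobs/2" title="Data Analyst" class="s-link stretched-link">Data Analyst</a>
  </h2></div></div>
</div>
"""

html_soup = BeautifulSoup(sample_html, "html.parser")

# Any link with class "s-link" that is a direct child of an <h2> in the result list.
short_selector = "div.listResults h2 > a.s-link"
job_links = html_soup.select(short_selector)

job_titles = [link["title"] for link in job_links]
print(job_titles)  # ['Senior Software Engineer', 'Data Analyst']
```

select() returns a list of all matches, which is why the long selector above is followed by [0] when only the first job is wanted.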

Extract Data from HTML Soup

To extract the data, we will use the code snippet below:

job_post_html = html_soup.select(selector)[0]

Using the previous code snippet, we extract the <a> … </a> tag of the job title:

<a href="/jobs/530033/senior-software-engineer-voltaiq?a=2RKUtYbfQFEY&amp;so=i&amp;pg=1&amp;offset=1&amp;total=1545&amp;so_medium=Internal&amp;so_source=JobSearch" title="Senior Software Engineer" class="s-link stretched-link">Senior Software Engineer</a>

From this tag, we have to select the title. To do so, we make a soup of this tag again and read its title attribute. We will use the following code for it.

job_post_soup = BeautifulSoup(str(job_post_html), 'html.parser')
job_title = job_post_soup.a['title']
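Since select() already returns Tag objects, the attribute can also be read from the tag directly, without making a second soup. The sketch below shows this variant; the href is shortened from the tag shown earlier, and the attribute name data-missing is a made-up example of an attribute that does not exist.

```python
from bs4 import BeautifulSoup

# The job title tag from above, with a shortened href for readability.
tag_html = (
    '<a href="/jobs/530033/senior-software-engineer-voltaiq" '
    'title="Senior Software Engineer" class="s-link stretched-link">'
    "Senior Software Engineer</a>"
)
html_soup = BeautifulSoup(tag_html, "html.parser")
job_post_html = html_soup.select("a")[0]  # a Tag object, as in the article

job_title = job_post_html["title"]              # raises KeyError if the attribute is absent
link_text = job_post_html.get_text(strip=True)  # visible text of the link
missing = job_post_html.get("data-missing")     # .get() returns None instead of raising

print(job_title)  # Senior Software Engineer
print(link_text)  # Senior Software Engineer
print(missing)    # None
```

Using .get() is the safer choice when some listings might lack the attribute.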

Finally, we have the desired data in the variable, job_title. Here is my final code:

HTMLParser with Stack Overflow Job title Parsing Example — Author
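The steps above could fit together roughly as in the following minimal sketch. It is not necessarily the author's exact listing: the class name JobTitleParser, its method names, the sample markup, and the data: URL (used so the example runs without a network connection) are all assumptions made for illustration.

```python
import base64
import urllib.request

from bs4 import BeautifulSoup


class JobTitleParser:
    """Hypothetical wrapper that ties the article's steps together."""

    def __init__(self, url, selector):
        self.url = url
        self.selector = selector

    def load_html(self):
        # Step 1: download the page as raw bytes.
        with urllib.request.urlopen(self.url, timeout=10) as response:
            return response.read()

    def parse_job_titles(self):
        # Step 2: make the soup.
        html_soup = BeautifulSoup(self.load_html(), "html.parser")
        # Step 3: select the matching tags and read their title attribute.
        tags = html_soup.select(self.selector)
        return [tag["title"] for tag in tags if tag.has_attr("title")]


# Hypothetical sample page, served through a data: URL instead of a live site.
sample_page = b"""
<div class="listResults">
  <h2><a href="/jobs/1" title="Senior Software Engineer">Senior Software Engineer</a></h2>
  <h2><a href="/jobs/2" title="Data Analyst">Data Analyst</a></h2>
</div>
"""
sample_url = "data:text/html;base64," + base64.b64encode(sample_page).decode("ascii")

parser = JobTitleParser(sample_url, "h2 > a")
titles = parser.parse_job_titles()
print(titles)  # ['Senior Software Engineer', 'Data Analyst']
```

For a live page, the URL of the job listing and the selector copied from the browser would be passed to the constructor instead.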

Conclusion

Thanks for your valuable time and for reading this article.

If you find this article helpful, please feel free to comment, and share. In addition, you can visit my web profile, read my blog posts, and follow me on Twitter.

To learn how to run C/C++ code in Sublime Text, you may visit this article.

Happy Scraping!
