Web Scraping and HTML Parsing using BeautifulSoup Python Library
Learn basic Web Scraping and HTML parsing using the BeautifulSoup (bs4) Python Library
Web scraping and HTML parsing are increasingly in demand. Many programmers prefer Python for parsing tasks, and BeautifulSoup is a useful Python library for parsing HTML and XML.
In this article, I will discuss how to install BeautifulSoup and parse an HTML page. I will try to collect the available jobs from the Stack Overflow job section.
First, let's look at the Stack Overflow jobs page. In the second column from the left, we can see the job title, company name, location, technology requirements, post date, and job type for each listing. We will parse the job titles from this page.
Installing BeautifulSoup
On Debian or Ubuntu Linux, we can install it with the system package manager:
$ apt-get install python-bs4 #python2
$ apt-get install python3-bs4 #python3
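Or, we can install it using pip:
$ pip2 install bs4 #python2
$ pip3 install bs4 #python3
On PyPI, bs4 is a thin wrapper package that pulls in the real package, beautifulsoup4, which can also be installed directly under that name.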
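To confirm that the installation worked, a quick check like the following can help. It is only a minimal sketch: it prints the installed version and parses a tiny made-up HTML string that is not part of the Stack Overflow page.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version to confirm the import works.
print("BeautifulSoup version:", bs4.__version__)

# Parse a tiny, made-up HTML string with Python's built-in parser.
soup = BeautifulSoup("<p>Hello, soup!</p>", "html.parser")
greeting = soup.p.get_text()
print(greeting)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.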
We also need a package to work with URLs, as we will deal with live web pages. We will use urllib for this purpose.
Importing urllib
In Python 3, urllib is part of the standard library, so there is nothing to install with pip. We only need to import urllib.request in our script. In Python 2, the equivalent urlopen function lives in the built-in urllib2 module.
Loading the HTML Page
To load the page as raw HTML, we can use the following code snippet, where url holds the address of the job listing page:
import urllib.request
html_file = urllib.request.urlopen(url).read()
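The loading step can be wrapped in a small function. The sketch below is one possible shape, not the only one: the helper name load_html and the 10-second timeout are assumptions for illustration. To keep the demo runnable without network access, it reads from a data: URL, which urllib supports in Python 3.4 and later.

```python
from urllib.parse import quote
from urllib.request import urlopen


def load_html(url, timeout=10):
    """Return the raw bytes served at `url` (hypothetical helper name)."""
    # The context manager closes the response even if read() fails.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# A data: URL embeds the document in the URL itself, so this demo needs
# no network access. Swap in a real http(s) address to fetch a live page.
demo_url = "data:text/html," + quote("<h1>Hello</h1>")
html_file = load_html(demo_url)
print(html_file)
```

Note that read() returns bytes, not str. BeautifulSoup accepts either, so no manual decoding is needed before parsing.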
Making the HTML Soup
After loading the HTML file, we have to make the soup using BeautifulSoup. Let's use the following code for it.
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(html_file, 'html.parser')
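To see what the soup object gives us, here is a small sketch that parses a made-up page instead of the live one. The HTML content and the variable names page_title and link_target are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the bytes returned by urlopen(...).read().
html_file = b"""<html><head><title>Jobs</title></head>
<body><h2><a href="/jobs/1" title="Senior Software Engineer">Senior Software Engineer</a></h2></body></html>"""

# 'html.parser' is Python's built-in parser, so no extra install is needed.
html_soup = BeautifulSoup(html_file, "html.parser")

# Tags can be reached as attributes of the soup or searched with find().
page_title = html_soup.title.string
first_link = html_soup.find("a")
link_target = first_link["href"]

print(page_title)
print(link_target)
```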
Now we are on the way to enjoying our soup. Next, we have to find the path that we will use to extract the job titles. Let's get our hands dirty and search for that path. To copy it, we will follow these steps:
I. Inspect the HTML page, select the data, and copy the selector path:
In the browser, right-click the data we want to extract and choose Inspect. In the developer tools, select the portion of the HTML code that highlights our data.
As the last step of this path selection, copy the selector from the element's context menu (in Chrome, for example, Copy > Copy selector).
II. Fix the path and the data that we need
If we copy the path correctly and assign it to a variable named selector, it will look as follows:
selector = '#content > div.js-search-container.search-container.mbn24 > div > div.grid--cell.fl1.br.bc-black-2 > div > div.listResults > div:nth-child(2) > div:nth-child(2) > div.grid--cell.fl1 > h2 > a'
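A copied selector like this is tied to the exact page layout, so it can stop matching after a small redesign. The sketch below shows, on made-up markup that only imitates the structure above, how a shorter selector matches every job title while an nth-child selector matches a single position. The sample HTML, the job data, and the shortened selectors are all assumptions, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the listing structure; not the real page.
sample_html = """
<div class="listResults">
  <div>
    <div class="grid--cell fl1">
      <h2><a href="/jobs/1" title="Senior Software Engineer" class="s-link stretched-link">Senior Software Engineer</a></h2>
    </div>
  </div>
  <div>
    <div class="grid--cell fl1">
      <h2><a href="/jobs/2" title="Data Engineer" class="s-link stretched-link">Data Engineer</a></h2>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A shorter selector that relies on class names matches every title link.
all_titles = [a["title"] for a in soup.select("div.listResults h2 > a.s-link")]

# :nth-child(2) narrows the match to the second listing only.
second_only = [
    a["title"]
    for a in soup.select("div.listResults > div:nth-child(2) h2 > a")
]

print(all_titles)
print(second_only)
```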
Extract Data from HTML Soup
To extract the data, we will use the code snippet below. The select method returns a list of matching tags, so we take the first element:
job_post_html = html_soup.select(selector)[0]
Using the previous code snippet, we extract the <a> … </a> tag of the job title:
<a href="/jobs/530033/senior-software-engineer-voltaiq?a=2RKUtYbfQFEY&so=i&pg=1&offset=1&total=1545&so_medium=Internal&so_source=JobSearch" title="Senior Software Engineer" class="s-link stretched-link">Senior Software Engineer</a>
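Indexing the result of select with [0] raises an IndexError when nothing matches, for example after the page layout changes. As a hedged alternative, BeautifulSoup also offers select_one, which returns the first match or None. The sketch below uses a made-up one-line document and simplified selectors.

```python
from bs4 import BeautifulSoup

# Made-up document containing a single job-title link.
html = '<h2><a title="Senior Software Engineer" href="/jobs/1">Senior Software Engineer</a></h2>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("h2 > a")      # always a list, possibly empty
first = soup.select_one("h2 > a")    # first matching tag, or None
missing = soup.select_one("h3 > a")  # no <h3> in the document, so None

if first is not None:
    print(first["title"])
if missing is None:
    print("No match for 'h3 > a'")
```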
From this tag, we have to select the title. To do so, we will make a soup of this tag again and parse the title attribute. We will use the following code for it.
job_post_soup = BeautifulSoup(str(job_post_html), 'html.parser')
job_title = job_post_soup.a['title']
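Re-parsing the tag works, but the objects returned by select are already Tag objects, so their attributes can also be read directly. The following sketch shows this alternative on a simplified copy of the tag above. The shortened href is an assumption for readability.

```python
from bs4 import BeautifulSoup

# Simplified copy of the extracted tag (href shortened for readability).
html = '<a href="/jobs/530033/senior-software-engineer-voltaiq" title="Senior Software Engineer" class="s-link stretched-link">Senior Software Engineer</a>'
job_post_html = BeautifulSoup(html, "html.parser").select("a")[0]

# Attributes are available with dictionary-style access on the tag itself.
job_title = job_post_html["title"]

# .get() returns None instead of raising KeyError for a missing attribute.
missing_attr = job_post_html.get("data-missing")

# The visible link text is another way to read the title.
link_text = job_post_html.get_text(strip=True)

# Multi-valued attributes such as class come back as a list.
css_classes = job_post_html["class"]

print(job_title, link_text, css_classes)
```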
Finally, we have the desired data in the variable job_title.
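The steps above can be combined into a short script. The following is only a sketch under assumptions, not the original listing: the function names fetch_html and extract_job_titles, the shortened selector, and the sample markup are all hypothetical. The demo parses an inline sample so that it runs without network access.

```python
import urllib.request

from bs4 import BeautifulSoup

# Hypothetical selector: shorter than the copied one, it matches every
# job-title link instead of a single nth-child position.
TITLE_SELECTOR = "div.listResults h2 > a[title]"


def fetch_html(url, timeout=10):
    """Download the raw bytes of the page at `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_job_titles(html, selector=TITLE_SELECTOR):
    """Return the title attribute of every link matched by `selector`."""
    html_soup = BeautifulSoup(html, "html.parser")
    return [link["title"] for link in html_soup.select(selector)]


# Made-up markup standing in for fetch_html(<job listing URL>).
sample_html = """
<div class="listResults">
  <div><h2><a href="/jobs/1" title="Senior Software Engineer">Senior Software Engineer</a></h2></div>
  <div><h2><a href="/jobs/2" title="Data Engineer">Data Engineer</a></h2></div>
</div>
"""

job_titles = extract_job_titles(sample_html)
for job_title in job_titles:
    print(job_title)
```

Keeping the download and the parsing in separate functions makes the parsing step easy to try out on saved or hand-written HTML before pointing it at a live page.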
Conclusion
Thanks for your valuable time and for reading this article.
If you find this article helpful, please feel free to comment, and share. In addition, you can visit my web profile, read my blog posts, and follow me on Twitter.
Happy Scraping!