Webscrape Heart Disease News Using BeautifulSoup

How severely can heart disease affect our health?

Irene Too
Python in Plain English

*This article is for educational purposes only

What are the latest discoveries in heart disease research?

Heart disease, such as coronary artery disease or heart failure, can be fatal.

“17.9 million people die each year from CVDs (cardiovascular disease), an estimated 31% of all deaths worldwide.” (reported by the World Health Organization)

Patients who have been infected with Covid-19 are reported to show heart problems afterwards.

“… More than two months later, infected patients were more likely to have troubling cardiac signs …” (Stat News, reporting on a recent study by Valentina O. Puntmann)

What is the latest heart disease research news? Let’s find out in this tutorial.

We are going to scrape the latest headlines from www.sciencedaily.com.

Image by author

403 error when crawling without a user agent

You may get a 403 error when you crawl a website that identifies you as a bot.

from urllib.request import Request, urlopen

url = "https://www.sciencedaily.com/news/health_medicine/heart_disease/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = Request(url, headers=headers)
webpage = urlopen(result).read()

Pass a User-Agent string in the ‘headers’ parameter instead.

You can refer to the link here.

Code:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
results = Request(url, headers=headers)
webpage = urlopen(results).read()
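
The request step can also be wrapped so that a 403 response is reported clearly. This is only a sketch: the helper names `build_request` and `fetch` and the 10-second timeout are assumptions made for illustration, not part of the original code.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

URL = "https://www.sciencedaily.com/news/health_medicine/heart_disease/"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page as bytes; turn a 403 into a more readable error."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 403:
            raise RuntimeError("The site rejected the request (403)") from err
        raise
```

Calling `webpage = fetch(URL)` then gives the same bytes as the three lines above.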

Parse the webpage with BeautifulSoup, using the lxml HTML parser.
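
A minimal sketch of the parsing step follows. The helper `parse_html` and the sample markup are assumptions for illustration; in the tutorial the input would be the downloaded `webpage` bytes. The fallback to Python's built-in parser is an addition for machines where lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_html(markup):
    """Parse with lxml when it is installed, else use the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")


# Hypothetical stand-in for the downloaded page
sample_page = (
    b"<html><head><title>Heart Disease News</title></head>"
    b"<body><div class='tab-content'></div></body></html>"
)

soup = parse_html(sample_page)
print(soup.title.get_text())
```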

Scrape headlines

We are going to scrape the news under the ‘Summaries’ section, which sits in a ‘div’ tag with the class name ‘tab-content’.

Image by author

Wait!

There is another headline section above ‘Summaries’, and it is also a ‘div’ tag with ‘tab-content’ among its class names.

Image by author

To keep BeautifulSoup from scraping ‘div.featured_blurbs.hero.tab-content’, filter out that class name:

import re

# Keep the 'tab-content' div that is not the featured 'hero' block
for item in soup.find_all(True, {"class": re.compile("^(tab-content)$")}):
    if 'tab-content' in item.attrs['class'] and 'hero' not in item.attrs['class']:
        headlines = item
Image by author
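
The same filter can also be written as a CSS selector. This is an alternative to the loop, not the approach used in this article, and the sample markup is a hypothetical stand-in that copies only the class names mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a featured block and a summaries block,
# both carrying the 'tab-content' class
sample_html = """
<div class="featured_blurbs hero tab-content">
  <h3 class="latest-head">Featured sample headline</h3>
</div>
<div class="tab-content">
  <h3 class="latest-head">Summary sample headline</h3>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select the div that has 'tab-content' but not 'hero'
headlines = soup.select_one("div.tab-content:not(.hero)")
print(headlines.h3.get_text())
```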

Extract headline titles

‘h3.latest-head’ tells us that the ‘h3’ tag has the class name ‘latest-head’.

Here is the code to extract the headline titles:
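
A minimal sketch of this step, assuming each title is the text of an ‘h3.latest-head’ tag inside `headlines`. The sample markup and the dataframe name `titles` are assumptions for illustration, and `html.parser` is used here only to avoid the extra lxml dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the 'Summaries' block
sample_html = """
<div class="tab-content">
  <h3 class="latest-head"><a href="#">First sample headline</a></h3>
  <h3 class="latest-head"><a href="#">Second sample headline</a></h3>
</div>
"""

headlines = BeautifulSoup(sample_html, "html.parser")

# Collect the visible text of every h3 tag with class 'latest-head'
title_texts = [
    h3.get_text(strip=True)
    for h3 in headlines.find_all("h3", class_="latest-head")
]

titles = pd.DataFrame({"Title": title_texts})
print(titles)
```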

Here is the resulting dataframe:

Extract Dates and Descriptions

Inspect the dates and descriptions in the ‘Summaries’ section.

‘div.latest-summary’ tells us that the ‘div’ tag has the class name ‘latest-summary’.

Here is the code to extract the details of each summary:
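
A minimal sketch of this step, assuming each ‘div.latest-summary’ holds the date and the description as one string. The sample markup and the dataframe name `details` are assumptions for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; the dash between date and description
# mirrors the separator that is split on later
sample_html = """
<div class="tab-content">
  <div class="latest-summary">Jan. 1, 2021 — First sample description.</div>
  <div class="latest-summary">Jan. 2, 2021 — Second sample description.</div>
</div>
"""

headlines = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every div tag with class 'latest-summary'
detail_texts = [
    div.get_text(strip=True)
    for div in headlines.find_all("div", class_="latest-summary")
]

details = pd.DataFrame({"Details": detail_texts})
print(details)
```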

Note that the dates and descriptions are not separated yet. Concatenate the ‘Title’ and ‘Details’ columns into a single dataframe:

heart_df = pd.concat([titles, details], axis=1)

Almost there. What’s left is to separate the dates from the descriptions into different columns.

heart_df[['Date','Description']] = heart_df.Details.apply(lambda x: pd.Series(str(x).split("—")))
heart_df = heart_df.drop(['Details'], axis=1)
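
To see what the split does, here is a small worked example on made-up rows. It is a variant of the line above, not the original code: it uses `str.split` with `n=1` so that a dash inside a description would not create a third column, and it strips the leftover spaces. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped data
heart_df = pd.DataFrame(
    {
        "Title": ["First sample headline", "Second sample headline"],
        "Details": [
            "Jan. 1, 2021 — First sample description.",
            "Jan. 2, 2021 — Second sample description.",
        ],
    }
)

# Split once on the dash: column 0 is the date, column 1 the description
parts = heart_df["Details"].str.split("—", n=1, expand=True)
heart_df["Date"] = parts[0].str.strip()
heart_df["Description"] = parts[1].str.strip()

# The combined column is no longer needed
heart_df = heart_df.drop(columns=["Details"])
print(heart_df)
```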

We have now scraped the latest heart disease news and stored it in a dataframe. I hope you enjoyed this web scraping tutorial.

References

  1. https://www.who.int/health-topics/cardiovascular-diseases/#tab=tab_1
  2. https://www.statnews.com/2020/07/27/covid19-concerns-about-lasting-heart-damage/