Webscrape Heart Disease News Using BeautifulSoup
How severely can heart disease affect our health?
*This article is for educational purposes only
What are the latest discoveries in heart disease research?
Heart disease, such as coronary artery disease or heart failure, can be fatal.
“17.9 million people die each year from CVDs (cardiovascular disease), an estimated 31% of all deaths worldwide.” (reported by the World Health Organization)
Patients infected with Covid-19 have also been reported to develop heart problems afterwards.
“…More than two months later, infected patients were more likely to have troubling cardiac signs…” (Stat News, citing Valentina O. Puntmann’s recent study)
What is the latest heart disease research news? Let’s go through this tutorial.
We are going to scrape the latest headlines from www.sciencedaily.com.
403 error when crawling without a user agent
You might get a 403 error if you try to crawl a website that identifies you as a bot.
from urllib.request import Request, urlopen
url = "https://www.sciencedaily.com/news/health_medicine/heart_disease/"
headers = {"Accept-Language": "en-US, en;q=0.5"}
result = Request(url, headers=headers)
webpage = urlopen(result).read()
Pass a User-Agent string in the ‘headers’ parameter instead.
You can refer to the link here.
Code:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
results = Request(url, headers=headers)
webpage = urlopen(results).read()
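The request above can also be wrapped in small helpers, so a 403 response is handled explicitly instead of crashing the script. This is only a sketch: the function names `build_request` and `fetch` are made up for illustration, and returning `None` on a 403 is one possible choice among several.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

URL = "https://www.sciencedaily.com/news/health_medicine/heart_disease/"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
)


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url):
    """Download the page body; return None if the server answers 403."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 403:
            return None  # the site refused the request
        raise  # any other HTTP error is passed on to the caller
```

Calling `fetch(URL)` returns the raw HTML bytes when the request is accepted.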
Parse the webpage with BeautifulSoup, using the lxml HTML parser.
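Here is a minimal sketch of the parsing step. It runs on a small made-up HTML string rather than the live page, and it uses Python's built-in "html.parser" so that it works without extra packages. To follow the article, pass "lxml" as the second argument instead (this needs the lxml package installed) and feed in the downloaded `webpage`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page
sample_html = """
<html>
  <head><title>Heart Disease News</title></head>
  <body>
    <div class="tab-content"><h3><a href="#">Sample headline</a></h3></div>
  </body>
</html>
"""

# Swap "html.parser" for "lxml" to use the lxml parser
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.get_text()
first_div_classes = soup.find("div")["class"]  # class is returned as a list
```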
Scrape headlines
What we are going to scrape is the news under the ‘Summaries’ section, which sits in a ‘div’ tag with the class name ‘tab-content’.
Wait!
There is another headline section above ‘Summaries’ that uses the same ‘div’ tag and also has ‘tab-content’ among its class names.
To keep BeautifulSoup from scraping content from ‘div.featured_blurbs.hero.tab-content’, filter out this class name:
for item in soup.findAll(True, {"class": re.compile("^(tab-content)$")}):
    if 'tab-content' in item.attrs['class'] and 'hero' not in item.attrs['class']:
        headlines = item
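The same filter can be tried on a small made-up snippet to see which block survives. This sketch uses `find_all` with the `class_` argument instead of a regular expression, and the helper name `find_summaries` is hypothetical. Unlike the loop above, it stops at the first matching block rather than keeping the last one.

```python
from bs4 import BeautifulSoup

# Made-up markup: a featured block and a summaries block sharing 'tab-content'
sample_html = """
<div class="featured_blurbs hero tab-content"><h3>Featured story</h3></div>
<div class="tab-content"><h3>Summary story</h3></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def find_summaries(soup):
    """Return the first 'tab-content' div that is not also a 'hero' block."""
    for item in soup.find_all("div", class_="tab-content"):
        if "hero" not in item.get("class", []):
            return item
    return None


headlines = find_summaries(soup)
```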
Extract headline titles
Here is the code to extract the headline titles:
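The original snippet is not reproduced here, so the following is only a sketch of how the titles might be collected. It assumes each title sits in an ‘h3’ tag inside the summaries block. That markup and the sample text are assumptions for illustration, so inspect the live page before relying on them.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup; the real page structure may differ
sample_html = """
<div class="tab-content">
  <h3 class="latest-head"><a href="/a">First sample headline</a></h3>
  <div class="latest-summary">...</div>
  <h3 class="latest-head"><a href="/b">Second sample headline</a></h3>
  <div class="latest-summary">...</div>
</div>
"""
headlines = BeautifulSoup(sample_html, "html.parser").find("div", class_="tab-content")

# Collect the visible text of every h3 heading in the block
title_list = [h3.get_text(strip=True) for h3 in headlines.find_all("h3")]
titles = pd.DataFrame({"Title": title_list})
```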
Here comes the dataframe:
Extract Dates and Descriptions
Inspect the dates and descriptions in the ‘Summaries’ section.
Here is the code to extract the details of the latest summaries:
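As the original snippet is not shown, here is a hedged sketch of this step. It assumes each summary lives in a ‘div’ with the class ‘latest-summary’, holding the date followed by the description. The class name, the sample text and the em dash (U+2014) between date and description are all assumptions to check against the live page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup; \u2014 is an em dash between date and description
sample_html = """
<div class="tab-content">
  <h3 class="latest-head"><a href="/a">First sample headline</a></h3>
  <div class="latest-summary"><span class="story-date">Jan. 1, 2020 \u2014</span> Sample description one.</div>
  <h3 class="latest-head"><a href="/b">Second sample headline</a></h3>
  <div class="latest-summary"><span class="story-date">Jan. 2, 2020 \u2014</span> Sample description two.</div>
</div>
"""
headlines = BeautifulSoup(sample_html, "html.parser").find("div", class_="tab-content")

# Join the date and description text of each summary into one string
detail_list = [
    div.get_text(" ", strip=True)
    for div in headlines.find_all("div", class_="latest-summary")
]
# 'deetails' matches the variable name used in the concat step of this article
deetails = pd.DataFrame({"Details": detail_list})
```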
Note that the dates and descriptions are not separated yet.
Concatenate the ‘Title’ and ‘Details’ columns into the same dataframe:
heart_df = pd.concat([titles, deetails], axis=1)
Almost there. What’s left is to separate the dates from the descriptions into different columns.
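# Split once on the em dash (U+2014) between date and description; the separator is an assumption
heart_df[['Date','Description']] = heart_df.Details.apply(lambda x: pd.Series(str(x).split("\u2014", 1)))
heart_df = heart_df.drop(['Details'], axis=1)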
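To see the concatenate-and-split steps end to end, here is a self-contained sketch on made-up data. It assumes the date and description are separated by an em dash (U+2014), and it uses the vectorised `str.split` with `expand=True` as an alternative to `apply`. The sample rows are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for the scraped data
titles = pd.DataFrame({"Title": ["First sample headline", "Second sample headline"]})
deetails = pd.DataFrame({
    "Details": [
        "Jan. 1, 2020 \u2014 Sample description one.",
        "Jan. 2, 2020 \u2014 Sample description two.",
    ]
})

# Put titles and details side by side
heart_df = pd.concat([titles, deetails], axis=1)

# Split each entry once on the em dash, producing two columns
parts = heart_df["Details"].str.split("\u2014", n=1, expand=True)
heart_df["Date"] = parts[0].str.strip()
heart_df["Description"] = parts[1].str.strip()
heart_df = heart_df.drop(columns=["Details"])
```

Limiting the split to one occurrence keeps any later dash inside the description intact.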
Now we have scraped the latest heart disease news and stored it in a dataframe. Hope you enjoyed this web scraping tutorial.