webscraping

Web Srcaping with Python

Webscraping is the process of gathering information from the Internet. The process involves automation. Automated webscrpaing can help us collect data in short period. We can write a code that will get the information as many times from many pages.

API

Application Programming Interface (API) is another alternative to web scraping. However, only some of the website offers APIs. This allows to access data from the web page without parsing HTML and instead access the data directly using formats like JSON and XML.

In this blog, I will demonstrate some basic commands to extract data from the web page. We also need to install required packages before doing web scraping.

Reasons why to use Web Scraping over APIs

There are couple of reasons why we need WebScraping and why it is preferable over the use of an API.

  • The website you want to extract data from does not provide an API
  • The API provided is not free
  • The API provided is rate limited: you can only access it a number of times per visit
  • The API does not expose all the data you wish to obtain.

Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML.

Installing request module in python. It allows to send HTTP requests using Python. The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.)

pip install requests

Installing BeautifulSoup Library:

pip install beautifulsoup4 

Installing a Parser:

pip install lxml
pip install html5lib

First, we need to determine how the web page has been created. It is important to Inspect the browser before doing any Iterations, webscraping, etc.

I have follwed the step by step from this book Practical Web Scraping for Data Science: Best Practices and Examples with Python which is available for CSUN students in SAFARI online book website.

EXAMPLE:

Here, I will access the CSUN computer science department web page and perform simple webscraping.

from bs4 import BeautifulSoup
import requests

url_cs= "https://www.csun.edu/engineering-computer-science/computer-science"

Creating a BeautifulSoup Object and getting the Header information:

find function allows to look for specified tags. Here, I have used ‘find’ to look for h1 header

req= requests.get(url_cs)
html_soup_cs= BeautifulSoup(req.text, 'html.parser')
html_soup_cs.find('h1')

output:

<h1>Welcome to the Computer Science Department</h1>

Another important task or Iteration is extracting all the URLs within a webpage. Here, find_all function will look into all the tags that match.

for links_cs in html_soup_cs.find_all('a'):
	print(links_cs.get('href')

output:

/
/engineering-computer-science
/engineering-computer-science/computer-science
#main-content
#main-content
https://www.csun.edu/accessibility
http://www.csun.edu/peoplefinder
http://www.csun.edu/calendar
http://www.csun.edu/atoz/
http://www.csun.edu/webmail
https://auth.csun.edu/cas/login?service=https://mynorthridge.csun.edu/psp/PANRPRD/?cmd=login&languageCd=ENG
http://www.csun.edu/webmail
http://www.csun.edu/peoplefinder
https://canvas.csun.edu/
/engineering-computer-science/computer-science
/engineering-computer-science/computer-science/academic-program
.....
.....

#NOTE:

Performing Web Scraping is an important tool to understand the data available in the websites. However, it is important to understand the Rules and Policies before doing any scraping from any random websites.

It would be good ideal practice to do research before violating any Terms of Service before jumping to do Web Scraping.