An Introduction to Web Scraping
with Python and DataCamp
Olga Scrivner, Research Scientist, CNS, CEWIT
WIM, February 23, 2018
Objectives
Materials: DataCamp.com
Review: Importing files
Accessing Web
Review: Processing text
Practice, practice, practice!
Credits
Hugo Bowne-Anderson - Importing Data in Python (Part 1 and Part 2)
Jeri Wieringa - Intro to Beautiful Soup
Importing Files
File Types: Text
Text files are structured as a sequence of lines
Each line includes a sequence of characters
Each line is terminated with a special end-of-line (EOL) character
Special Characters: Review
Special Characters: Answers
Modes
Reading Mode
◦ ‘r’
Writing Mode
◦ ‘w’
Quiz question: Why do we use quotes with ‘r’ and ‘w’?
Answer: ‘r’ and ‘w’ are one-character strings
Open - Close
Open File - open(name, mode)
◦ name = ’filename’
◦ mode = ’r’ or mode = ’w’
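As a minimal sketch of open() and close() (the file name example.txt is hypothetical):

```python
# Open a new file in writing mode ('w'), write to it, and close it
out = open('example.txt', 'w')
out.write('Hello, world!\n')
out.close()

# Reopen the same file in reading mode ('r')
src = open('example.txt', 'r')
print(src.read())  # Hello, world!
src.close()
```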
Open New File
Read File
Read the Entire File - filename.read()
Read ONE Line - filename.readline()
- Return the FIRST line
- Return the THIRD line
Read lines - filename.readlines()
What type of object and what is the length of this object?
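A sketch of the three reading methods (the file lines.txt and its contents are stand-ins). It also answers the quiz question: readlines() returns a list, here of length 3.

```python
# Create a small three-line file to practice on
with open('lines.txt', 'w') as out:
    out.write('first\nsecond\nthird\n')

src = open('lines.txt', 'r')
print(src.readline())   # returns the FIRST line
src.readline()          # each call advances one line...
print(src.readline())   # ...so the third call returns the THIRD line
src.close()

src = open('lines.txt', 'r')
lines = src.readlines() # returns ALL lines at once
print(type(lines), len(lines))  # <class 'list'> 3
src.close()
```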
Python Libraries
Import Modules (Libraries)
Beautiful Soup
urllib
More in next slides ...
For installation - https://programminghistorian.org/lessons/intro-to-beautiful-soup
Review: Module I
To use external functions (modules), we need to import them:
1. Declare it at the top of the code
2. Use import
3. Call the module
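For example, with the standard random module (any module imports the same way):

```python
import random  # steps 1-2: declare the import at the top of the code

# step 3: call a function through the module name
print(random.randint(1, 10))  # a random integer between 1 and 10
```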
Review: Modules II
To refer to and import a specific function from a module:
1. Declare it at the top of the code
2. Use from ... import
3. Call the imported function directly, e.g. randint() from the
random module
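With from ... import, the function is then called without the module prefix:

```python
from random import randint  # import only the randint function

# call it directly, with no random. prefix needed
print(randint(1, 6))  # a random integer between 1 and 6
```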
How to Import Packages with Modules
1. Install via a terminal or console
◦ Type command prompt in Windows search
◦ Type terminal in Mac search
2. Check your Python version
3. Click return/enter
Python 2 (pip) or Python 3 (pip3)
pip or pip3 - a tool for installing Python packages
To check if pip is installed:
https://packaging.python.org/tutorials/installing-packages/
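In a terminal, the checks might look like this (use python3/pip3 if your system defaults to Python 2):

```shell
# Check which Python version is installed
python --version

# Check that pip is installed (prints its version and location)
pip --version

# Install a package, e.g. Beautiful Soup
pip install beautifulsoup4
```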
Web Scraping Workflow
Web Concept
1. Import the necessary modules (functions)
2. Specify URL
3. Send a REQUEST
4. Catch RESPONSE
5. Return HTML as a STRING
6. Close the RESPONSE
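The six steps above, sketched with urllib (the URL is a stand-in):

```python
from urllib.request import urlopen, Request  # 1. import the modules

url = 'https://www.example.com'         # 2. specify the URL (hypothetical)
request = Request(url)                  # 3. package the GET REQUEST
response = urlopen(request)             # 4. send it and catch the RESPONSE
html = response.read().decode('utf-8')  # 5. return the HTML as a STRING
response.close()                        # 6. close the RESPONSE

print(html[:60])
```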
URLs
1. URL - Uniform/Universal Resource Locator
2. A URL for web addresses consists of two parts:
2.1 Protocol identifier - http: or https:
2.2 Resource name - datacamp.com
3. HTTP - HyperText Transfer Protocol
4. HTTPS - more secure form of HTTP
5. Going to a website = sending an HTTP request (GET request)
6. HTML - HyperText Markup Language
URLLIB package
Provides an interface for getting data across the web. Instead of
file names we use URLs
Step 1 urllib is part of the Python 3 standard library (no
separate installation needed)
Step 2 Import the function urlretrieve - to RETRIEVE urls
during the REQUEST
Step 3 Create a variable url and provide the url link
url = ‘https://somepage’
Step 4 Save the retrieved document locally
Step 5 Read the file
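Steps 2-5 above might look like this (the URL and file name are stand-ins):

```python
from urllib.request import urlretrieve  # Step 2: import urlretrieve

url = 'https://www.example.com'  # Step 3: the url link (hypothetical)
urlretrieve(url, 'page.html')    # Step 4: save the document locally

with open('page.html', 'r') as src:  # Step 5: read the file
    print(src.read()[:60])
```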
Your Turn - DataCamp
DataCamp.com - create a free account using IU email
1. Log in
2. Select Groups
3. Select RBootcampIU - see Jennifer if you do not see it
4. Go to Assignments and select Importing Data in Python
Today’s Practice
Importing Flat Files
urlretrieve has two arguments: url (input) and file name (output)
Example: urlretrieve(url, ‘file.name’)
Opening and Reading Files
read_csv has two arguments: url and sep (separator)
df.head() - shows the first rows of the resulting DataFrame
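A sketch with a small in-memory CSV standing in for the downloaded file (pd.read_csv also accepts a URL directly; the data is hypothetical):

```python
import pandas as pd
from io import StringIO

# Hypothetical data; a real call could be pd.read_csv(url, sep=';')
csv_data = 'name;score\nAda;90\nGrace;95\n'
df = pd.read_csv(StringIO(csv_data), sep=';')

print(df.head())  # df.head() shows the first rows of the DataFrame
```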
Importing Non-flat Files
read_excel has two arguments: url and sheetname
To read all sheets, sheetname = None
Let’s use a sheetname ’1700’
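A sketch that builds a small workbook locally instead of downloading one (assumes an Excel engine such as openpyxl is installed; note that pandas 0.21+ spells the argument sheet_name):

```python
import pandas as pd

# Create a stand-in workbook with one sheet named '1700' (hypothetical data)
pd.DataFrame({'country': ['UK'], 'latitude': [54.0]}).to_excel(
    'latitudes.xlsx', sheet_name='1700', index=False)

# Read one named sheet; sheet_name=None would return all sheets as a dict
df = pd.read_excel('latitudes.xlsx', sheet_name='1700')
print(df.head())
```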
HTTP Requests
GET request
Import the requests package
HTTP with urllib
Print HTTP with urllib
Use response.read()
Return Web as a String
Use r.text
Scraping Web - HTML
Scraping Web - BeautifulSoup Workflow
Many Useful Functions
soup.title
soup.get_text()
soup.find_all(’a’)
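The three functions applied to a small in-line HTML document (a stand-in for a scraped page):

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>My Page</title></head>'
        '<body><p>Hello</p><a href="https://example.com">link</a></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)          # <title>My Page</title>
print(soup.get_text())     # all the text with the tags stripped
print(soup.find_all('a'))  # a list of every <a> tag
```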
Parsing HTML with BeautifulSoup
Turning a Webpage into Data with BeautifulSoup
soup.title
soup.get_text()
Turning a Webpage into Data - Hyperlinks
HTML tag - <a>
find_all(’a’)
Collect all href: link.get(’href’)
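Collecting every href from an in-line snippet (a stand-in for a scraped page):

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com">one</a> <a href="/about">two</a>'
soup = BeautifulSoup(html, 'html.parser')

# find_all('a') returns every hyperlink tag; .get('href') reads its target
for link in soup.find_all('a'):
    print(link.get('href'))
# https://example.com
# /about
```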