Python Web Scraping Guide
Data Collection:
From A to Z
Introduction
In today’s fast-changing business world, data gathering is essential
for every data-driven business, so web scraping is becoming
increasingly well known. Collecting data manually at scale is
time-consuming; by automating the whole process with web scraping,
companies can focus on more vital tasks.
Why is Python Used For Web Scraping
Python advantages for web scraping
Python libraries used for web scraping
Conclusion
Why is Python Used For Web Scraping
Python is an interpreted, general-purpose, high-level programming
language. It is used for almost anything you might need, from
building web apps to data analysis. Python’s creators paid close
attention to syntax and code readability, so developers can express
concepts in fewer lines of code. Readability was one of the main
goals behind Python’s design.
Python advantages for web scraping
Diverse libraries. Python has a fantastic collection of libraries such
as BeautifulSoup, Selenium, lxml, and many more. These libraries are a
perfect fit for web scraping and for further work with the extracted
data. You'll find more information about these libraries below.
Easy to use. To put it simply, Python is easy to code. Of course, it’s
wrong to believe that you could easily write web scraping code
without any programming knowledge. But, compared to other
languages, it’s much easier to use, as you do not have to add
semicolons “;” or curly brackets “{}” everywhere. Many developers
agree that this is why Python code looks less messy. Furthermore,
Python syntax is clear and easy to read, so developers can easily
navigate between different blocks in the code.
Saves time. As you probably know, web scraping was created to
simplify time-consuming tasks like collecting vast amounts of data
manually. Using Python for web scraping is similar: you can write a
small amount of code that completes a large task, which saves
developers a lot of time.
Community. As Python is one of the most popular programming
languages, it also has a very active community. Developers
share their knowledge on a wide range of questions, so if you are
struggling while writing code, you can always search for help.
Python libraries used for web scraping
Selenium. The primary purpose of Selenium is to test web
applications. However, it is not limited to that, as you can also use
Selenium for web scraping. It automates a browser, which helps when
the script needs to interact with a page to perform repetitive tasks
like clicking, scrolling, etc.
Requests (HTTP for Humans). This library is used for making various
types of HTTP requests, such as GET and POST. The Requests library
retrieves only the static content of a page and doesn’t parse the
HTML data extracted from websites. However, the Requests library can
be used for basic web scraping tasks.
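As a rough illustration of the Requests library described above, the sketch below prepares a GET request and defines a helper that downloads a page's static HTML. The URL `https://example.com/search`, the `q` parameter, and both function names are placeholders chosen for this example and are not part of the original guide.

```python
import requests


def build_get_request(url, params=None):
    """Prepare a GET request without sending it, so the final URL can be inspected."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_static_html(url, timeout=10):
    """Download the raw HTML of a page; content rendered by JavaScript is not included."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Placeholder URL and parameter, used only to show how the query string is built.
prepared = build_get_request("https://example.com/search", params={"q": "python"})
print(prepared.method, prepared.url)
```

The string returned by `fetch_static_html` is plain HTML, so it would still need a parser such as BeautifulSoup before any data can be extracted.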
Now that we know what Python is good for, it should be easier to
understand its appeal, especially for web scraping.
Python Web Scraping Tutorial:
Step-By-Step
Python is one of the easiest ways to get started, as it is an
object-oriented language whose classes and objects are
easier to use than those of many other languages. Additionally,
many libraries exist that make building a web scraping tool in
Python an absolute breeze.
This web scraping tutorial will work for all operating systems. There
will be slight differences when installing either Python or
development environments but not in anything else.
Getting to the libraries
A barebones installation isn’t enough for web scraping. We’ll be
using three important libraries – BeautifulSoup v4, Pandas, and
Selenium.
To install these libraries, start the terminal of your OS. Type in:
pip install beautifulsoup4 pandas selenium
Headless browsers can be used later on as they are more efficient for
complex tasks. Throughout this web scraping tutorial we will be
using the Chrome web browser although the entire process is
almost identical with Firefox.
current version. Download the webdriver that matches your
browser’s version.
If you already have Visual Studio Code installed, picking this IDE
would be the simplest option. Otherwise, I’d highly recommend
PyCharm for any newcomer as it has very little barrier to entry and
an intuitive UI. We will assume that PyCharm is used for the rest of
the web scraping tutorial.
In PyCharm, right click on the project area and “New -> Python File”.
Give it a nice name!
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
PyCharm might display these imports in grey as it automatically
marks unused libraries. Don’t accept its suggestion to remove
unused libs (at least yet).
We should begin by defining our browser. Depending on the
webdriver we picked back in “WebDriver and browsers” we should
type in:
# Raw string, so that backslashes such as \t are not read as escape sequences.
driver = webdriver.Chrome(executable_path=r'c:\path\to\windows\webdriver\executable.exe')
OR
driver = webdriver.Firefox(executable_path='/nix/path/to/webdriver/executable')
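Recent Selenium 4 releases no longer accept the `executable_path` argument shown above and expect a `Service` object instead. The sketch below shows one way this could look, together with the headless mode mentioned earlier. The function name `create_chrome_driver` and its parameters are assumptions made for this example, and the `--headless=new` flag assumes a reasonably recent Chrome.

```python
def create_chrome_driver(driver_path=None, headless=True):
    """Start Chrome through Selenium 4's Service/Options API (a sketch, not the guide's code)."""
    # Imported here so that this module can be loaded even where Selenium is not set up yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    if headless:
        options.add_argument("--headless=new")  # run Chrome without a visible window

    # Without an explicit path, recent Selenium versions try to locate a driver themselves.
    service = Service(executable_path=driver_path) if driver_path else Service()
    return webdriver.Chrome(service=service, options=options)
```

A call such as `driver = create_chrome_driver('/nix/path/to/webdriver/executable')` would then replace the `webdriver.Chrome(executable_path=...)` line, and the rest of the tutorial code stays the same.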
Picking a URL
Before performing our first test run, choose a URL. As this web
scraping tutorial is intended to create an elementary application, we
highly recommend picking a simple target URL:
Select the landing page you want to visit and pass its URL as the
argument to driver.get('URL'). Selenium requires that the connection
protocol is provided. As such, it is always necessary to attach “http://”
or “https://” to the URL.
driver.get('https://your.url/here?yes=brilliant')
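Because a missing protocol is an easy mistake to make, a small check before calling `driver.get` can help. The helper below is a minimal sketch using only the standard library. The name `ensure_scheme` and the choice of `https` as the default are assumptions made for this example.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default_scheme="https"):
    """Return the URL unchanged if it already starts with http:// or https://,
    otherwise prepend the default scheme."""
    if urlparse(url).scheme in ("http", "https"):
        return url
    return f"{default_scheme}://{url.lstrip('/')}"


print(ensure_scheme("your.url/here?yes=brilliant"))
print(ensure_scheme("http://your.url/here"))
```

The result can be passed straight to the driver, for example `driver.get(ensure_scheme(target))`.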
Lists in Python are ordered, mutable and allow duplicate members.
Other collections, such as sets or dictionaries, can be used but lists
are the easiest to use. Time to make more objects!
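The list properties mentioned above can be seen in a few lines. The item names are made-up sample values.

```python
results = []                  # start with an empty list
results.append("Widget")      # items keep their insertion order
results.append("Gadget")
results.append("Widget")      # duplicates are allowed
results[1] = "Gizmo"          # lists are mutable, so entries can be replaced

unique_names = set(results)   # a set, by contrast, drops duplicates and has no order

print(results)
print(len(unique_names))
```

Keeping duplicates and order is usually what a scraper wants, since every entry corresponds to one element found on the page.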
content = driver.page_source
Before we go on, let’s recap how our code should look so far:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(content, 'html.parser')
Extracting data with our Python web scraper
We have finally arrived at the fun and difficult part – extracting data
out of the HTML file. Since in almost all cases we are taking small
sections out of many different parts of the page and we want to store
it into a list, we should process every smaller section and then add it
to the list:
…
Let’s visit the chosen URL in a real browser before continuing. Open
the page source by using CTRL+U (Chrome) or right click and select
“View Page Source”. Find the “closest” class where the data is nested.
Another option is to press F12 to open DevTools and use the element
picker. For example, it could be nested as:
<h4 class="title">
</h4>
Our attribute, “class”, would then be “title”. If you picked a simple
target, in most cases data will be nested in a similar way to the
example above. Complex targets might require more effort to get
the data out. Let’s get back to coding and add the class we found in
the source:
Our loop will now go through all objects with the class “title” in the
page source. We will process each of them:
name = element.find('a')
Let’s take a look at how our loop goes through the HTML:
<h4 class="title">
Our first statement (in the loop itself) finds all elements
whose “class” attribute contains “title”. We then execute
another search within each of those elements. This search finds the
first <a> tag inside the element (<a> is matched while partial matches
like <span> are not). Finally, the object is assigned to the variable “name”.
We could then append the object “name” to our previously created list
“results”, but doing this would bring the entire <a href…> tag
with the text inside it into one element. In most cases, we would only
need the text itself without any additional tags.
# `<element>.text` extracts the text in the element, omitting the HTML tags.
results.append(name.text)
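To see the difference between storing the tag and storing its text, the sketch below runs the same kind of loop on a small inline HTML snippet instead of a live page. The HTML, the link targets, and the item names are invented for this example.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the <h4 class="title"> nesting shown earlier.
html = """
<h4 class="title"><a href="/item/1">First item</a></h4>
<h4 class="title"><a href="/item/2">Second item</a></h4>
<h4 class="other"><a href="/item/3">Ignored item</a></h4>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for element in soup.find_all(attrs={"class": "title"}):
    name = element.find("a")       # first <a> tag inside the matched element
    if name is not None:           # skip elements that contain no link
        results.append(name.text)  # keep only the text, not the whole tag

print(results)
```

Appending `name` instead of `name.text` would store the whole `<a href=...>` tag objects in the list.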
Our loop will go through the entire page source, find all the
occurrences of the classes listed above, then append the nested data
to our list:
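import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
# Visit every element whose class is "title" and keep the text of its first <a> tag.
for element in soup.find_all(attrs={'class': 'title'}):
    name = element.find('a')
    results.append(name.text)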
Note that the two statements after the loop are indented. Loops
require indentation to denote nesting. Any consistent indentation
will be considered legal. Loops without indentation will output an
“IndentationError” with the offending statement pointed out with
the “arrow”.
whether we actually get the data assigned to the right object and
move to the array correctly.
One of the simplest ways to check whether the data acquired during
the previous steps is being collected correctly is to use “print”. Since
lists hold many different values, a simple loop is often used to
print each entry on a separate line of the output:
for x in results:
    print(x)
# Alternatively, print the whole list on a single line:
print(results)
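import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all(attrs={'class': 'title'}):
    name = a.find('a')
    results.append(name.text)
# Print each collected entry on its own line to check the output.
for x in results:
    print(x)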
Running our program now should display no errors and display
acquired data in the debugger window. While “print” is great for
testing purposes, it isn’t all that great for parsing and analyzing data.
You might have noticed that “import pandas” is still greyed out so
far. We will finally get to put the library to good use. I recommend
removing the “print” loop for now as we will be doing something
similar but moving our data to a csv file.
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
Our two new statements rely on the pandas library. Our first
statement creates a variable “df” and turns its object into a
two-dimensional data table. “Names” is the name of our column
while “results” is the list that fills it. Our second statement writes
that table to a file named “names.csv”. Note that pandas can
create multiple columns; we just don’t have enough lists to utilize
those parameters (yet).
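A quick way to check what pandas will write is to send the CSV output to an in-memory buffer instead of a file. The sketch below does that with two made-up entries standing in for scraped text.

```python
import io

import pandas as pd

results = ["First item", "Second item"]  # sample data standing in for scraped text
df = pd.DataFrame({"Names": results})    # one column named "Names", one row per entry

buffer = io.StringIO()                   # in-memory stand-in for a file such as names.csv
df.to_csv(buffer, index=False)           # index=False leaves out the row numbers
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
```

The first line of the output is the column header, followed by one line per list entry.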
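import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all(attrs={'class': 'title'}):
    name = a.find('a')
    results.append(name.text)
# Turn the list into a one-column table and write it to names.csv.
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')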
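import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
other_results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
# 'other-class' is a placeholder: use the class that wraps your second data point.
for b in soup.find_all(attrs={'class': 'other-class'}):
    # Assume that the data is nested in a <span>.
    name2 = b.find('span')
    other_results.append(name2.text)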
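So far the newest iteration of our code should look something like
this:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
other_results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all(attrs={'class': 'title'}):
    name = a.find('a')
    results.append(name.text)
for b in soup.find_all(attrs={'class': 'other-class'}):  # placeholder class name
    name2 = b.find('span')
    other_results.append(name2.text)
df = pd.DataFrame({'Names': results, 'Categories': other_results})
df.to_csv('names.csv', index=False, encoding='utf-8')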
If you are lucky, running this code will output no error. In some cases
“pandas” will raise a “ValueError” stating that all arrays must be of
the same length. Simply put, the lengths of the lists “results” and
“other_results” are unequal, therefore pandas cannot create a
two-dimensional table.
df = pd.DataFrame({'Names': series1, 'Categories': series2})
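The statements that create `series1` and `series2` are not shown here. A minimal sketch of how they could be built with `pd.Series`, using made-up lists of uneven length, looks like this:

```python
import pandas as pd

results = ["A", "B", "C"]      # made-up names
other_results = ["x", "y"]     # made-up categories, one entry short

# Series are aligned by their index, so the shorter one is padded with NaN.
series1 = pd.Series(results, name="Names")
series2 = pd.Series(other_results, name="Categories")
df = pd.DataFrame({"Names": series1, "Categories": series2})

print(df)
```

The table has three rows, and the last row has no category. This is why the rows should not be read as matched pairs.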
Note that data will not be matched as the lists are of uneven length
but creating two series is the easiest fix if two data points are
needed. Our final code should look something like this:
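import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
other_results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all(attrs={'class': 'title'}):
    name = a.find('a')
    results.append(name.text)
for b in soup.find_all(attrs={'class': 'other-class'}):  # placeholder class name
    name2 = b.find('span')
    other_results.append(name2.text)
# Series of different lengths are aligned by index; the shorter one is padded with NaN.
series1 = pd.Series(results, name='Names')
series2 = pd.Series(other_results, name='Categories')
df = pd.DataFrame({'Names': series1, 'Categories': series2})
df.to_csv('names.csv', index=False, encoding='utf-8')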
Running it should create a csv file named “names” with two columns
of data.
● Scrape several URLs in one go. There are many ways to
implement such a feature. One of the simplest options is to
simply repeat the code above and change URLs each time.
That would be quite boring. Build a loop and an array of URLs
to visit.
browser. It’s nearly impossible to list all of the possible options
when it comes to creating a scraping pattern.
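For the suggestion to loop over an array of URLs, one possible shape is sketched below. The function name `scrape_many` and its two callback parameters are assumptions made for this example. The dictionary of fake pages stands in for real page sources, so the sketch runs without a browser.

```python
def scrape_many(urls, fetch_page, parse_page):
    """Visit each URL in turn and collect everything the parser returns."""
    all_results = []
    for url in urls:
        html = fetch_page(url)                # e.g. driver.get(url), then driver.page_source
        all_results.extend(parse_page(html))  # e.g. the BeautifulSoup loop from earlier
    return all_results


# Stand-ins so the sketch runs without a browser or network access.
fake_pages = {
    "https://example.com/page/1": "alpha,beta",
    "https://example.com/page/2": "gamma",
}
urls = list(fake_pages)
collected = scrape_many(
    urls,
    fetch_page=fake_pages.get,
    parse_page=lambda html: html.split(","),
)

print(collected)
```

Passing the fetching and parsing steps in as functions keeps the loop itself unchanged when the target site or the parsing rules change.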
Scrape Images From a Website with
Python
Previously we outlined how to scrape text-based data with Python.
Throughout the tutorial we went through the entire process: all the
way from installing Python, getting the required libraries, setting
everything up to coding a basic web scraper and outputting the
acquired data into a .csv file. In the second installment, we will learn
how to scrape images from a website and store them in a set
location.
# install the Pillow library (used for image processing)
pip install Pillow
Our data extraction process begins almost exactly the same (we will
import libraries as needed). We assign our preferred webdriver,
select the URL from which we’ll scrape image links and create a list
to store them in. As our Chrome driver arrives at the URL, we use the
variable ‘content’ to point to the page source and then “soupify” it
with BeautifulSoup.
# Example of how to define a function and select custom arguments for it
def function_name(arguments):
    pass  # the function body goes here
Before
name = a.find('a')
results.append(name.text)
After
name = a.find(location)
results.append(name.get(source))
parameter ‘source’ to it. We use ‘source’ to indicate the field in the
website where image links are stored. They will be nested in ‘src’,
‘data-src’ or other similar HTML attributes.
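import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

def parse_image_urls(classes, location, source):
    # Collect the `source` attribute of the first `location` tag in each matching element.
    for a in soup.find_all(attrs={'class': classes}):
        name = a.find(location)
        results.append(name.get(source))

parse_image_urls("blog-card__link", "img", "src")
df = pd.DataFrame({"links": results})
df.to_csv('links.csv', index=False, encoding='utf-8')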
We will use the requests library to acquire the content stored in the
image URL. Our “for” loop above will iterate over our ‘results’ list.
import io
We are not done yet. So far the “image” we have above is just a
Python object.
#we use Pillow to convert our object to an RGB image
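The conversion step can be tried without downloading anything. The sketch below builds a tiny PNG in memory as a stand-in for downloaded image bytes, then opens it through `io.BytesIO` and converts it to RGB. The helper name `bytes_to_rgb_image` and the sample image are assumptions made for this example.

```python
import io

from PIL import Image


def bytes_to_rgb_image(image_content):
    """Wrap raw bytes in a file-like object, open them with Pillow, and convert to RGB."""
    image_file = io.BytesIO(image_content)
    return Image.open(image_file).convert("RGB")


# Build a small semi-transparent PNG in memory to stand in for downloaded bytes.
sample = io.BytesIO()
Image.new("RGBA", (4, 2), (255, 0, 0, 128)).save(sample, format="PNG")

image = bytes_to_rgb_image(sample.getvalue())
print(image.mode, image.size)
```

Converting to RGB drops the alpha channel, which gives every saved image the same mode regardless of how the source file was encoded.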
We are still not done as we need to find a place to save our images.
Creating a folder “Test” for the purposes of this tutorial would be the
easiest option.
import pathlib
import hashlib
file_path = pathlib.Path('nix/path/to/test', hashlib.sha1(image_content).hexdigest()[:10] + '.png')
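The file name above is derived from a hash of the image bytes, so the same image always maps to the same name. The sketch below wraps that idea in a helper. The name `build_image_path` and the sample bytes are assumptions made for this example.

```python
import hashlib
import pathlib


def build_image_path(image_content, output_dir):
    """Name the file after the first 10 hex characters of the SHA-1 hash of its content."""
    filename = hashlib.sha1(image_content).hexdigest()[:10] + ".png"
    return pathlib.Path(output_dir) / filename


first = build_image_path(b"abc", "nix/path/to/test")
second = build_image_path(b"abc", "nix/path/to/test")
different = build_image_path(b"abcd", "nix/path/to/test")

print(first.name)
print(first == second, first == different)
```

Downloading the same image twice overwrites one file instead of creating a duplicate, which is a convenient side effect of naming files by content.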
import hashlib
import io
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver

driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://your.url/here?yes=brilliant')
# Scroll to the bottom so that lazily loaded images end up in the page source.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
driver.quit()

# The definition of gets_url is not shown in this copy; this body follows the
# earlier parse_image_urls function and returns the list instead of filling a global.
def gets_url(classes, location, source):
    results = []
    for a in soup.find_all(attrs={'class': classes}):
        name = a.find(location)
        results.append(name.get(source))
    return results

if __name__ == "__main__":
    returned_results = gets_url("blog-card__link", "img", "src")
    for b in returned_results:
        image_content = requests.get(b).content
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = Path('nix/path/to/test', hashlib.sha1(image_content).hexdigest()[:10] + '.png')
        image.save(file_path, "PNG", quality=80)
● Python outputs a 403 Forbidden HTTP error.
Adding a user-agent will be enough for most cases. There are more
complex cases where servers might try to check other parts of the
HTTP header in order to confirm that it is a genuine user.
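Adding a user-agent with the Requests library could look like the sketch below. The header value is only an example of a browser-like string, and the function name `download_image_bytes` is an assumption made for this example.

```python
import requests

# Example header only; any realistic browser User-Agent string can be used here.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def download_image_bytes(url, headers=None, timeout=10):
    """Request the URL with a browser-like User-Agent and return the raw response bytes."""
    response = requests.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # surfaces 403 and other HTTP errors as exceptions
    return response.content
```

If a server still answers with 403, it may be checking other headers as well, and additional entries can be added to the same dictionary.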
Cleaning up
Our task is finished but the code is still messy. We can make our
application more readable and reusable by putting everything under
defined functions:
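import hashlib
import io
import pathlib

import pandas as pd
import requests
from bs4 import BeautifulSoup
from PIL import Image
from selenium import webdriver

def get_content_from_url(url):
    driver = webdriver.Chrome()  # add "executable_path=" if driver not in running directory
    driver.get(url)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    page_content = driver.page_source
    driver.quit()  # We do not need the browser instance for further steps.
    return page_content

def parse_image_urls(content, classes, location, source):
    soup = BeautifulSoup(content, "html.parser")
    results = []
    for a in soup.find_all(attrs={"class": classes}):
        name = a.find(location)
        results.append(name.get(source))
    return results

def save_urls_to_csv(image_urls):
    df = pd.DataFrame({"links": image_urls})
    df.to_csv("links.csv", index=False, encoding="utf-8")

# The definition of this helper is missing from this copy; its name and body are
# assumed from the earlier download loop.
def get_and_save_image_to_file(image_url, output_dir):
    image_content = requests.get(image_url).content
    image = Image.open(io.BytesIO(image_content)).convert("RGB")
    file_path = output_dir / (hashlib.sha1(image_content).hexdigest()[:10] + ".png")
    image.save(file_path, "PNG", quality=80)

def main():
    url = "https://your.url/here?yes=brilliant"
    content = get_content_from_url(url)
    image_urls = parse_image_urls(
        content=content, classes="blog-card__link",
        location="img", source="src",
    )
    save_urls_to_csv(image_urls)
    for image_url in image_urls:
        get_and_save_image_to_file(
            image_url,
            output_dir=pathlib.Path("nix/path/to/test"),
        )

if __name__ == "__main__":  # runs only when executed as a script, not when imported
    main()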
Everything is now nested under clearly defined functions and can be
called when imported. Otherwise it will run as it had previously.
By using the code outlined above, you should now be able to
complete basic image scraping tasks, such as downloading all images
from a website in one go.
Conclusion
Python is a perfect fit for building web scrapers and extracting data,
as it has a large selection of libraries and an active community to
turn to if you have issues with coding. One of the most
important reasons to use Python for web scraping is that it is
easy to learn, clear to read, and simple to write.