PYTHON
CRAWLER
From Beginner To Intermediate
Self Introduction
Cheng-Yi, Yu
erinus.startup@gmail.com
• LifePlus Inc.
Technical Manager
• Paganini Plus Inc.
Senior Software Developer
• Freelancer
~ 10 Years
DAY 1
Python
• Entry
if __name__ == '__main__':
# do something
• Function
def main():
# do something
• Package
import [package]
import [package] as [alias]
• Format
'%s' % ([parameters ...])
demo01.py
demo02.py
demo03.py
Python
• If … Else …
if [condition]:
# do something
else:
# do something
• For Loop
for item in list:
# do something
• Array Slice
array[start:end]
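The three constructs combine naturally in a short sketch (the scores are made up):

```python
scores = [70, 85, 92, 60, 88]

passed = []
for score in scores:        # for loop over a list
    if score >= 80:         # if ... else
        passed.append(score)
    else:
        continue            # skip failing scores

# slice: index `start` up to, but not including, `end`
first_two = passed[0:2]
```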
Python
• Array Creation From For Loop
– Object
[item.attr for item in array]
– Dictionary
[item[key] for item in array]
demo04.py
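A self-contained sketch of both comprehension forms (the names and values are invented):

```python
from collections import namedtuple

# attribute access inside a list comprehension
User = namedtuple('User', ['name', 'age'])
users = [User('amy', 30), User('bob', 25)]
names = [user.name for user in users]

# key access inside a list comprehension
rows = [{'id': 1}, {'id': 2}, {'id': 3}]
ids = [row['id'] for row in rows]
```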
Python
• In
– String
if <str> in <str>:
– Array
if item in list:
– Dictionary
if <str:key> in <dict>:
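All three membership tests in one runnable sketch (the values are invented); note that on a dictionary, `in` tests keys, not values:

```python
html = '<title>Python Crawler</title>'
tags = ['title', 'body']
headers = {'Content-Type': 'text/html'}

in_string = 'Crawler' in html          # substring test
in_list = 'title' in tags              # element test
in_dict = 'Content-Type' in headers    # key (not value) test
```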
Installation
• Ubuntu Fonts
https://design.ubuntu.com/font/
• Source Han Sans Fonts
https://github.com/adobe-fonts/source-han-sans
• Visual Studio Code And Extensions
https://code.visualstudio.com/
Installation
• Cmder
http://cmder.net/
• Python 3.6
https://www.python.org/
• Python Packages
pip install requests
pip install pyquery
pip install beautifulsoup4
pip install js2py
pip install selenium
Visual Studio Code
Visual Studio Code
• Install Python Extensions
Visual Studio Code
• Open Integrated Terminal
Built-in
• Json
import json
– json.loads
json.loads(<str>)
– json.dumps
json.dumps(<dict>)
demo05.py
demo06.py
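A sketch of both directions (the payload is made up):

```python
import json

# json.loads: JSON text -> Python dict/list values
data = json.loads('{"title": "news", "tags": ["python", "crawler"]}')
first_tag = data['tags'][0]

# json.dumps: Python dict -> JSON text
text = json.dumps({'ok': True})
```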
Built-in
• Xml
import xml.etree.ElementTree as ET
– Load From File
tree = ET.ElementTree(file=<str:filepath>)
tree = ET.parse(<str:filepath>)
root = tree.getroot()
– Load From String
root = ET.fromstring(<str>)
demo07.py
demo08.py
Built-in
• Xml
– Child Nodes
Only One Level
for node in root:
# do something
– XPath
Multiple Levels
nodes = root.findall(<str:expression>)
for node in nodes:
# do something
demo07.py
demo08.py
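Both traversal styles against a small inline document (the feed structure is invented):

```python
import xml.etree.ElementTree as ET

xml_text = ('<feed>'
            '<entry><title>first</title></entry>'
            '<entry><title>second</title></entry>'
            '</feed>')

root = ET.fromstring(xml_text)            # load from string

# iterating root visits only its direct children
children = [node.tag for node in root]

# findall walks multiple levels with an XPath-like expression
titles = [node.text for node in root.findall('entry/title')]
```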
Built-in
• Url
import urllib.parse as UP
– urlparse
result = UP.urlparse(<str:url>)
– urlunparse
url = UP.urlunparse(<ParseResult>)
– quote
str = UP.quote(<str:unquoted>)
– unquote
str = UP.unquote(<str:quoted>)
demo09.py
demo10.py
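All four helpers in one sketch (the url is a placeholder):

```python
import urllib.parse as UP

url = 'https://example.com/search?q=python%20crawler'

result = UP.urlparse(url)        # split into scheme, netloc, path, ...
rebuilt = UP.urlunparse(result)  # join the parts back into a url

quoted = UP.quote('python crawler')  # escape unsafe characters
plain = UP.unquote(quoted)           # and reverse the escaping
```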
Built-in
• Regular Expression
import re
– re.search
Find First Match
match = re.search(<str:pattern>, <str:text>)
match.group(<int:index>)
– re.findall
Find All Matches
finds = re.findall(<str:pattern>, <str:text>)
for find in finds:
# do something
demo11.py
Built-in
• Regular Expression
– re.split
Split By Pattern
re.split(<str:pattern>, <str:text>)
– re.sub
Replace By Pattern
re.sub(<str:pattern>, <str:replace>, <str:text>)
demo11.py
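The four `re` functions side by side (the text and pattern are invented):

```python
import re

text = 'id=42; id=77; id=100'

match = re.search(r'id=(\d+)', text)   # first match only
first = match.group(1)

finds = re.findall(r'id=(\d+)', text)  # every captured match

parts = re.split(r';\s*', text)        # break the text on the pattern

masked = re.sub(r'\d+', 'N', text)     # replace every match
```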
Built-in
• Regular Expression
– Expressions
1. Range
[Start-End]
[0-9], [a-z], [A-Z], [a-zA-Z], [0-9a-zA-Z], ...
demo12.py
Built-in
• Regular Expression
– Expressions
1. Zero Or More Times
*
2. One Or More Times
+
3. Zero Or One Time
?
demo12.py
Built-in
• Regular Expression
– Expressions
1. Numbers
\d = [0-9]
2. Words
\w = [a-zA-Z0-9_] (ASCII)
\w = [a-zA-Z0-9_] + Non-ASCII Word Characters (UTF-8)
3. Spaces, Tabs, …
\s
demo12.py
Built-in
• Regular Expression
– Expressions
1. Start With
^
2. End With
$
demo12.py
Built-in
• Regular Expression
– Expressions
1. Named Group
(?P<name>expr)
(?P<country>\+\d+)-(?P<phone>\d+)
demo13.py
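Anchors, an escaped `+`, character classes, and named groups together (the phone number is made up):

```python
import re

pattern = r'^\+(?P<country>\d+)-(?P<phone>\d+)$'
match = re.match(pattern, '+886-912345678')

country = match.group('country')  # read groups back by name
phone = match.group('phone')
```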
Analysis
• Chrome Developer Tools
– Elements
See Elements In DOM
Id, Class, Attribute, ...
– Network
See Requests, Responses
Urls, Methods, Headers, Cookies, Bodies, ...
Elements
• Find Element by Mouse Pointer
Elements
• Find Element by HTML Tag
Network
• See All Requests And Responses
Network
• See Details Of Request And Response
Network
• See Response Content
Network
• See Cookies Sent And Set
Documents
• Requests
http://docs.python-requests.org/
• PyQuery
https://pythonhosted.org/pyquery/
• Beautiful Soup 4
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
• Js2Py
https://github.com/PiotrDabkowski/Js2Py
Packages
• Requests
import requests
– Request
1. Method
GET, POST, ...
response = requests.get(<str:url>)
response = requests.post(<str:url>, data=<str:body>)
response = requests.post(<str:url>, data=<dict:body>)
2. Session
session = requests.Session()
response = session.get(<str:url>)
demo14.py
Packages
• Requests
import requests
– Request
• Headers
response = requests.get(<str:url>, headers=<dict>)
• Cookies
response = requests.get(<str:url>, cookies=<dict>)
Packages
• Requests
– Response
1. Status Code
response.status_code
2. Headers
response.headers[<str:name>]
3. Cookies
response.cookies[<str:name>]
demo14.py
Packages
• Requests
– Response
1. Binary Content
response.content
2. Text Content
response.text
3. Json Content
response.json()
demo14.py
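A sketch of a Session carrying a default header. To stay runnable without network access, it only *prepares* the request instead of sending it; the url is a placeholder:

```python
import requests

session = requests.Session()                      # keeps cookies across calls
session.headers['User-Agent'] = 'my-crawler/0.1'  # sent on every request

# preparing shows the final method and url without touching the network
req = requests.Request('GET', 'https://example.com/api', params={'q': 'python'})
prepared = session.prepare_request(req)
```

A real crawl would call `session.send(prepared)` or simply `session.get(url)`.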
Packages
• PyQuery
import pyquery
– Load From String
d = pyquery.PyQuery(<str:html>)
– Load From Url
d = pyquery.PyQuery(url=<str:url>)
– Load From File
d = pyquery.PyQuery(filename=<str:filepath>)
demo15.py
Packages
• PyQuery
– Find
p = d(<str:expression>)
– Element To HTML
p.html()
– Extract Text From Element
p.text()
– Get Value From Element Attribute
val = p.attr(<str:name>)
demo15.py
Packages
• Beautiful Soup 4
import bs4
– Load From String
d = bs4.BeautifulSoup(<str:html>, 'html.parser')
demo16.py
Packages
• Beautiful Soup 4
– Find
p = d.find_all(<str:tag>, <attr-key>=<attr-val>, ...)
p = d.find_all(<regex>, <attr-key>=<attr-val>, ...)
p = d.find_all(<array>, <attr-key>=<attr-val>, ...)
p = d.find(<str:tag>, <attr-key>=<attr-val>, ...)
p = d.find(<regex>, <attr-key>=<attr-val>, ...)
p = d.find(<array>, <attr-key>=<attr-val>, ...)
p = d.select(<str:expression>)
p = d.select_one(<str:expression>)
demo16.py
demo17.py
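The lookups above against a small inline document (the markup is invented). Note that matching on the `class` attribute needs the `class_` keyword, since `class` is reserved in Python:

```python
import bs4

html = ('<div>'
        '<a class="story" href="/a1">first</a>'
        '<a class="story" href="/a2">second</a>'
        '</div>')

d = bs4.BeautifulSoup(html, 'html.parser')

links = d.find_all('a', class_='story')   # every match
first = d.find('a', class_='story')       # first match only

texts = [p.get_text() for p in links]
href = first.get('href')

same = d.select_one('a.story')            # CSS selector form
```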
Packages
• Beautiful Soup 4
– Extract Text From Element
p.get_text()
– Get Value From Element Attribute
p.get(<str:name>)
demo16.py
demo17.py
Packages
• Js2Py
import js2py
– Eval
js2py.eval_js(<str:code>)
res = js2py.eval_js('var o = <str:js>; o')
demo18.py
DAY 2
WORKSHOP
• Apple Daily
https://tw.appledaily.com/
– Realtime News
https://tw.appledaily.com/new/realtime
WORKSHOP
• Facebook Page
– Cookies
– Feed
DAY 3
SELENIUM
• Download ChromeDriver
https://sites.google.com/a/chromium.org/chromedriver/
SELENIUM
• Import
import selenium.webdriver
• Initialize
option = selenium.webdriver.ChromeOptions()
• Start
driver = selenium.webdriver.Chrome(chrome_options=option)
• Browse
driver.get(<str:url>)
• Close
driver.quit()
SELENIUM
• Source
driver.page_source
SELENIUM
• Find One
– find_element_by_id
– find_element_by_name
– find_element_by_tag_name
– find_element_by_class_name
– find_element_by_css_selector
SELENIUM
• Find Multiple
– find_elements_by_name
– find_elements_by_tag_name
– find_elements_by_class_name
– find_elements_by_css_selector
SELENIUM
• Actions
– send_keys
– click
WORKSHOP
• Facebook Page
– Login
– Feed