Scraping recalcitrant web sites with Python & Selenium
Roger Barnes
SyPy July 2012
Some sites suck
Some sites suck - "for your own good"
For security reasons, each button is an image, dynamically generated by
a hash wrapped in a mess of javascript, randomly placed
...but they work in a web browser!
Let's use the web browser to scrape them
Enter Selenium
Selenium automates browsers
That's it
Selenium can...
●   navigate (windows, frames, links)
●   find elements and parse attributes
●   interact and trigger events (click, type, ...)
●   capture screenshots
●   run javascript
●   let the browser take care of the hard stuff
    (cookies, javascript, sessions, profiles,
    DOM)

Comes with various components and bindings ... including python
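
A minimal sketch of a few of those abilities in one function. It is not from the talk: `summarise_page`, the example URL and the screenshot path are made up for illustration, and it uses the `find_elements(by, value)` form from current Selenium releases rather than the older `find_element_by_*` helpers. The function only needs an object that behaves like a WebDriver, so the browser is started separately in `main()`.

```python
def summarise_page(driver, url, screenshot_path="page.png"):
    """Visit a page and collect a few facts using plain WebDriver calls."""
    driver.get(url)  # navigate

    # Find elements and parse attributes.
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    links = driver.find_elements("css selector", "a[href]")
    hrefs = [link.get_attribute("href") for link in links]

    # Run javascript inside the page and get the result back in Python.
    link_count = driver.execute_script("return document.links.length;")

    # Capture a screenshot of what the browser rendered.
    driver.save_screenshot(screenshot_path)

    return {"title": driver.title, "hrefs": hrefs, "link_count": link_count}


def main():
    # Imported here so the helper above can be exercised without a browser.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        print(summarise_page(driver, "https://example.com"))  # hypothetical target
    finally:
        driver.quit()
```

Call `main()` on a machine with Firefox and the selenium package installed; cookies, sessions and the DOM are all handled by the browser itself.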
General Recipe
Ingredients:
● firefox (or chrome)
● firebug (or chrome dev tools)
● Selenium IDE
    ○ record a session, write less code
●   python and its batteries
●   python-selenium
●   xvfb and pyvirtualdisplay (optional)
●   other libraries to taste
    ○ e.g. image manipulation, database access, DOM parsing, OCR
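
One way the optional xvfb + pyvirtualdisplay ingredient might be used, so the browser runs on a box with no screen. This is a sketch, not the talk's code: `virtual_display` and its `display_factory` parameter are hypothetical names, and the screen size is an arbitrary choice.

```python
from contextlib import contextmanager


@contextmanager
def virtual_display(size=(1280, 1024), display_factory=None):
    """Start an invisible X display for the duration of a with-block."""
    if display_factory is None:
        # Imported lazily; needs the xvfb package installed on the machine.
        from pyvirtualdisplay import Display
        display_factory = Display

    display = display_factory(visible=False, size=size)
    display.start()
    try:
        yield display
    finally:
        # Always tear the display down, even if the scrape blows up.
        display.stop()
```

Usage would look like `with virtual_display(): driver = webdriver.Firefox()`, with the rest of the scrape inside the block.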
General Recipe
Method:
● Install requirements (apt-get, pip etc)
   ○ sudo apt-get install xvfb firefox
   ○ pip install selenium pyvirtualdisplay
● Start up Firefox and Selenium IDE
● Record a "test" run through site
   ○ Add in some assertions along the way
● Export test as Python script
● Hack from there
   ○ Loops
   ○ Image/data extraction
   ○ Wrangling data into a database
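
For the "wrangling data into a database" step, a small sketch using sqlite3 from python's batteries. The table layout, the `parse_amount` and `save_balances` helpers, and the idea of storing cents as integers are assumptions for illustration; the talk does not say how the data was stored.

```python
import sqlite3
from datetime import datetime, timezone
from decimal import Decimal


def parse_amount(text):
    """Turn scraped text such as '$1,234.56' into integer cents."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(Decimal(cleaned) * 100)


def save_balances(conn, balances, scraped_at=None):
    """Append one row per account so history builds up across runs."""
    stamp = (scraped_at or datetime.now(timezone.utc)).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS balance ("
            "scraped_at TEXT NOT NULL, "
            "account TEXT NOT NULL, "
            "amount_cents INTEGER NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO balance (scraped_at, account, amount_cents) "
            "VALUES (?, ?, ?)",
            [(stamp, account, cents) for account, cents in balances.items()],
        )
```

Run after each scrape, for example `save_balances(sqlite3.connect("balances.db"), {"Savings": parse_amount(label_text)})`, where `label_text` is whatever the balance element's text turned out to be.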
Example from Selenium IDE
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By


class Ingdirect2(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://www.ingdirect.com.au"
        self.verificationErrors = []

    def test_ingdirect2(self):
        driver = self.driver
        driver.get(self.base_url + "/client/index.aspx")
        driver.switch_to_frame('body')  # Had to add this
        driver.find_element_by_id("txtCIF").clear()
        driver.find_element_by_id("txtCIF").send_keys("12345678")
        driver.find_element_by_id("objKeypad_B1").click()
        driver.find_element_by_id("objKeypad_B2").click()
        driver.find_element_by_id("objKeypad_B3").click()
        driver.find_element_by_id("objKeypad_B4").click()
        driver.find_element_by_id("btnLogin").click()
        self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))

But what about that dang keypad? ...
PIL saves the day
import base64
import hashlib
import StringIO

from PIL import Image

# Get screenshot for extraction of button images
screenshot = driver.get_screenshot_as_base64()
im = Image.open(StringIO.StringIO(base64.decodestring(screenshot)))

table = driver.find_element_by_xpath('//*[@id="objKeypad_divShowAll"]/table')
all_buttons = table.find_elements_by_tag_name("input")

# Determine md5sum of each button by cropping based on element positions
button_mapping = {}  # md5 hex digest of button image -> element id
for button in all_buttons:
    button_image = im.crop(getcropbox(button))
    hexid = hashlib.md5(button_image.tostring()).hexdigest()
    button_mapping[hexid] = button.get_attribute("id")

# Now we know which button is which (based on previous lookup), enter the PIN
# hex_mapping: PIN character -> md5 hex digest, built during that lookup
for char in self.pin:
    driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()

driver.find_element_by_id("btnLogin").click()

# We're in!!!11one
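
The slide calls `getcropbox` and relies on a "previous lookup" (`hex_mapping`) without showing either. Here is one guess at how they might look. The bodies below are assumptions, not the author's code: they assume element coordinates and screenshot pixels share the same origin and scale (no frame offset, no HiDPI scaling), and they use Pillow's `tobytes()`, the current name for what the slide calls `tostring()`. `fingerprint` and `learn_hex_mapping` are hypothetical helper names.

```python
import hashlib


def getcropbox(element):
    """Return a (left, upper, right, lower) box for Image.crop().

    Built from a WebElement's location and size dictionaries.
    """
    left = int(element.location["x"])
    upper = int(element.location["y"])
    right = left + int(element.size["width"])
    lower = upper + int(element.size["height"])
    return (left, upper, right, lower)


def fingerprint(image, element):
    """md5 of the pixels under an element, used as that button's identity."""
    pixels = image.crop(getcropbox(element)).tobytes()
    return hashlib.md5(pixels).hexdigest()


def learn_hex_mapping(image, buttons_by_char):
    """One-off calibration: map each PIN character to its button's md5.

    buttons_by_char is filled in by hand once, by looking at the keypad
    and noting which element shows which character.
    """
    return {
        char: fingerprint(image, button)
        for char, button in buttons_by_char.items()
    }
```

The hashes only stay stable if the site renders each button image identically on every visit; if the pixels vary, exact md5 matching would fail and something fuzzier (image comparison or OCR) would be needed.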
But why do all this?
It's my data!
... and I'll graph if I want to

* Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
That's all folks
Slides
● http://bit.ly/scrapium

Code
● https://gist.github.com/3015852

Me
● https://twitter.com/mindsocket
● https://github.com/mindsocket
● roger@mindsocket.com.au
