Web Scraping with PHP
Matthew Turland, Acadiana Open Source Group, April 30, 2009
What Is It?
Normal Web Browsing
Difference #1: Immediate Audience
Difference #2: Consumption Method
Why Is It Useful?
Data Without Web Services
Integration Testing
Crawlers
"With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal."
    3.14 The Power of Plain Text, The Pragmatic Programmer
Disadvantages
Potential Lack of Stability
Reverse Engineering Required
More Requests
No Nice Neat Data Package
Step #1: Retrieval
Speaking the Language
The Web We Weave
Request:
    GET / HTTP/1.1
    User-Agent: ...
Response:
    HTTP/1.1 200 OK
    Content-Type: ...
Browsing -> Requests
Link:
    <a href="/index.php?foo=bar">Index</a>
    GET /index.php?foo=bar HTTP/1.1
Form:
    <form method="post" action="/index.php">
        <input name="foo" value="bar" />
    </form>
    POST /index.php HTTP/1.1

    foo=bar
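The form-to-POST mapping above can be sketched in code. This is a minimal illustration, not part of the original slides: `buildPostRequest()` is a hypothetical helper, and the host name and header set are assumptions chosen for the example.

```php
<?php
// Hypothetical helper: turn form fields into a raw HTTP/1.1 POST request string.
function buildPostRequest($host, $path, array $fields)
{
    // http_build_query() URL-encodes names and values,
    // so array('foo' => 'bar') becomes "foo=bar".
    $body = http_build_query($fields);

    $lines = array(
        "POST $path HTTP/1.1",
        "Host: $host",
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . strlen($body),
        'Connection: close',
    );

    // Header lines are CRLF-separated; a blank line separates them from the body.
    return implode("\r\n", $lines) . "\r\n\r\n" . $body;
}

// Assumed example values mirroring the form shown above.
echo buildPostRequest('www.example.com', '/index.php', array('foo' => 'bar'));
```

Writing the request out by hand like this is mainly useful for understanding what a browser sends; in practice one of the client libraries would build it.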
Responses -> Rendered Elements
    <img src="/intl/en_ALL/images/logo.gif" />

    GET /intl/en_ALL/images/logo.gif HTTP/1.1
    Host: google.com

    HTTP/1.1 200 OK
    Content-Type: image/gif
    Content-Length: 8558
Not As Easy As It Looks
Redirections
Referer [sic]
Cookies
User Agent Sniffing
robots.txt
Caching
HTTP Authentication
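Several of these complications (redirections, the Referer header, cookies, user agent sniffing, HTTP authentication) can be handled through HTTP stream context options. The following is a minimal sketch under stated assumptions: `buildScraperContext()` is a hypothetical helper, and the redirect limit and timeout values are arbitrary choices, not recommendations from the slides.

```php
<?php
// Hypothetical helper: build a stream context covering several of the issues above.
function buildScraperContext($userAgent, $referer, array $cookies = array(),
                             $username = null, $password = null)
{
    $headers = array('Referer: ' . $referer);

    if ($cookies) {
        // Cookie header format: "name1=value1; name2=value2"
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    if ($username !== null) {
        // HTTP Basic authentication sends base64("user:password").
        $headers[] = 'Authorization: Basic '
                   . base64_encode($username . ':' . $password);
    }

    return stream_context_create(array(
        'http' => array(
            'user_agent'      => $userAgent,
            'follow_location' => 1,  // follow Location redirects
            'max_redirects'   => 5,  // assumed limit for this sketch
            'timeout'         => 10, // seconds, assumed for this sketch
            'header'          => implode("\r\n", $headers),
        ),
    ));
}

$context = buildScraperContext(
    'ExampleBot/1.0',              // assumed user agent string
    'http://www.example.com/',     // assumed referring page
    array('session' => 'abc123'),  // assumed cookie
    'user',
    'pass'
);
```

The resulting context would be passed as the third argument to `file_get_contents()`. robots.txt handling and caching are not covered by context options and need separate logic.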
PHP: Glue for the Web
HTTP Client Libraries: PEAR::HTTP_Client, pecl_http, Zend_Http_Client, Streams, cURL
Simple Streams Example
    $uri = 'http://www.example.com/some/resource';

    // GET request
    $get = file_get_contents($uri);

    // POST request with URL-encoded form data
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => 'Content-Type: application/x-www-form-urlencoded',
            'content' => http_build_query(array(
                'var1' => 'value1',
                'var2' => 'value2'
            ))
        )
    ));
    $post = file_get_contents($uri, false, $context);
pecl_http Example
    $http = new HttpRequest($uri);
    $http->enableCookies();
    $http->setMethod(HTTP_METH_POST);
    $http->addPostFields(array('var1' => 'value1'));
    // setOptions() takes a single associative array of options
    $http->setOptions(array(
        'useragent' => 'PHP ' . phpversion(),
        'referer'   => 'http://example.com/some/referer'
    ));
    $response = $http->send();
    $headers  = $response->getHeaders();
    $body     = $response->getBody();
pecl_http Request Pooling
    $pool = new HttpRequestPool;
    foreach ($urls as $url) {
        $request = new HttpRequest($url, HTTP_METH_GET);
        $pool->attach($request);
    }
    $pool->send();
    foreach ($pool as $request) {
        echo $request->getUrl(), PHP_EOL;
        echo $request->getResponseBody(), PHP_EOL;
    }
HTTP Resources
    RFC 2616: Hypertext Transfer Protocol
    RFC 3986: Uniform Resource Identifiers
    "HTTP: The Definitive Guide" (ISBN 1565925092)
    "HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
    "HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
    Ben Ramsey's blog series on HTTP
Step #2: Analysis
Tidy Extension
    $config = array('output-xhtml' => true);

    // Parse from a string...
    $tidy = tidy_parse_string($markupString, $config);
    // ...or from a file
    $tidy = tidy_parse_file($markupFilePath, $config);

    $output = tidy_get_output($tidy);
DOM Extension
    $doc = new DOMDocument;

    // Load from a string...
    $doc->loadHTML($htmlString);
    // ...or from a file
    $doc->loadHTMLFile($htmlFilePath);

    // Select by tag name...
    $listItems = $doc->getElementsByTagName('li');
    // ...or by XPath expression
    $xpath = new DOMXPath($doc);
    $listItems = $xpath->query('//ul/li');

    foreach ($listItems as $listItem) {
        echo $listItem->nodeValue, PHP_EOL;
    }
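As a worked illustration of the DOM approach, the sketch below pulls every link out of an HTML string. `extractLinks()` is a hypothetical helper and the sample markup is invented for the example; suppressing libxml warnings with `libxml_use_internal_errors()` is an assumption about how one would typically cope with malformed real-world markup, not something the slides prescribe.

```php
<?php
// Hypothetical helper: return the href and text of every <a href="..."> element.
function extractLinks($html)
{
    // Real-world markup is often malformed; collect libxml warnings
    // internally instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument;
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    // Only anchors that actually carry an href attribute are selected.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = array(
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        );
    }
    return $links;
}

// Invented sample markup, deliberately missing some closing tags.
$html = '<ul><li><a href="/one">First</a><li><a href="/two"> Second </a><li>no link</ul>';
print_r(extractLinks($html));
```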
SimpleXML Extension
    // Load from a string...
    $sxe = new SimpleXMLElement($markupString);
    // ...or from a file (second argument is libxml options, third marks the first as a path/URL)
    $sxe = new SimpleXMLElement($filePath, 0, true);

    // Access elements as properties
    echo $sxe->body->ul->li[0], PHP_EOL;

    // Iterate over child elements
    $children = $sxe->body->ul->li;
    $children = $sxe->body->ul->children();
    foreach ($children as $li) {
        echo $li, PHP_EOL;
    }

    // Access attributes with array syntax or attributes()
    echo $sxe->body->ul['id'];
    $attributes = $sxe->body->ul->attributes();
    foreach ($attributes as $name => $value) {
        echo $name, '=', $value, PHP_EOL;
    }
XMLReader Extension
    $doc = XMLReader::xml($xmlString);
    $doc = XMLReader::open($filePath);

    while ($doc->read()) {
        if ($doc->nodeType == XMLReader::ELEMENT) {
            var_dump($doc->localName);
            var_dump($doc->hasValue);
            var_dump($doc->value);
            var_dump($doc->hasAttributes);
            var_dump($doc->getAttribute('id'));
        }
    }
CSS Selector Libraries: phpQuery, Simple HTML DOM Parser, Zend_Dom_Query
    $doc1 = phpQuery::newDocumentFile($markupFilePath);
    $doc2 = phpQuery::newDocument($markupString);
    $listItems = pq('ul > li');         // uses $doc2
    $listItems = pq('ul > li', $doc1);
PCRE Extension
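The slide gives no example for PCRE, so the following is only a sketch of one plausible use: pulling small, well-defined patterns out of text that has already been retrieved. `extractPrices()`, the price format, and the sample string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: find dollar amounts such as "$19.99" or "$1,250.00".
function extractPrices($text)
{
    // \$                literal dollar sign
    // \d{1,3}(?:,\d{3})*  digits with optional thousands separators
    // (?:\.\d{2})?      optional cents
    preg_match_all('/\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)/', $text, $matches);

    $prices = array();
    foreach ($matches[1] as $amount) {
        // Strip thousands separators before converting to a number.
        $prices[] = (float) str_replace(',', '', $amount);
    }
    return $prices;
}

// Invented sample text.
print_r(extractPrices('Widget: $19.99, Gadget: $1,250.00'));
```

Regular expressions tend to work best on short, predictable fragments like this; for nested markup, the DOM-based extensions are usually the more robust choice.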
Best Practices
Approximate Human Behavior
Minimize Requests
Batch Jobs, Non-Peak Hours
Account for Unavailability
Aim for Parallelism
Validate Data
Test, Test, Test!
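Two of these practices, approximating human behavior and accounting for unavailability, can be combined in a retry loop that pauses between attempts. This is a minimal sketch under stated assumptions: `fetchWithRetry()` is a hypothetical helper, the fetch callback is assumed to return `false` on failure (as `file_get_contents()` does), and the attempt count and delays are arbitrary.

```php
<?php
// Hypothetical helper: call $fetch until it succeeds or attempts run out.
// $fetch must return false on failure and anything else on success.
function fetchWithRetry($fetch, $maxAttempts = 3, $baseDelayMs = 500)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = call_user_func($fetch);
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // Exponential backoff: base delay, then 2x, then 4x, and so on.
            usleep($baseDelayMs * 1000 * (1 << ($attempt - 1)));
        }
    }
    return false; // every attempt failed
}

// Simulated flaky resource: fails twice, then succeeds.
$calls = 0;
$result = fetchWithRetry(function () use (&$calls) {
    $calls++;
    return $calls < 3 ? false : 'ok';
}, 5, 0);
```

In real use the callback would wrap the actual retrieval, for example a closure that returns the result of `file_get_contents($uri)`.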
Questions
