Image Scraping Using Python
                                By
                                Arnav Lakha 1/20/FET/BCS/112
                                Shashank Rai 1/20/FET/BCS/106
                                Shourya Ahuja 1/20/FET/BCS/115
                                Arun 1/20/FET/BCS/086
                                Mohit Chaudhary 1/20/FET/BCS/087
Supervisor: Dr Ravindra Kumar
  11/29/2023
  Outline
• Introduction to scraping
• What is image scraping?
• Is Image Scraping Legal?
• Introduction to Python scraper
• How to perform image scraping
• Some scraping knowledge
  TABLE OF CONTENTS
1) Introduction
2) Problem Statements
3) Objectives
4) Hardware and Software Requirements
5) Literature Review
6) System Design
7) Methodology
8) Expected Outcome of Project / Result
9) Conclusion & Future Scope
10) References
                          Introduction
What is image scraping?
• Image scraping is a subset of the web scraping technology. While
  web scraping deals with all forms of web data extraction, image
  scraping only focuses on the media side – images, videos, audio,
  and so on.
• Image scraping is a technique used in web scraping to
  extract image data from web sources in various formats,
  including JPEG, PNG, and GIF. The term typically refers
  to an automated process, often implemented with a Python library.
• As visual content grows in importance, scraping images has become
  a powerful method for collecting data and insights.
Problem Statements
• From retail and real estate to tourism and hospitality, images play a
  vital role in influencing customer decisions. Hence, it is important for
  brands to see what kinds of photos are turning prospects into
  customers.
• On the other side, customers go through numerous products and
  images before settling on a final choice. Similarly, analysts browse
  several pages and analyze hundreds of images to gain any meaningful
  insight. In such cases, they have to download these images, which is
  extremely error-prone and time-consuming when done manually.
• In these scenarios, automated image scraping is needed.
Introduction to scraping
• There are many different tools for scraping available,
  which differ in their functionality and use.
• Tools and frameworks come and go; choose the one
  that fits the job.
• Scraping: the actual extraction of data / information
  from a web page
Is Image Scraping Legal?
Like web scraping in general, image scraping is a method for downloading
website content. It is not inherently illegal, but there are rules and best practices you
should follow. First, avoid scraping a website that explicitly states it does not want to
be scraped. You can find this out by checking the /robots.txt file on the target site.
Most websites allow web crawling because they want search engines to index their
content. You can generally scrape such websites since their images are publicly available.
However, being able to download an image does not mean you can use it as
if it were your own. Most websites license their images to prevent you from
republishing them or reusing them in other ways. Always assume that you cannot reuse
images unless there is a specific exemption.
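The robots.txt check described above can be sketched with Python's standard urllib.robotparser module. This is a minimal illustration, not a complete compliance check: the helper name is_allowed, the user agent "example-image-bot", and the example.com URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url.

    is_allowed is a hypothetical helper for this sketch; it takes the
    robots.txt content as a string so it can be tried without a network call.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: every crawler is asked to stay out of /private/.
sample_rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(sample_rules, "example-image-bot", "https://example.com/images/cat.jpg"))
print(is_allowed(sample_rules, "example-image-bot", "https://example.com/private/cat.jpg"))
```

Against a live site, the same parser can load the file itself: call set_url() with the site's /robots.txt address, then read(), before calling can_fetch().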
Best practices for image scraping to avoid common challenges
It is essential to scrape image data cautiously and follow best practices in order to avoid
technical and legal issues. Here are some best practices for image scraping:
•Check image formats and sizes: Images come in various formats, such as JPEG and
GIF, and in different sizes, such as small thumbnails. Ensure that your image scraper can
handle all of these formats and sizes.
•Follow ethical and legal guidelines: Image scraping may be illegal under certain
conditions, such as when it violates copyright laws. Check the terms of service and the
robots.txt file of the website you intend to scrape to ensure your data collection activity
does not violate any rules or policies. For example, many websites employ rate limits to
manage crawling traffic and prevent the overuse of APIs. Check for any
rate limits imposed by the website’s API and comply with them to avoid being blocked.
•Respect the website’s server and bandwidth: Limit the frequency and volume of
your requests or add time delays between your requests. You can also use caching
techniques to avoid requesting the same image data multiple times.
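The last practice, delays plus caching, can be sketched as a small wrapper around whatever download function the scraper uses. The class name PoliteFetcher, its parameters, and the one-second default delay are assumptions chosen for illustration; the cache here lives only in memory.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function with a minimum delay between requests and a cache.

    Hypothetical helper: `fetch` is any callable that takes a URL and returns
    the downloaded data. `sleep` and `clock` can be swapped out for testing.
    """

    def __init__(self, fetch, delay_seconds=1.0, sleep=time.sleep, clock=time.monotonic):
        self._fetch = fetch
        self._delay = delay_seconds
        self._sleep = sleep
        self._clock = clock
        self._cache = {}           # url -> data already downloaded
        self._last_request = None  # clock reading taken after the last real request

    def get(self, url):
        # Serve repeated URLs from the cache without contacting the server.
        if url in self._cache:
            return self._cache[url]
        # Wait until at least delay_seconds have passed since the last request.
        if self._last_request is not None:
            wait = self._delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        data = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = data
        return data
```

With Requests, the fetch argument could be something like lambda url: requests.get(url).content, so that every uncached download goes through the delay.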
                Image scraping with Python
You can scrape images from a web page using Python by following these steps:
1.Install the necessary libraries: The scraping library you choose will depend on your
specific data collection requirements. Beautiful Soup and Requests are typically the easiest
for basic image scraping tasks. Scrapy offers a fuller crawling framework for larger jobs,
and Pillow can process the images once they are downloaded. Selenium is generally used for
scraping dynamic web pages that require user interaction, such as clicking buttons or
navigating menus.
You can install the desired library using pip, the Python package installer. For
example, to install Requests, type the “pip install requests” command into your prompt or
terminal.
2.Identify the image URLs on a web page you wish to scrape: You can inspect the
HTML source code of a page using developer tools in your browser. Image URLs are
generally found in the src attribute of an <img> tag in the HTML content (Figure 1). Copy
the image URL from the src attribute to use it with a Python library.
Introduction to Python scraper
 • A Python image scraper isn't just a tool for sharpening
   programming skills. We can use it to source images for a
   machine learning project, or generate site thumbnails.
How to perform image scraping?
• Method 1: Using BeautifulSoup and Requests
• bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This
  module does not come built-in with Python. To install it, run the command below in the
  terminal.
• pip install bs4
• requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does
  not come built-in with Python. To install it, run the command below in the terminal.
• pip install requests
• Approach:
•   Import the modules
•   Send a request to the page URL with Requests
•   Pass the response text into the BeautifulSoup() constructor
•   Find all ‘img’ tags and read the ‘src’ attribute of each one
         3.Request the target web page: Once you’ve identified the
         target URLs, you can send a request to the web page containing
         the images you want to scrape. For instance, if you are using the
         Requests library to scrape an Amazon product image, you can
         use the following code.
         import requests
         url = 'https://amazon.com/xyz'
         response = requests.get(url)
         4.Parse the HTML content: You can use a Python library like
         Beautiful Soup or lxml to parse the HTML content of the response.
         5.Extract the image URLs: To extract the image URLs, find all
         image tags and read the ‘src’ attribute of each one, which holds
         the URL of the image file that needs to be downloaded.
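The parse and extract steps can be sketched as follows. This is a minimal illustration under a few assumptions: the function name extract_image_urls and the sample HTML are made up for the example, relative src values are resolved against the page URL with urljoin, and duplicates are dropped. The sketch uses Beautiful Soup when it is installed and otherwise falls back to html.parser from the standard library, so it runs either way.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

try:
    from bs4 import BeautifulSoup  # third-party: pip install bs4
except ImportError:
    BeautifulSoup = None  # the standard-library parser below is used instead


class _ImgSrcCollector(HTMLParser):
    """Fallback parser: collects the raw src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in html, in page order, without duplicates."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        raw_sources = [img.get("src") for img in soup.find_all("img")]
    else:
        collector = _ImgSrcCollector()
        collector.feed(html)
        raw_sources = collector.sources

    urls = []
    for src in raw_sources:
        if not src:
            continue  # <img> tag without a src attribute
        absolute = urljoin(base_url, src)  # resolve relative paths
        if absolute not in urls:
            urls.append(absolute)
    return urls


# Made-up page content for illustration only.
sample_html = """
<html><body>
  <img src="/static/logo.png" alt="logo">
  <img src="https://cdn.example.com/a.jpg">
  <img alt="no source">
  <img src="/static/logo.png">
</body></html>
"""
print(extract_image_urls(sample_html, "https://example.com/products/page.html"))
```

With the Requests call from step 3, html would be response.text and base_url would be response.url.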
         6.Download all the images: Once you have the image URLs, you
         must download the images from them. Python offers several options:
         the standard library's urllib.request module (which replaced
         Python 2's urllib2) and the third-party Requests library.
             1. urllib.request.urlretrieve(): It is part of the Python standard library
                and downloads a URL directly to a local file.
             2. urllib.request.urlopen(): Also part of the standard library (in Python 2
                it lived in urllib2). It opens a connection to the image URL, and the
                “read()” method reads the image data.
             3. Requests: It is a third-party Python library. You can use the “get()” function
                to send a request to the target URL and use the content attribute to access
                the image data.
         7.Save the downloaded image data: Finally, save the downloaded
         images to your local file system. For example, you can use the “os”
         module to build a path inside the directory /path/to/images and write
         the image data to a file called image.jpg in that directory. You
         can change the image filename to suit your needs.
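The download and save steps can be sketched together using only the standard library. The function name download_image, the fallback filename image.jpg, and the 10-second timeout are assumptions made for this example; it uses urllib.request.urlopen() and read(), one of the options listed above.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def download_image(url, directory, filename=None, timeout=10):
    """Download url and write its bytes into directory; return the saved path.

    Hypothetical helper for this sketch. If no filename is given, the last
    part of the URL path is used, falling back to "image.jpg".
    """
    if filename is None:
        filename = os.path.basename(urlparse(url).path) or "image.jpg"
    os.makedirs(directory, exist_ok=True)  # create the target directory if needed
    path = os.path.join(directory, filename)
    with urlopen(url, timeout=timeout) as response:
        data = response.read()  # raw image bytes
    with open(path, "wb") as image_file:  # binary mode keeps the bytes unchanged
        image_file.write(data)
    return path
```

A call such as download_image(image_url, "/path/to/images") would then save each URL collected in step 5 under its own filename. With Requests, the urlopen() and read() lines could be replaced by requests.get(url).content.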
Some scraping knowledge
• Python: the language used to extract images from the
  web page
• HTTP: the communication protocol
• HTML: the language in which web pages are defined
• JS: JavaScript (code executing in the browser)
• CSS: style sheets, how web pages are styled.
  Important, but does not contain data.
• JPG, PNG, BMP: images
• CSV / TXT / JSON / XML: data
PROBLEM STATEMENT
                                     Project OBJECTIVES
   To study/examine the existing ..
   To identify the gaps in the existing techniques and find the scope of ...
   To Evaluate and implement the ….
METHODOLOGY
EXPECTED OUTCOME
• This aims to …
                                       REFERENCES
• https://research.aimultiple.com/image-scraping/
Thank You!