MODULE – 4
Web Scraping
Web scraping is the process of extracting data from websites automatically using programs
or scripts.
It involves fetching web pages, parsing their content, and extracting specific information
such as text, images, links, or tables.
Web scraping is widely used for data collection, market research, price monitoring,
sentiment analysis, and more.
1. What is Web Scraping?
Web scraping refers to the automated extraction of data from web pages. Instead of
manually copying data, a script interacts with a website’s HTML structure to retrieve
targeted information. For example, scraping an e-commerce website to collect product
names and prices or scraping a news site for article headlines.
Key Concepts:
• HTML Structure: Websites are built using HTML, which organizes content into tags
(e.g., <div>, <p>, <h1>). Scraping involves navigating this structure to locate data.
• Web Requests: Scraping starts by sending HTTP requests to a website’s server to
retrieve its HTML content.
• Parsing: The retrieved HTML is parsed to extract specific elements using libraries or
tools.
• Ethics and Legality: Always check a website’s robots.txt file and terms of service to
ensure scraping is allowed. Excessive requests can overload servers, so ethical
scraping includes rate-limiting.
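For example, a minimal sketch of polite, rate-limited fetching (the URLs and delay are illustrative assumptions, not a real target):
import time
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical pages
for url in urls:
    response = requests.get(url, timeout=10)   # avoid hanging forever
    print(url, response.status_code)
    time.sleep(2)                              # rate-limit: pause between requests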
Why Use Web Scraping?
• Automation: Saves time compared to manual data collection.
• Data Analysis: Provides large datasets for research, machine learning, or business
intelligence.
• Real-Time Updates: Tracks changes like price drops or news updates.
• Versatility: Applicable to various domains, from finance to social media.
Challenges in Web Scraping:
• Dynamic Websites: Sites using JavaScript (e.g., React, Angular) may require tools
like Selenium to render content.
• Anti-Scraping Measures: CAPTCHAs, IP bans, or rate limits can block scrapers.
• Data Cleaning: Scraped data may need formatting or cleaning for analysis.
• Legal Risks: Unauthorized scraping can lead to legal consequences.
2. Steps in Web Scraping
Web scraping follows a systematic process to ensure accurate and efficient data
extraction. Below are the detailed steps, which are critical for exam answers:
Step 1: Identify the Target Website
• Choose the website and specific pages to scrape (e.g., a product listing page).
• Understand the website’s structure by inspecting its HTML using browser developer
tools (right-click → Inspect).
Step 2: Send HTTP Request to Fetch the Web Page
• Use a library like requests in Python to send an HTTP GET request to the website’s
URL.
• The server responds with the HTML content of the page.
Step 3: Parse the HTML Content
• The HTML content is parsed to navigate its structure and locate specific elements.
• Libraries like BeautifulSoup or lxml are used to parse HTML and extract data based
on tags, classes, or IDs.
Step 4: Extract the Desired Data
• Identify the HTML elements containing the target data (e.g., product names, prices,
or links).
• Use selectors like CSS classes, IDs, or tag names to extract data.
• Store the extracted data in a structured format (e.g., lists, dictionaries, or CSV files).
Step 5: Store the Data
• Save the scraped data in a suitable format:
o CSV/Excel: For tabular data.
o JSON: For structured data.
o Database: For large-scale storage (e.g., SQLite, MySQL).
• Example: Write data to a CSV file using Python’s csv module.
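A minimal sketch of writing scraped rows to a CSV file with the csv module (the rows here are made-up sample data):
import csv

rows = [("Book A", 10.99), ("Book B", 7.50)]   # assumed scraped data
with open("products.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])        # header row
    writer.writerows(rows)                     # data rows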
Step 6: Clean and Process the Data
• Remove unwanted characters, duplicates, or missing values.
• Format data for analysis (e.g., convert prices to numerical values).
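For instance, a small sketch of cleaning a scraped price string (the "£51.77" format is only an assumed example):
raw_price = "£51.77"                  # scraped text (assumed format)
price = float(raw_price.strip("£"))   # remove the currency symbol, convert to a number
print(price)                          # 51.77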
Step 7: Handle Errors and Ethics
• Implement error handling for failed requests or missing elements.
• Respect the website’s terms and legal boundaries.
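A hedged sketch of basic error handling around a request (the URL is a placeholder):
import requests

try:
    response = requests.get("https://example.com", timeout=10)
    response.raise_for_status()           # raises HTTPError for 4xx/5xx responses
    print(len(response.text), "characters received")
except requests.exceptions.RequestException as e:
    print("Request failed:", e)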
Requests Module in Python
1. requests.get()
• Purpose: Sends a GET request to retrieve data from a server.
• Common Use: Fetching web pages, APIs, etc.
Example:
import requests
r = requests.get('https://example.com')
print(r.text)
Note: r.text returns the page content as a string.
2. requests.post()
• Purpose: Sends a POST request to submit data to a server.
• Common Use: Form submissions, login forms, API data sending.
Example:
data = {'username': 'user', 'password': 'pass'}
r = requests.post('https://example.com/login', data=data)
print(r.status_code)
Note: data is sent in the request body.
3. requests.put()
• Purpose: Sends a PUT request to update existing data on a server.
• Common Use: Update a resource (like editing a profile).
Example:
data = {'name': 'New Name'}
r = requests.put('https://example.com/profile/1', data=data)
print(r.status_code)
Note: Similar to POST but used for updates.
4. requests.delete()
• Purpose: Sends a DELETE request to remove data on the server.
• Common Use: Deleting a resource (like deleting a user).
Example:
r = requests.delete('https://example.com/profile/1')
print(r.status_code)
Note: 204 status code usually means "Deleted Successfully".
5. requests.head()
• Purpose: Sends a HEAD request (same as GET but without body content).
• Common Use: Check headers/meta info without downloading data.
Example:
r = requests.head('https://example.com')
print(r.headers)
Note: Useful for checking content type, server type, etc.
6. requests.options()
• Purpose: Sends an OPTIONS request to find out allowed operations on a server.
• Common Use: Know what methods (GET, POST, DELETE, etc.) are supported.
Example:
r = requests.options('https://example.com')
print(r.headers.get('Allow'))
Note: Returns allowed HTTP methods for that URL.
Web Scraping in Python using BeautifulSoup
• BeautifulSoup is a Python library used to parse HTML and XML documents.
• It creates a parse tree for parsed pages that makes it easy to extract data from
HTML.
• It is often used along with requests library to download web pages.
Installation
pip install beautifulsoup4
pip install requests
Core Components of BeautifulSoup
• Parser: html.parser, lxml, or html5lib can be used.
• Tag: Represents an HTML tag like <div>, <p>, <a>, etc.
• NavigableString: The text inside a tag.
• Attributes: Properties inside a tag like href, id, class, etc.
• Methods: Functions to search, traverse, and modify the parse tree.
🛠 How BeautifulSoup Works
1. Download HTML page (using requests).
2. Parse the HTML content.
3. Search, filter, and extract specific data.
4. Process or store the extracted data.
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
Important BeautifulSoup Methods
• find(): Finds the first matching tag.
• find_all(): Finds all matching tags.
• select(): Finds elements using CSS selectors.
• get_text(): Extracts text from an element.
• attrs: Gets tag attributes as a dictionary.
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1').text
print(title)
What it does: Downloads the page, parses HTML, finds the first <h1> tag, and prints its
text.
Extracting Multiple Items
links = soup.find_all('a')
for link in links:
    print(link['href'])
What it does: Prints all the hyperlinks (href) on the page.
CSS Selector Example
paragraphs = soup.select('div.content p')
for p in paragraphs:
    print(p.text)
What it does: Selects all <p> tags inside a <div> with class content.
4. Example of Web Scraping Using Python and BeautifulSoup
Below is a simple, minimal example of web scraping using Python, requests, and
BeautifulSoup. The code scrapes book titles from a sample website
(http://books.toscrape.com), which is designed for learning web scraping. This example is
tailored for exam purposes: short, clear, and easy to understand.
Prerequisites
• Install required libraries:
pip install requests beautifulsoup4
• Basic understanding of Python and HTML.
Code Example
import requests
from bs4 import BeautifulSoup
# Step 1: Send HTTP request to the website
url = "http://books.toscrape.com"
response = requests.get(url)
# Step 2: Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")
# Step 3: Extract book titles
books = soup.find_all("h3") # Titles are in <h3> tags
for book in books:
title = book.find("a")["title"] # Get title from <a> tag
print(title)
# Step 4: Save to a file (optional)
with open("books.txt", "w") as file:
for book in books:
title = book.find("a")["title"]
file.write(title + "\n")
Data Acquisition by Scraping Web Applications
Data acquisition by scraping web applications involves programmatically extracting data
from websites, often by interacting with their interfaces, such as submitting forms, fetching
pages, or parsing HTML content.
Web scraping is a powerful technique used for collecting structured or unstructured data
for analysis, research, or business intelligence.
1. Introduction to Data Acquisition by Scraping Web Applications
Data acquisition through web scraping involves retrieving data from web applications—
dynamic or static websites that deliver content via HTML, JavaScript, or APIs.
Web applications often require interaction, such as submitting forms or navigating pages,
to access specific data.
Scraping web applications is more complex than scraping static pages because it may
involve handling user inputs, dynamic content, or session management.
Key Concepts:
• Web Applications: Websites with interactive features (e.g., search forms, login
pages, or dynamic content loaded via JavaScript).
• Scraping Goals: Extract data like search results, user inputs, or numerical data for
analysis.
• Tools: Python libraries like requests, BeautifulSoup, and Selenium are commonly
used.
• Applications: Price monitoring, sentiment analysis, data aggregation, or academic
research.
Why Scrape Web Applications?
• Automation: Automates data collection from interactive websites.
• Dynamic Data: Accesses real-time or user-specific data (e.g., search results).
• Numerical Analysis: Provides datasets for statistical or predictive modeling.
• Competitive Advantage: Enables businesses to monitor competitors or market
trends.
Challenges:
• Dynamic Content: JavaScript-rendered pages require browser automation tools.
• Form Handling: Submitting forms requires mimicking user inputs (e.g., POST
requests).
• Anti-Scraping Measures: CAPTCHAs, rate limits, or IP bans.
• Data Cleaning: Scraped data often needs formatting or validation.
import requests
from bs4 import BeautifulSoup
url = 'https://quotes.toscrape.com/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
quotes = soup.find_all('span', class_='text')
for q in quotes:
    print(q.text)
Submitting Forms and Performing Numerical Analysis
Web applications often use forms to collect user inputs (e.g., search queries, login
credentials) and return dynamic results. Scraping such applications requires simulating
form submission, which involves:
• Identifying Form Elements: Inspect the HTML to find the form’s action URL,
method (GET/POST), and input fields (e.g., <input name="query">).
• HTTP Requests:
o GET: Parameters are appended to the URL (e.g., ?query=python).
o POST: Data is sent in the request body, often requiring a payload.
• Session Management: Some forms require cookies or session tokens to maintain
state.
• Tools:
o requests: For sending GET/POST requests with form data.
o Selenium: For forms on JavaScript-heavy sites or those requiring clicks.
What does "Submitting Form" mean?
• In web scraping, submitting a form is sending data to the server just like filling a
form on a website and clicking "Submit".
• Instead of manually filling the form, you send a POST request with the required
data.
How Form Submission Works (Internally)
1. Form on website usually has an action URL and input fields (name, email, etc.).
2. Scraper builds a data dictionary with form fields and their values.
3. Scraper sends a POST request to the form's action URL.
4. Server processes the data and responds with success page or error message.
Important Concepts for Form Submission
• action URL: Where the form data needs to be posted.
• method: POST (common) or GET (rare for forms).
• Hidden fields: Some forms have hidden fields (tokens) for security.
• Cookies: Sessions and cookies might be needed for login forms.
• CSRF token: Some forms use CSRF tokens to prevent fake submissions.
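As a sketch only (the field names, token location, and URLs are assumptions, not a real site's form), a session-based submission that carries a hidden token might look like:
import requests
from bs4 import BeautifulSoup

session = requests.Session()                              # keeps cookies across requests
page = session.get("https://example.com/login")           # assumed login page
soup = BeautifulSoup(page.text, "html.parser")
token_tag = soup.find("input", {"name": "csrf_token"})    # hypothetical hidden field
token = token_tag["value"] if token_tag else ""
data = {"username": "user", "password": "pass", "csrf_token": token}
response = session.post("https://example.com/login", data=data)
print(response.status_code)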
Numerical Analysis
Once data is scraped, numerical analysis can be performed to derive insights. This
involves:
• Data Extraction: Collecting numerical data (e.g., prices, ratings, quantities).
• Data Cleaning: Converting strings to numbers, handling missing values.
• Analysis Techniques:
o Descriptive Statistics: Mean, median, standard deviation.
o Visualization: Plotting data using libraries like matplotlib.
o Predictive Modeling: Using scikit-learn for regression or classification.
• Tools: pandas for data manipulation, numpy for calculations, matplotlib for
visualization.
Practical Steps for Form Submission and Analysis
1. Inspect the form using browser developer tools to identify the action URL, method,
and input names.
2. Send a GET/POST request with the form data using requests.
3. Parse the response HTML to extract numerical data.
4. Store the data in a structured format (e.g., list or DataFrame).
5. Perform numerical analysis (e.g., calculate averages or plot trends).
Code Example: Submitting a Search Form and Analyzing Results
This example submits a search query to http://quotes.toscrape.com/search.aspx (a sample
site) and calculates the average length of quote texts.
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Step 1: Submit form (POST request)
url = "http://quotes.toscrape.com/search.aspx"
data = {"author": "Albert Einstein", "tag": "science"} # Form data
response = requests.post(url, data=data)
# Step 2: Parse response
soup = BeautifulSoup(response.text, "html.parser")
quotes = soup.find_all("span", class_="content") # Quote texts
# Step 3: Extract and analyze
lengths = [len(quote.text) for quote in quotes]
df = pd.DataFrame(lengths, columns=["Quote Length"])
# Step 4: Numerical analysis
mean_length = df["Quote Length"].mean()
print(f"Average quote length: {mean_length:.2f} characters")
# Step 5: Save to CSV
df.to_csv("quote_lengths.csv", index=False)
Fetching Web Pages
Fetching web pages is the first step in web scraping, involving sending HTTP requests to
retrieve HTML content. Key points:
• HTTP Methods:
o GET: Retrieves data (e.g., a webpage or API response).
o POST: Sends data to the server (e.g., form submissions).
• Status Codes:
o 200: Success.
o 403: Forbidden (scraping may be blocked).
o 404: Page not found.
• Headers: Include User-Agent to mimic a browser and avoid detection.
• Tools:
o requests: Simple and efficient for static pages.
o urllib: Built-in but less user-friendly.
o Selenium: For dynamic pages requiring JavaScript rendering.
What does "Fetching" mean?
• Fetching = Sending a GET request to the server asking for a web page.
• The server responds with the HTML content of the page.
• This HTML content can then be parsed and scraped using libraries like
BeautifulSoup.
How Fetching Works (Internally)
1. Browser or Scraper sends a GET request to the server.
2. Server processes the request and responds with HTML.
3. Scraper (your Python code) receives the page and saves it in memory.
4. You can then parse it using BeautifulSoup.
Important Fetching Concepts
• Headers: You can send headers like User-Agent to act like a browser.
• Status code: After fetching, check whether the page was delivered (e.g., 200 OK).
• Timeout: Set a time limit to prevent the scraper from hanging forever.
• Session: A session maintains cookies across multiple requests.
Practical Steps
1. Identify the target URL.
2. Send a GET request using requests.get(url, headers=headers).
3. Check the response status code (response.status_code).
4. Access the HTML content (response.text) for parsing.
Code Example: Fetching a Web Page
This example fetches the homepage of http://books.toscrape.com.
import requests
# Step 1: Fetch web page
url = "http://books.toscrape.com"
headers = {"User-Agent": "Mozilla/5.0"} # Mimic browser
response = requests.get(url, headers=headers)
# Step 2: Check status and print content
if response.status_code == 200:
    print("Page fetched successfully!")
    print(response.text[:500])  # First 500 characters of HTML
else:
    print(f"Failed: Status code {response.status_code}")
Downloading Web Pages Through Form Submission
Downloading Pages via Forms
Some web applications require form submission to access specific pages (e.g., search
results or filtered data). This involves:
• Form Analysis: Identify the form’s action URL, method (GET/POST), and input fields.
• Simulating Submission:
o GET: Append form data to the URL.
o POST: Send data in the request body.
• Downloading Content: Save the response HTML or extracted data to a file.
• Challenges:
o Session cookies may be required (use requests.Session()).
o Dynamic forms may need Selenium for JavaScript rendering.
Practical Steps
1. Inspect the form to find the action URL and input names.
2. Send a POST request with form data using requests.post.
3. Save the response HTML to a file or parse it for data.
4. Handle cookies or sessions if required.
Code Example: Downloading a Page After Form Submission
This example submits a form on http://quotes.toscrape.com/search.aspx and saves the
resulting HTML.
import requests
# Step 1: Submit form (POST request)
url = "http://quotes.toscrape.com/search.aspx"
data = {"author": "Albert Einstein", "tag": "science"}
response = requests.post(url, data=data)
# Step 2: Save the response HTML
with open("search_results.html", "w", encoding="utf-8") as file:
file.write(response.text)
print("Page downloaded successfully!")
• Some web pages only show data after you submit a form (e.g., search forms, login
pages).
• You can’t just do requests.get(url) because data is generated after form
submission.
• Solution: You must simulate the form submission by sending a POST request with
the form data.
🛠 How it Works Internally (Step-by-Step)
1. Visit the website manually and inspect the form (<form> tag).
2. Identify:
o action URL (where the form sends data)
o method (usually POST)
o Input fields (name and value)
3. Build a data dictionary in Python with field names and values.
4. Send a POST request to the action URL with this data.
5. Get the response — the server sends the new page HTML.
6. Parse the HTML using BeautifulSoup.
Important Concepts to Remember
• Form action URL: Where the form sends the data.
• Method: Usually POST (sometimes GET).
• Input fields: Names of the fields like username, password.
• Hidden fields: Extra fields like CSRF tokens.
• Session: Use requests.Session() if you need cookies to persist.
Very Small Code to Download Videos or Large Files
import requests
url = 'https://example.com/video.mp4'
r = requests.get(url, stream=True)
with open('video.mp4', 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
CSS Selectors for Web Scraping
CSS selectors are a powerful tool used in web scraping to locate and extract specific
elements from HTML documents based on their tags, classes, IDs, attributes, or structure.
They are widely used with libraries like BeautifulSoup (via the select method) or Scrapy to
target elements precisely.
1. Introduction to CSS Selectors
CSS (Cascading Style Sheets) selectors are patterns used to identify HTML elements based
on their properties, such as tag names, classes, IDs, attributes, or hierarchical
relationships.
In web scraping, CSS selectors allow you to pinpoint specific elements (e.g., paragraphs,
links, or divs) within a webpage’s HTML structure for extraction.
Key Concepts:
• HTML Structure: Websites are built using HTML, with elements organized in a tree-
like structure (DOM - Document Object Model). CSS selectors navigate this
structure to locate elements.
• Web Scraping Context: CSS selectors are used with libraries like BeautifulSoup
(soup.select()) or Scrapy to extract data from HTML.
• Advantages:
o Precise targeting of elements.
o Simpler syntax compared to XPath in many cases.
o Familiar to those with web development experience.
• Applications: Extracting product prices, article titles, user comments, or any
structured data from websites.
• Ethics: Ensure scraping complies with the website’s robots.txt and terms of service.
Avoid excessive requests to prevent server overload.
Why Use CSS Selectors in Web Scraping?
• Precision: Target elements by class, ID, or attributes with minimal code.
• Flexibility: Combine selectors to navigate complex HTML structures.
• Readability: CSS selector syntax is intuitive (e.g., .product for a class).
• Compatibility: Supported by popular scraping tools like BeautifulSoup, Scrapy, and
Selenium.
Challenges:
• Dynamic Content: JavaScript-rendered elements may require tools like Selenium.
• HTML Changes: Website updates can break selectors.
• Specificity: Overly broad selectors may return unwanted elements.
• Learning Curve: Complex selectors (e.g., pseudo-classes) require practice.
2. Types of CSS Selectors
CSS selectors come in various types, each suited for different use cases in web scraping.
Below is a detailed breakdown of the most common selectors, including their syntax,
purpose, and examples.
2.1 Basic Selectors
These target elements based on simple properties like tags, classes, or IDs.
• Element Selector:
o Purpose: Selects all elements of a specific tag.
o Example: p selects all <p> tags.
o Use Case: Extract all paragraphs from a webpage.
• Class Selector:
o Purpose: Selects elements with a specific class attribute.
o Example: .product selects <div class="product">.
o Use Case: Extract all product listings with a common class.
• ID Selector:
o Purpose: Selects a single element with a specific ID.
o Example: #header selects <div id="header">.
o Use Case: Extract a unique header element.
• Universal Selector:
o Purpose: Selects all elements in the document.
o Example: * selects every element (rarely used due to lack of specificity).
o Use Case: Debugging or selecting all elements within a specific container.
2.2 Attribute Selectors
These target elements based on their attributes or attribute values.
• Attribute Presence:
o Syntax: [attribute]
o Purpose: Selects elements with a specific attribute.
o Example: [href] selects all elements with an href attribute (e.g., <a
href="...">).
o Use Case: Extract all links.
• Attribute Value:
o Syntax: [attribute="value"]
o Purpose: Selects elements with an exact attribute value.
o Example: [type="text"] selects <input type="text">.
o Use Case: Extract specific input fields in a form.
• Attribute Contains:
o Syntax: [attribute*="value"]
o Purpose: Selects elements whose attribute contains a substring.
o Example: [class*="product"] selects elements with "product" in their class.
o Use Case: Extract elements with partial class names.
• Attribute Starts With:
o Syntax: [attribute^="value"]
o Purpose: Selects elements whose attribute starts with a value.
o Example: [href^="https"] selects links starting with "https".
o Use Case: Extract secure links.
• Attribute Ends With:
o Syntax: [attribute$="value"]
o Purpose: Selects elements whose attribute ends with a value.
o Example: [src$=".jpg"] selects images ending with ".jpg".
o Use Case: Extract image files.
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Select elements using CSS selector
titles = soup.select('h1.title')  # Selects all <h1 class="title">
for title in titles:
    print(title.text)
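The attribute selectors from Section 2.2 work the same way with select(); a small sketch (the URL and selectors are illustrative, not tied to a real page):
import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com')
soup = BeautifulSoup(r.text, 'html.parser')
secure_links = soup.select('a[href^="https"]')   # attribute "starts with" selector
images = soup.select('img[src$=".jpg"]')         # attribute "ends with" selector
print(len(secure_links), len(images))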
Numerical Analysis
What is Numerical Analysis?
• Numerical Analysis is a field of mathematics that designs methods to
approximate solutions for mathematical problems.
• Exact answers are often impossible or very costly to compute.
• So, we create algorithms that give almost-correct results (with acceptable error).
• It is widely used in engineering, science, and economics.
Why Numerical Analysis?
• Complex problems: Some equations don't have simple analytical solutions.
• Speed: Approximate methods are faster.
• Practicality: In real-world systems (physics, engineering, finance), perfect answers are not necessary; good enough is enough.
• Computers: Digital computers work with discrete approximations anyway.
🛠 Topics in Numerical Analysis (Important)
• Root finding: Find where a function equals zero (f(x) = 0).
• Interpolation: Estimate unknown values between known data points.
• Numerical integration: Approximate the area under a curve.
• Numerical differentiation: Approximate the derivative of a function.
• Solving systems of equations: Find values satisfying many simultaneous equations.
• Eigenvalues and eigenvectors: Important in stability analysis, physics, and ML.
Numerical Analysis in Python
• Python offers powerful libraries for numerical analysis:
o numpy
o scipy
o sympy
o matplotlib (for visualization)
MINI CODES FOR IMPORTANT CONCEPTS
1. Root Finding (Finding solutions for f(x) = 0)
• Goal: Find values of x where a function f(x) = 0.
• Famous methods: Bisection Method, Newton-Raphson Method.
from scipy.optimize import fsolve
def func(x):
    return x**3 - 5*x + 1
root = fsolve(func, x0=0)
print(root)
2. Interpolation (Estimating between points)
• Given a few known points, guess values in between.
• Linear interpolation and spline interpolation are common.
from scipy.interpolate import interp1d
x = [0, 1, 2]
y = [0, 1, 4]
f = interp1d(x, y)
print(f(1.5)) # estimate at x=1.5
3. Numerical Integration (Approximating Area)
• Calculate the integral (area under curve) numerically when exact integral is hard.
• Trapezoidal Rule, Simpson’s Rule, Monte Carlo methods.
from scipy.integrate import quad
def f(x):
    return x**2
area, error = quad(f, 0, 3)
print(area)
4. Numerical Differentiation (Approximating Derivatives)
• Estimate derivatives without symbolic differentiation.
• Using finite differences.
import numpy as np
def f(x):
    return x**2
x = 2.0
h = 1e-5
derivative = (f(x+h) - f(x-h)) / (2*h)
print(derivative)
5. Solving Systems of Linear Equations (Ax = b)
• Matrix equations arise in engineering, physics, ML, etc.
• Solve equations like:
2x + 3y = 8
5x + y = 7
import numpy as np
A = np.array([[2, 3], [5, 1]])
b = np.array([8, 7])
x = np.linalg.solve(A, b)
print(x)
NumPy Essentials
NumPy (Numerical Python) is a fundamental Python library for scientific computing,
providing support for large, multi-dimensional arrays and matrices, along with a collection
of mathematical functions to operate on these arrays efficiently.
It is widely used in data science, machine learning, and engineering for numerical
computations.
1. Introduction to NumPy
NumPy is an open-source library that enables fast and efficient numerical computations in
Python.
It is the backbone of many scientific Python libraries, such as pandas, SciPy, and scikit-
learn.
NumPy’s primary data structure is the ndarray (N-dimensional array), which allows for
vectorized operations, eliminating the need for slow Python loops.
Key Features:
• N-dimensional Arrays: Supports arrays of any dimension (1D, 2D, 3D, etc.).
• Vectorized Operations: Performs operations on entire arrays without loops.
• Broadcasting: Enables operations on arrays of different shapes.
• Mathematical Functions: Includes functions for linear algebra, statistics, and
random number generation.
• Performance: Written in C, making it much faster than Python’s built-in lists.
Why Use NumPy?
• Efficiency: Faster than Python lists for numerical tasks.
• Ease of Use: Simplifies complex mathematical operations.
• Versatility: Applicable in data analysis, machine learning, image processing, and
more.
• Interoperability: Integrates with other libraries like pandas and matplotlib.
Import Convention:
import numpy as np
2. NumPy Characteristics
NumPy’s power lies in its characteristics, which make it ideal for numerical computations.
These are critical for exam answers.
2.1 N-dimensional Array (ndarray)
• Definition: A homogeneous, multi-dimensional array of fixed-size elements.
• Homogeneous: All elements are of the same data type (e.g., int32, float64).
• Dimensions: Supports 1D (vectors), 2D (matrices), 3D, or higher-dimensional
arrays.
• Attributes:
o ndim: Number of dimensions (e.g., 2 for a 2D array).
o shape: Tuple of array dimensions (e.g., (3, 4) for 3 rows, 4 columns).
o size: Total number of elements.
o dtype: Data type of elements (e.g., int32, float64).
o itemsize: Size of each element in bytes.
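A small example showing these attributes on a 2D array:
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(arr.ndim)      # 2
print(arr.shape)     # (3, 4)
print(arr.size)      # 12
print(arr.dtype)     # int64 (platform dependent; may be int32 on Windows)
print(arr.itemsize)  # 8 bytes per element for int64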
2.2 Memory Efficiency
• Fixed Size: Unlike Python lists, NumPy arrays have a fixed size, reducing memory
overhead.
• Contiguous Memory: Elements are stored in contiguous memory blocks, improving
access speed.
• Typed Arrays: Explicit data types (e.g., float32) optimize memory usage.
2.3 Vectorization
• Operations are applied element-wise to entire arrays, eliminating loops.
• Example: arr + 5 adds 5 to every element.
2.4 Broadcasting
• Allows operations on arrays of different shapes by automatically aligning
dimensions.
• Example: Adding a scalar to an array or a 1D array to a 2D array.
2.5 Performance
• Written in C, leveraging optimized libraries like BLAS and LAPACK.
• Outperforms Python lists by orders of magnitude for large datasets.
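A rough, illustrative timing sketch of this difference (exact numbers depend on the machine):
import time
import numpy as np

data = list(range(1_000_000))
arr = np.arange(1_000_000)

start = time.perf_counter()
squared_list = [x * x for x in data]    # plain Python loop over a list
list_time = time.perf_counter() - start

start = time.perf_counter()
squared_arr = arr * arr                 # vectorized NumPy operation
numpy_time = time.perf_counter() - start

print(f"list: {list_time:.4f}s, numpy: {numpy_time:.4f}s")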
Categories of Array Manipulation in NumPy
Array manipulation refers to changing the shape, structure, elements, or order of arrays.
It includes reshaping, joining, splitting, changing dimensions, sorting, and more.
Here are the main categories:
1. Shape Manipulation
• Change the structure without changing data.
• Common functions:
o reshape() → change shape (rows × columns).
o ravel() → flatten array to 1D.
o flatten() → returns a flattened copy.
o resize() → change shape and size (can fill extra spaces).
• Shape manipulation is crucial for preparing data for ML models, math operations,
etc.
Example:
import numpy as np
a = np.array([[1, 2], [3, 4]])
print(a.reshape(4)) # [1 2 3 4]
2. Transposition and Axis Manipulation
• Rearranging the order of axes/dimensions.
• Common functions:
o transpose() → swaps rows with columns.
o T → shorthand for transpose.
o moveaxis() → move axes to new positions.
o swapaxes() → interchange two axes.
• Used heavily in matrix algebra, deep learning (tensor manipulation).
Example:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.T) # Transposes rows and columns
3. Joining Arrays
• Combine multiple arrays into one.
• Common functions:
o concatenate() → join along an existing axis.
o stack() → join along a new axis.
o hstack() → stack horizontally (columns).
o vstack() → stack vertically (rows).
• Helps to combine datasets, merge results.
Example:
import numpy as np
a = np.array([1, 2])
b = np.array([3, 4])
print(np.concatenate((a, b))) # [1 2 3 4]
4. Splitting Arrays
• Break one array into multiple smaller arrays.
• Common functions:
o split() → split into equal parts.
o hsplit() → split horizontally (columns).
o vsplit() → split vertically (rows).
• Very useful in preprocessing, chunking datasets.
Example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
print(np.split(a, 3)) # [array([1,2]), array([3,4]), array([5,6])]
5. Element-wise Manipulation
• Changing array values individually.
• Includes:
o Changing specific elements (arr[index] = value)
o Conditional operations (np.where()).
o Masking arrays (arr[arr>5]).
• Critical for filtering, modifying datasets based on conditions.
Example:
import numpy as np
a = np.array([1, 2, 3, 4])
a[2] = 10
print(a) # [1 2 10 4]
6. Copying and Viewing Arrays
• Control how arrays are copied or referenced.
• Common methods:
o copy() → creates a new independent array.
o view() → creates a new view (shares data with original).
• Important when you want changes to reflect (or not reflect) in the original array.
Example:
import numpy as np
a = np.array([1, 2, 3])
b = a.copy()
b[0] = 100
print(a, b) # [1 2 3] [100 2 3]
7. Sorting, Searching, Counting
• Organize and search inside arrays.
• Common functions:
o sort() → sort elements.
o argsort() → returns indices that would sort.
o where() → find indices matching condition.
o count_nonzero() → count non-zero elements.
• Very important in analysis, ranking, filtering.
Example:
import numpy as np
a = np.array([4, 2, 8, 6])
print(np.sort(a)) # [2 4 6 8]
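The other functions from this category, shown on the same small array:
import numpy as np

a = np.array([4, 2, 8, 6])
print(np.argsort(a))        # [1 0 3 2] -> indices that would sort a
print(np.where(a > 5))      # (array([2, 3]),) -> indices where the condition holds
print(np.count_nonzero(a))  # 4 -> all elements are non-zero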
NumPy Array Creation and Methods
The ndarray is NumPy’s core data structure. Below is a detailed list of ways to create arrays
and their key methods.
3.1 Creating NumPy Arrays
Arrays can be created from lists, ranges, or specialized functions.
• From Python Lists:
arr = np.array([1, 2, 3]) # 1D array
arr2 = np.array([[1, 2], [3, 4]]) # 2D array
• Zeros and Ones:
zeros = np.zeros((2, 3)) # 2x3 array of zeros
ones = np.ones((2, 3)) # 2x3 array of ones
• Empty Array:
empty = np.empty((2, 3)) # Uninitialized array (arbitrary values)
• Range and Arange:
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
• Linspace:
linspace = np.linspace(0, 1, 5) # [0, 0.25, 0.5, 0.75, 1]
• Identity Matrix:
identity = np.eye(3) # 3x3 identity matrix
• Random Arrays (Covered in Section 6):
rand = np.random.rand(2, 3) # 2x3 array of random floats [0, 1)
3.2 Key Array Methods
1) Indexing
Indexing means accessing elements from an array by their position (index).
Index starts from 0.
You can use positive indexes (from start) or negative indexes (from end).
1D arrays need a single index, 2D arrays need row and column index.
Syntax for 2D: array[row, column].
Indexing allows fast random access in NumPy arrays.
It's very similar to Python lists but more powerful.
1D Example: print(arr[2])
2D Example: print(arr[1, 2])
2) Slicing
Slicing means extracting a part of an array.
Syntax: start:end (start included, end excluded).
You can also add a step: start:end:step.
In 2D arrays, slicing happens over rows and columns separately.
Slicing does not copy, it gives a view (same memory).
You can reverse arrays with slicing ([::-1]).
Slicing is efficient and fast in NumPy.
1D Example: print(arr[1:4])
2D Example: print(arr[0:2, 1:3])
3) DataTypes
Each NumPy array has a single datatype for all its elements.
Use dtype to check or set the datatype.
Common dtypes are int32, float64, bool, etc.
Specifying dtype saves memory and speeds up computation.
Use astype() to convert datatype.
DataTypes are important when dealing with large data or ML models.
NumPy is very strict about type consistency.
1D Example: print(arr.dtype)
2D Example: print(arr.astype('float'))
4) Copy and View
Copy creates a new array (different memory).
View is a shallow copy (same memory, different object).
Changes in the original affect the view but not the copy.
Use .copy() to make full independent copies.
Slicing gives a view, not a copy.
Memory efficiency depends on whether you copy or view.
Important to manage large datasets without unnecessary copying.
1D Example: b = arr.copy()
2D Example: b = arr.view()
5) Shape
Shape shows the dimensions (rows, columns) of an array.
For 1D: number of elements, for 2D: (rows, columns).
Use .shape attribute to find or modify shape.
Shape is a tuple.
Shape consistency is important for matrix operations.
Changing shape without matching size gives error.
Shape handling is critical in Machine Learning, Deep Learning.
1D Example: print(arr.shape)
2D Example: print(arr.shape)
6) Reshape
Reshape changes the structure of an array without changing data.
Use .reshape(new_shape) method.
The total number of elements must remain the same.
-1 automatically calculates the missing dimension.
Useful in neural networks, image processing, etc.
You can convert between 1D, 2D, 3D easily.
Reshape doesn't copy data (usually).
1D Example: print(arr.reshape(3, 2))
2D Example: print(arr.reshape(1, 6))
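A quick illustration of the -1 shortcut mentioned above:
import numpy as np

arr = np.arange(12)              # 12 elements
print(arr.reshape(3, -1))        # NumPy infers the missing dimension -> shape (3, 4)
print(arr.reshape(-1, 6).shape)  # (2, 6)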
7) Iterating
Iterating means going through each element of the array.
For 1D arrays, simple for loop works.
For 2D arrays, you loop row-by-row (nested loops or nditer).
Efficient iteration is important for performance.
You can use np.nditer() for multi-dimensional iteration.
Avoid manual loops if vectorized methods are possible.
Iteration is slower than vectorized operations.
1D Example: for x in arr: print(x)
2D Example: for x in np.nditer(arr): print(x)
8) Join
Join means combining two or more arrays.
Use functions like concatenate(), stack(), hstack(), vstack().
You can join arrays along different axes.
Arrays must match in size along the axis you are not joining.
Join is important for dataset merging.
Horizontal and vertical joins are very common.
Broadcasting rules apply if needed.
1D Example: np.concatenate((arr1, arr2))
2D Example: np.vstack((arr1, arr2))
9) Split
Split means dividing an array into multiple smaller arrays.
Use np.split(), np.array_split(), etc.
Useful in train-test splitting of datasets.
You specify the number of splits or indexes.
It is important for batch processing in ML.
Splitting maintains the data structure inside parts.
Uneven splits possible with array_split.
1D Example: np.array_split(arr, 3)
2D Example: np.array_split(arr, 2, axis=0)
import numpy as np
a = np.array([10, 20, 30, 40, 50, 60])
result = np.split(a, 3)
print(result)
import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6]])
result = np.vsplit(a, 3)
print(result)
10) Search
Search finds indices of elements based on condition or value.
Use np.where(), np.searchsorted().
Very fast compared to manual looping.
Important for filtering, mask operations.
Search returns indexes where condition is true.
Greatly used in data cleaning.
You can combine search with filtering.
1D Example: np.where(arr == 5)
2D Example: np.where(arr > 10)
11) Sort
Sort arranges array elements in ascending order (or descending if reversed).
Use np.sort().
It creates a new sorted array; original remains unchanged.
You can sort by rows, columns in 2D arrays (with axis parameter).
Useful in ranking, ordering datasets.
Custom sort functions are also available.
For large datasets, fast built-in sort is preferred.
1D Example: np.sort(arr)
2D Example: np.sort(arr, axis=1)
12) Filter
Filter means selecting elements based on some condition.
You use a Boolean array (True/False) for filtering.
Masking helps keep only desired elements.
Vectorized filtering is much faster than looping.
Widely used in Machine Learning and data cleaning.
You can combine multiple conditions easily.
Filtering produces a smaller array.
1D Example: arr[arr > 10]
2D Example: arr[arr % 2 == 0]
13) flatten()
Converts any n-dimensional array into a 1D array (flat).
Very useful for reshaping tensors or matrices.
Always returns a copy.
Easy to feed into ML models (1D inputs).
Syntax: .flatten().
No parameters usually.
Great for data preprocessing.
1D Example: arr.flatten()
2D Example: arr.flatten()
14) ravel()
Similar to flatten() but returns a view if possible.
Faster than flatten for large data.
No copying unless needed.
Syntax: .ravel().
Best when you want memory-efficient flattening.
Use carefully if you plan to modify array later.
Important in advanced reshaping tasks.
1D Example: arr.ravel()
2D Example: arr.ravel()
15) transpose()
Swaps the axes of the array (rows become columns, and vice versa).
Syntax: .T or np.transpose(array).
Important for matrix algebra, neural networks.
Transpose is critical in linear algebra.
Easy with .T attribute.
Works even for higher dimensions.
You can specify axis orders.
1D Example: arr.T (no effect on a 1D array)
2D Example: arr.T
16) unique()
Returns the unique elements of an array.
Optionally, it can return counts too.
Syntax: np.unique(arr, return_counts=True).
Helps in category analysis.
Important for frequency-based tasks.
Used a lot in data mining.
It sorts the result automatically.
1D Example: np.unique(arr)
2D Example: np.unique(arr)
17) tile()
Repeats an array a number of times.
Syntax: np.tile(arr, reps).
Useful in feature engineering, data augmentation.
Repetitions are along different axes.
Creates bigger datasets artificially.
Great in simulations.
Shape must be adjusted accordingly.
1D Example: np.tile(arr, 3)
2D Example: np.tile(arr, (2, 3))
18) clip()
Restricts values to a certain range.
Syntax: np.clip(arr, min, max).
Very useful in preventing exploding values.
Common in deep learning (gradient clipping).
No change for values already in the range.
Clipped values are replaced accordingly.
Fast and vectorized.
1D Example: np.clip(arr, 0, 10)
2D Example: np.clip(arr, 1, 5)
19) repeat()
Repeats individual elements of an array.
Different from tile (tile repeats whole array).
Syntax: np.repeat(arr, repeats).
Great for expanding datasets.
Important in data science preprocessing.
Memory-efficient if used properly.
Can also specify axis.
1D Example: np.repeat(arr, 2)
2D Example: np.repeat(arr, 2, axis=1)
20) Axis-Specific Operations
o arr.sum(axis=0): Column-wise sums (sums down the rows).
o arr.sum(axis=1): Row-wise sums (sums across the columns).
21) Concatenation and Splitting:
o np.concatenate([arr1, arr2], axis=0): Joins arrays along an axis.
o np.split(arr, 2): Splits array into equal parts.
Concatenation means joining two or more arrays end-to-end along an existing axis
without changing their data.
In NumPy, we use the np.concatenate() function to concatenate arrays.
Arrays must have compatible shapes except in the dimension being concatenated.
You can concatenate arrays horizontally (axis=1), vertically (axis=0), or even along higher
axes for multidimensional arrays.
Concatenation is important for combining datasets, merging features, or stacking results
after processing.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = np.concatenate((a, b))
print(result) # Output: [1 2 3 4 5 6]
• Here, two 1D arrays are joined into one 1D array.
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
result = np.concatenate((a, b), axis=0)
print(result)
22. Resize in NumPy
Resize in NumPy means changing the shape (number of rows/columns) of an existing
array.
The np.resize() function is used to reshape an array to a new size.
• If the new size is larger, elements are repeated to fill the array.
• If the new size is smaller, the array is trimmed to fit.
Resize is useful when you need a specific number of elements for calculations, models, or visualization. It is different from reshape() because reshape() does not repeat or cut elements; it only changes the shape if the total size matches.
import numpy as np
a = np.array([1, 2, 3])
b = np.resize(a, (2, 4))
print(b)
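To contrast resize() with reshape() on the same input:
import numpy as np

a = np.array([1, 2, 3])
print(np.resize(a, (2, 4)))   # repeats elements: [[1 2 3 1] [2 3 1 2]]
# a.reshape(2, 4) would raise ValueError, because 3 elements cannot fill 8 slots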
Generating Random Numbers
NumPy’s random module provides functions to generate random numbers, which are
useful for simulations, testing, and machine learning.
Key Random Functions:
• np.random.rand(d0, d1, ...): Generates random floats in [0, 1) with the given dimensions.
• np.random.randn(d0, d1, ...): Generates random floats from a standard normal distribution.
• np.random.randint(low, high, size): Generates random integers in [low, high).
• np.random.choice(array, size): Samples random elements from an array.
• np.random.seed(seed): Sets a seed for reproducible results.
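A brief sketch of seeding for reproducible results:
import numpy as np

np.random.seed(42)                         # fix the seed
print(np.random.randint(1, 10, size=3))    # three random integers
np.random.seed(42)                         # reset the seed
print(np.random.randint(1, 10, size=3))    # identical output, because the seed was reset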
1) Generate Random Number
• In NumPy, numpy.random.randint() is used to generate random integers.
• It returns random integers from a specified low (inclusive) to high (exclusive) range.
• You can specify how many random numbers you want using the size parameter.
• It’s useful when you need whole numbers like 1, 5, 99, etc.
• The output can be a single integer or an array of integers.
Example:
import numpy as np; print(np.random.randint(1, 10)) # Random int between 1 and 9
2) Generate Random Float
• NumPy provides numpy.random.random() to generate random float numbers.
• It generates numbers between 0.0 (inclusive) and 1.0 (exclusive).
• It can produce a single float or an array of floats depending on the size argument.
• Useful in probability simulations or normalizing values.
• For other ranges, you can multiply the output (e.g., random() * 10 for 0–10).
Example:
import numpy as np; print(np.random.random()) # Random float between 0 and 1
3) Float (Convert or Create)
• astype(float) is used to convert a NumPy array to floating-point numbers.
• It’s helpful when you need decimal precision instead of integers.
• Floating-point arrays are often needed in scientific calculations.
• You can also directly create arrays of float type using dtype=float.
• This ensures all elements are treated as real numbers (decimals).
Example:
import numpy as np; print(np.array([1, 2, 3]).astype(float)) # Convert integers to floats
4) Generate Random Number from Array
• numpy.random.choice() picks random elements from a given array.
• It allows selecting one or multiple random elements with or without replacement.
• replace=False means no repetition, replace=True means repetition allowed.
• Useful for random sampling from a dataset.
• You can control the probability of selection using the p parameter.
Example:
import numpy as np; print(np.random.choice([10, 20, 30, 40])) # Randomly picks one
5) Random Permutation of Elements
• numpy.random.permutation() randomly rearranges elements of a sequence.
• It does not change the original array unless reassigned.
• For a 1D array, it just shuffles the elements randomly.
• No element is repeated; it’s simply reordered.
• Useful when you need random ordering without altering data content.
Example:
import numpy as np; print(np.random.permutation([1, 2, 3, 4])) # Random order
6) Random Permutation of Arrays
• numpy.random.permutation() can also be used on multi-dimensional arrays.
• It permutes along the first axis (rows) for 2D arrays.
• Columns inside rows remain unchanged, but rows are shuffled.
• It’s used to randomly reorder rows in a dataset (e.g., in ML datasets).
• The data within rows stays intact (only row positions change).
Example:
import numpy as np; print(np.random.permutation(np.array([[1,2],[3,4],[5,6]]))) # Shuffles rows
Universal Functions (ufuncs)
Ufuncs are functions that operate element-wise on arrays, optimized for speed.
• Trigonometric: np.sin(arr), np.cos(arr), np.tan(arr).
• Exponential/Logarithmic: np.exp(arr), np.log(arr), np.log10(arr).
• Absolute: np.abs(arr).
• Square Root: np.sqrt(arr).
• Rounding: np.round(arr), np.floor(arr), np.ceil(arr).
Code Example: Ufuncs
import numpy as np
arr = np.array([1, 4, 9])
# Ufuncs
sqrt = np.sqrt(arr) # [1, 2, 3]
sin = np.sin(arr) # [0.841, -0.757, 0.412]
exp = np.exp(arr) # [2.718, 54.598, 8103.084]
print("Square Root:", sqrt)
Explanation:
• Applies square root to each element.
• Why Simple: Minimal code, shows a common ufunc.
Broadcasting in NumPy
What is Broadcasting?
Broadcasting is a technique in NumPy that allows arrays of different shapes to be used
together in arithmetic operations without manually replicating data.
Instead of copying smaller arrays to match larger arrays, NumPy virtually stretches them
to perform operations efficiently.
This leads to faster computations, less memory usage, and cleaner code compared to
traditional looping methods.
Broadcasting makes mathematical operations on arrays very flexible and is one of the
key reasons why NumPy is extremely fast for numerical computing.
Why is Broadcasting Needed?
Without broadcasting, you would need to manually expand arrays using methods like tile()
or writing for loops, which are slow and memory-consuming.
Broadcasting automatically adjusts arrays at runtime, allowing elegant and fast
computation between arrays of different but compatible shapes.
It is extremely useful in:
• Machine learning (adding biases, scaling inputs)
• Image processing (adding filters)
• Matrix algebra (row/column-wise operations)
• Scientific simulations (vectorized formulas)
Broadcasting Rules (Very Important)
When NumPy performs an operation between two arrays, it compares their shapes
element-wise, starting from the trailing dimensions (i.e., from the end):
• If the dimensions are equal, or one of them is 1: broadcasting works.
• If they are different and neither is 1: broadcasting fails with a ValueError.
Steps NumPy follows:
1. Pad the smaller shape with 1s on the left if necessary.
2. Stretch the dimension where size is 1 to match the other array.
3. Operate element-by-element.
Example 1: Scalar with Array (Simple Broadcasting)
import numpy as np
a = np.array([1, 2, 3])
b = 5
result = a + b
print(result)
Example 2: 1D Array with 2D Array
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
result = a + b
print(result)
Data Distribution
1. What is Data Distribution?
Data distribution describes how data values are spread or arranged across a dataset. It
helps us understand the pattern, tendency, and variability of data.
For example, in a class of students, if most students score between 70 and 90, the data is
clustered in that range. This clustering, spread, and shape form the basis of distribution.
2. Importance of Data Distribution
• Helps choose the right statistical methods and models.
• Affects results in hypothesis testing, regression, and ML.
• Useful in identifying outliers, trends, and patterns.
3. Measures Related to Distribution
• Central Tendency: Mean, Median, Mode.
• Spread: Range, Variance, Standard Deviation.
• Shape: Skewness, Kurtosis.
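A short example computing these measures with NumPy and SciPy (the scores are made-up sample data):
import numpy as np
from scipy import stats

scores = np.array([72, 75, 78, 80, 82, 85, 88, 90])  # assumed exam scores
print("Mean:", np.mean(scores))
print("Median:", np.median(scores))
print("Std Dev:", np.std(scores))
print("Skewness:", stats.skew(scores))
print("Kurtosis:", stats.kurtosis(scores))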