1.
Setting Up Your Environment
1. Install Python and Jupyter Notebook:
- Download Python from [python.org](https://www.python.org/).
- Install Jupyter Notebook using pip:
bash
pip install notebook
- Launch Jupyter Notebook:
bash
jupyter notebook
2. Install required libraries:
bash
pip install pandas sqlite3
_____________________________________________________
2. Introduction to Pandas
Pandas is a powerful library for data manipulation and analysis in Python.
2.1 Importing Pandas
python
import pandas as pd
2.2 Creating a DataFrame
A DataFrame is a 2D table-like structure for storing data.
python
Create a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
df = pd.DataFrame(data)
print(df)
2.3 Reading Data from a CSV File
python
Load a CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df.head()) Display the first 5 rows
2.4 Basic Data Manipulation
- Selecting Columns:
python
df['Name'] Select a single column
df[['Name', 'Age']] Select multiple columns
- Filtering Rows:
python
df[df['Age'] > 30] Filter rows where Age > 30
- Adding a New Column:
python
df['Salary'] = [50000, 60000, 70000]
print(df)
- Descriptive Statistics:
python
df.describe() Summary statistics for numerical columns
_____________________________________________________
3. Introduction to SQL in Jupyter Notebook
You can use SQL to query data directly in Jupyter Notebook using the `sqlite3` library or the `pandasql`
library.
3.1 Using SQLite with Pandas
python
import sqlite3
Create a connection to a SQLite database
conn = sqlite3.connect('example.db')
Load a DataFrame into a SQL table
df.to_sql('people', conn, if_exists='replace', index=False)
Query the database using SQL
query = "SELECT * FROM people WHERE Age > 30"
result = pd.read_sql(query, conn)
print(result)
3.2 Using `pandasql` for SQL Queries
Install `pandasql`:
bash
pip install pandasql
Use it in Jupyter Notebook:
python
from pandasql import sqldf
Define a query
query = "SELECT * FROM df WHERE Age > 30"
Execute the query
result = sqldf(query)
print(result)
_____________________________________________________
4. Combining Pandas and SQL
You can use SQL to query Pandas DataFrames directly.
4.1 Querying a DataFrame with SQL
python
from pandasql import sqldf
Example DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
df = pd.DataFrame(data)
SQL query
query = """
SELECT Name, Age
FROM df
WHERE City = 'New York'
"""
Execute the query
result = sqldf(query)
print(result)
5. Practice Exercises
1. Create a DataFrame with 5 rows and 4 columns (e.g., Name, Age, City, Salary).
2. Filter rows where the Salary is greater than 50,000.
3. Add a new column called `Bonus` that is 10% of the Salary.
4. Use SQL to query the DataFrame and find people older than 30.
_________________________________________________________________________
1. Advanced Pandas Functions
1.1 Handling Missing Data
- Check for Missing Values:
python
df.isnull().sum() Count missing values per column
- Drop Missing Values:
python
df.dropna() Drop rows with missing values
df.dropna(axis=1) Drop columns with missing values
- Fill Missing Values:
python
df.fillna(0) Fill missing values with 0
df['Age'].fillna(df['Age'].mean(), inplace=True) Fill with mean
1.2 Grouping and Aggregation
- Group by a Column:
python
df.groupby('City')['Age'].mean() Average age by city
- Aggregate Functions:
python
df.groupby('City').agg({
'Age': ['mean', 'min', 'max'],
'Salary': 'sum'
})
1.3 Merging and Joining DataFrames
- Merge Two DataFrames:
python
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [50000, 60000, 70000]})
merged_df = pd.merge(df1, df2, on='ID', how='inner') Inner join
print(merged_df)
- Concatenate DataFrames:
python
concatenated_df = pd.concat([df1, df2], axis=0) Stack vertically
print(concatenated_df)
1.4 Pivot Tables
- Create a Pivot Table:
python
pivot_table = df.pivot_table(values='Salary', index='City', columns='Name', aggfunc='mean')
print(pivot_table)
1.5 Apply Functions
- Apply a Function to a Column:
python
df['Age'] = df['Age'].apply(lambda x: x + 1) Increment age by 1
- Apply a Function Row-wise:
python
df['Age_Salary_Ratio'] = df.apply(lambda row: row['Age'] / row['Salary'], axis=1)
1.6 Sorting and Ranking
- Sort by a Column:
python
df.sort_values(by='Salary', ascending=False, inplace=True)
- Rank Data:
python
df['Rank'] = df['Salary'].rank(ascending=False)
_____________________________________________________
2. Advanced SQL Queries
2.1 Basic SQL Queries
- Select with Conditions:
sql
SELECT * FROM people WHERE Age > 30 AND City = 'New York';
- Order By:
sql
SELECT * FROM people ORDER BY Salary DESC;
2.2 Aggregation in SQL
- Group By:
sql
SELECT City, AVG(Age) AS AvgAge FROM people GROUP BY City;
- Having Clause:
sql
SELECT City, AVG(Salary) AS AvgSalary
FROM people
GROUP BY City
HAVING AVG(Salary) > 50000;
2.3 Joins in SQL
- Inner Join:
sql
SELECT df1.Name, df2.Salary
FROM df1
INNER JOIN df2 ON df1.ID = df2.ID;
- Left Join:
sql
SELECT df1.Name, df2.Salary
FROM df1
LEFT JOIN df2 ON df1.ID = df2.ID;
2.4 Subqueries
- Subquery in WHERE Clause:
sql
SELECT Name, Salary
FROM people
WHERE Salary > (SELECT AVG(Salary) FROM people);
- Subquery in SELECT Clause:
sql
SELECT Name, (SELECT AVG(Salary) FROM people) AS AvgSalary
FROM people;
2.5 Window Functions
- Row Number:
sql
SELECT Name, Salary, ROW_NUMBER() OVER (ORDER BY Salary DESC) AS Rank
FROM people;
- Rank:
sql
SELECT Name, Salary, RANK() OVER (ORDER BY Salary DESC) AS Rank
FROM people;
3. Combining Pandas and SQL for Analysis
3.1 Querying DataFrames with SQL
python
from pandasql import sqldf
Example DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
df = pd.DataFrame(data)
SQL query
query = """
SELECT Name, Age
FROM df
WHERE City = 'New York'
"""
Execute the query
result = sqldf(query)
print(result)
3.2 Exporting Data to SQL
python
from sqlalchemy import create_engine
Create a SQLite database connection
engine = create_engine('sqlite:///example.db')
Export DataFrame to SQL table
df.to_sql('people', engine, if_exists='replace', index=False)
3.3 Reading Data from SQL
python
Read data from SQL table into a DataFrame
query = "SELECT * FROM people"
df_from_sql = pd.read_sql(query, engine)
print(df_from_sql)
4. Practice Exercises
1. Load a CSV file into a DataFrame and clean it by handling missing values.
2. Group the data by a categorical column and calculate summary statistics.
3. Perform an inner join on two DataFrames using both Pandas and SQL.
4. Use a window function in SQL to rank rows based on a numeric column.
5. Export a DataFrame to a SQL table and query it using SQL.