Practical

This document provides a comprehensive guide to setting up Python and Jupyter Notebook for data manipulation with Pandas and SQL. It covers installation, creating DataFrames, basic and advanced data manipulation techniques, and querying data with SQL in Jupyter Notebook, and it includes practice exercises to reinforce the material.

1. Setting Up Your Environment

1. Install Python and Jupyter Notebook:

- Download Python from [python.org](https://www.python.org/).

- Install Jupyter Notebook using pip:

bash

pip install notebook

- Launch Jupyter Notebook:

bash

jupyter notebook

2. Install the required libraries (`sqlite3` ships with the Python standard library, so only Pandas needs to be installed via pip):

bash

pip install pandas
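
To confirm that the environment is ready, the installed versions can be checked in a notebook cell (a minimal sketch):

python

import pandas as pd
import sqlite3

print(pd.__version__)          # Pandas version
print(sqlite3.sqlite_version)  # Version of the SQLite engine bundled with Python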

_____________________________________________________

2. Introduction to Pandas

Pandas is a powerful library for data manipulation and analysis in Python.

2.1 Importing Pandas

python

import pandas as pd

2.2 Creating a DataFrame

A DataFrame is a 2D table-like structure for storing data.


python

# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

2.3 Reading Data from a CSV File

python

# Load a CSV file into a DataFrame
df = pd.read_csv('data.csv')

print(df.head())  # Display the first 5 rows

2.4 Basic Data Manipulation

- Selecting Columns:

python

df['Name']           # Select a single column

df[['Name', 'Age']]  # Select multiple columns

- Filtering Rows:

python

df[df['Age'] > 30]  # Filter rows where Age > 30


- Adding a New Column:

python

df['Salary'] = [50000, 60000, 70000]

print(df)

- Descriptive Statistics:

python

df.describe()  # Summary statistics for numerical columns

_____________________________________________________

3. Introduction to SQL in Jupyter Notebook

You can query data with SQL directly in Jupyter Notebook using the `sqlite3` module from the standard library or the `pandasql` library.

3.1 Using SQLite with Pandas

python

import sqlite3

# Create a connection to a SQLite database
conn = sqlite3.connect('example.db')

# Load a DataFrame into a SQL table
df.to_sql('people', conn, if_exists='replace', index=False)

# Query the database using SQL
query = "SELECT * FROM people WHERE Age > 30"

result = pd.read_sql(query, conn)

print(result)
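
When the query results are no longer needed, the connection can be closed to release the database file (a small follow-up to the example above):

python

conn.close()  # Close the connection to example.db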

3.2 Using `pandasql` for SQL Queries

Install `pandasql`:

bash

pip install pandasql

Use it in Jupyter Notebook:

python

from pandasql import sqldf

# Define a query
query = "SELECT * FROM df WHERE Age > 30"

# Execute the query

result = sqldf(query)

print(result)

_____________________________________________________

4. Combining Pandas and SQL

You can use SQL to query Pandas DataFrames directly.

4.1 Querying a DataFrame with SQL

python

import pandas as pd
from pandasql import sqldf

# Example DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# SQL query
query = """
SELECT Name, Age
FROM df
WHERE City = 'New York'
"""

# Execute the query

result = sqldf(query)

print(result)

5. Practice Exercises

1. Create a DataFrame with 5 rows and 4 columns (e.g., Name, Age, City, Salary).

2. Filter rows where the Salary is greater than 50,000.

3. Add a new column called `Bonus` that is 10% of the Salary.

4. Use SQL to query the DataFrame and find people older than 30.
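
One possible way to work through these exercises is sketched below (the column values are illustrative, not prescribed by the exercises):

python

import pandas as pd
from pandasql import sqldf

# 1. DataFrame with 5 rows and 4 columns
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Evan'],
    'Age': [25, 30, 35, 40, 28],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Boston'],
    'Salary': [48000, 52000, 61000, 75000, 45000]
})

# 2. Filter rows where Salary is greater than 50,000
high_earners = df[df['Salary'] > 50000]

# 3. Add a Bonus column equal to 10% of the Salary
df['Bonus'] = df['Salary'] * 0.10

# 4. Use SQL to find people older than 30
older_than_30 = sqldf("SELECT * FROM df WHERE Age > 30")
print(older_than_30)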

_________________________________________________________________________
1. Advanced Pandas Functions

1.1 Handling Missing Data

- Check for Missing Values:

python

df.isnull().sum()  # Count missing values per column

- Drop Missing Values:

python

df.dropna()        # Drop rows with missing values

df.dropna(axis=1)  # Drop columns with missing values

- Fill Missing Values:

python

df.fillna(0)  # Fill missing values with 0

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing ages with the column mean

1.2 Grouping and Aggregation

- Group by a Column:

python

df.groupby('City')['Age'].mean()  # Average age by city

- Aggregate Functions:

python

df.groupby('City').agg({
    'Age': ['mean', 'min', 'max'],
    'Salary': 'sum'
})

1.3 Merging and Joining DataFrames

- Merge Two DataFrames:

python

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [50000, 60000, 70000]})

merged_df = pd.merge(df1, df2, on='ID', how='inner')  # Inner join

print(merged_df)

- Concatenate DataFrames:

python

concatenated_df = pd.concat([df1, df2], axis=0)  # Stack vertically

print(concatenated_df)

1.4 Pivot Tables

- Create a Pivot Table:

python

pivot_table = df.pivot_table(values='Salary', index='City', columns='Name', aggfunc='mean')

print(pivot_table)

1.5 Apply Functions

- Apply a Function to a Column:

python

df['Age'] = df['Age'].apply(lambda x: x + 1)  # Increment age by 1

- Apply a Function Row-wise:

python

df['Age_Salary_Ratio'] = df.apply(lambda row: row['Age'] / row['Salary'], axis=1)

1.6 Sorting and Ranking

- Sort by a Column:

python

df.sort_values(by='Salary', ascending=False, inplace=True)

- Rank Data:

python

df['Rank'] = df['Salary'].rank(ascending=False)

_____________________________________________________

2. Advanced SQL Queries

2.1 Basic SQL Queries

- Select with Conditions:

sql

SELECT * FROM people WHERE Age > 30 AND City = 'New York';

- Order By:

sql

SELECT * FROM people ORDER BY Salary DESC;

2.2 Aggregation in SQL

- Group By:

sql

SELECT City, AVG(Age) AS AvgAge FROM people GROUP BY City;

- Having Clause:

sql

SELECT City, AVG(Salary) AS AvgSalary

FROM people

GROUP BY City

HAVING AVG(Salary) > 50000;

2.3 Joins in SQL

- Inner Join:

sql

SELECT df1.Name, df2.Salary

FROM df1

INNER JOIN df2 ON df1.ID = df2.ID;

- Left Join:

sql

SELECT df1.Name, df2.Salary

FROM df1

LEFT JOIN df2 ON df1.ID = df2.ID;
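
With `pandasql`, the same joins can be run directly against the `df1` and `df2` DataFrames from section 1.3 (a minimal sketch, assuming `pandasql` is installed):

python

import pandas as pd
from pandasql import sqldf

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [50000, 60000, 70000]})

# Inner join keeps only IDs present in both DataFrames
inner = sqldf("SELECT df1.Name, df2.Salary FROM df1 INNER JOIN df2 ON df1.ID = df2.ID")

# Left join keeps every row of df1; unmatched salaries come back as NULL/NaN
left = sqldf("SELECT df1.Name, df2.Salary FROM df1 LEFT JOIN df2 ON df1.ID = df2.ID")

print(inner)
print(left)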

2.4 Subqueries

- Subquery in WHERE Clause:

sql

SELECT Name, Salary

FROM people

WHERE Salary > (SELECT AVG(Salary) FROM people);

- Subquery in SELECT Clause:

sql

SELECT Name, (SELECT AVG(Salary) FROM people) AS AvgSalary

FROM people;

2.5 Window Functions

- Row Number:

sql

SELECT Name, Salary, ROW_NUMBER() OVER (ORDER BY Salary DESC) AS Rank

FROM people;

- Rank:

sql

SELECT Name, Salary, RANK() OVER (ORDER BY Salary DESC) AS Rank


FROM people;
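
These window-function queries can also be run from the notebook against the `people` table created earlier (a minimal sketch; it assumes the table includes a Salary column, as in the earlier examples, and that the SQLite bundled with Python is version 3.25 or newer, which added window-function support):

python

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')

# Rank people by Salary, highest first
query = """
SELECT Name, Salary,
       RANK() OVER (ORDER BY Salary DESC) AS SalaryRank
FROM people
"""
ranked = pd.read_sql(query, conn)
print(ranked)

conn.close()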

3. Combining Pandas and SQL for Analysis

3.1 Querying DataFrames with SQL

python

import pandas as pd
from pandasql import sqldf

# Example DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# SQL query
query = """
SELECT Name, Age
FROM df
WHERE City = 'New York'
"""

# Execute the query

result = sqldf(query)

print(result)

3.2 Exporting Data to SQL

python

from sqlalchemy import create_engine

# Create a SQLite database connection
engine = create_engine('sqlite:///example.db')

# Export the DataFrame to a SQL table

df.to_sql('people', engine, if_exists='replace', index=False)

3.3 Reading Data from SQL

python

# Read data from the SQL table into a DataFrame

query = "SELECT * FROM people"

df_from_sql = pd.read_sql(query, engine)

print(df_from_sql)

4. Practice Exercises

1. Load a CSV file into a DataFrame and clean it by handling missing values.

2. Group the data by a categorical column and calculate summary statistics.

3. Perform an inner join on two DataFrames using both Pandas and SQL.

4. Use a window function in SQL to rank rows based on a numeric column.

5. Export a DataFrame to a SQL table and query it using SQL.
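
A possible approach to these exercises is sketched below (file names and column names are placeholders, not prescribed by the exercises):

python

import pandas as pd
from pandasql import sqldf
from sqlalchemy import create_engine

# 1. Load a CSV file and handle missing values (file name is a placeholder)
df = pd.read_csv('data.csv')
df = df.fillna(0)

# 2. Group by a categorical column and compute summary statistics
#    (assumes the file has City and Salary columns)
summary = df.groupby('City')['Salary'].agg(['mean', 'min', 'max'])

# 3. Inner join with Pandas and with SQL
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [50000, 60000, 70000]})
merged_pd = pd.merge(df1, df2, on='ID', how='inner')
merged_sql = sqldf("SELECT df1.Name, df2.Salary FROM df1 INNER JOIN df2 ON df1.ID = df2.ID")

# 4. and 5. Export a DataFrame to a SQL table, then rank rows with a window function
engine = create_engine('sqlite:///example.db')
merged_pd.to_sql('people', engine, if_exists='replace', index=False)
ranked = pd.read_sql(
    "SELECT Name, Salary, RANK() OVER (ORDER BY Salary DESC) AS SalaryRank FROM people",
    engine
)
print(summary, merged_pd, merged_sql, ranked, sep='\n\n')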
