Python & MySQL For Data Analysis

The document provides an overview of using Python and MySQL for data analysis, highlighting the importance of Python's libraries such as Pandas and Matplotlib for data manipulation and visualization. It includes detailed steps for setting up the Python environment, installing necessary libraries, and connecting to MySQL. Additionally, the document covers basic data manipulation techniques using Pandas, including reading CSV files, filtering data, handling missing values, and performing group by operations.

Uploaded by mishrarajeev797
Copyright: © All Rights Reserved

PYTHON & MYSQL FOR DATA ANALYSIS

INTRODUCTION TO PYTHON PROGRAMMING


Python is a versatile and powerful programming language that has gained
immense popularity in recent years, particularly in the fields of data analysis
and scientific computing. Its simple and readable syntax makes it an ideal
choice for beginners, while its extensive libraries and frameworks provide
advanced capabilities for experienced programmers. Python's significance in
data analysis stems from its ability to handle large datasets, perform complex
computations, and visualize data effectively, making it a valuable tool for
researchers, analysts, and decision-makers alike.

One of the key libraries that enhances Python's data manipulation capabilities
is Pandas. This open-source library provides data structures and functions
designed to facilitate the manipulation and analysis of structured data. With
its powerful DataFrame object, Pandas allows users to easily read, filter, and
aggregate data, making it a staple for data analysts. The library also
integrates seamlessly with other Python libraries, enabling users to perform
complex data operations with minimal effort.

Another essential library for data visualization in Python is Matplotlib. This
library provides a comprehensive suite of tools for creating static, animated,
and interactive visualizations in Python. With Matplotlib, users can generate a
wide array of plots and charts, including line graphs, bar charts, histograms,
and scatter plots, which are crucial for illustrating trends and patterns in data.
The ability to customize visual outputs allows analysts to present their
findings in a clear and compelling manner.

In summary, Python programming, complemented by powerful libraries like
Pandas and Matplotlib, offers an effective framework for conducting data
analysis. This practical file will delve deeper into these tools, equipping users
with the skills to harness Python's full potential for data-driven
decision-making.

SETTING UP THE ENVIRONMENT


Setting up a Python environment is a crucial step in preparing for data
analysis. The process involves installing Python, along with necessary libraries
such as Pandas and Matplotlib, and establishing a connection to MySQL for
database management. Below is a guide to help you set up your environment
efficiently.

STEP 1: INSTALL ANACONDA OR PYTHON

Anaconda is a popular distribution that simplifies package management and
deployment. It includes Python and several key libraries. To install Anaconda:

1. Visit the Anaconda website.
2. Download the installer for your operating system.
3. Follow the installation instructions provided.

If you prefer to install Python separately, download it from the official Python
website and follow the installation prompts.

STEP 2: INSTALL NECESSARY LIBRARIES

Once Anaconda or Python is installed, you can install Pandas and Matplotlib
using pip (Python’s package installer) or through Anaconda Navigator.

Using pip:

pip install pandas matplotlib

Using Anaconda:

1. Open Anaconda Navigator.
2. Go to the 'Environments' tab and select your environment.
3. Search for 'pandas' and 'matplotlib' and click 'Apply' to install.
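
A quick way to confirm that both libraries installed correctly is to import them and print their versions. This is a minimal check, not part of the original setup steps, and the version numbers printed will depend on your installation.

```python
# Sanity check: both imports should succeed without an ImportError.
import pandas as pd
import matplotlib

print("pandas version:", pd.__version__)
print("matplotlib version:", matplotlib.__version__)
```

If either import fails, the library is most likely installed in a different environment from the one running the script.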

STEP 3: INSTALL MYSQL

To work with MySQL, you need to install the server and the client. Follow
these steps:

1. Download MySQL from the MySQL website.
2. Follow the installation instructions for your operating system.
3. During installation, note down the root password as it will be required
   later.

STEP 4: CONNECT PYTHON TO MYSQL

To connect Python with MySQL, you’ll need to install the MySQL Connector
library. You can do this using pip:

pip install mysql-connector-python

STEP 5: ESTABLISH A CONNECTION

Once installed, you can establish a connection to MySQL using the following
code snippet:

import mysql.connector

connection = mysql.connector.connect(
    host='localhost',
    user='your_username',
    password='your_password',
    database='your_database'
)

if connection.is_connected():
    print("Successfully connected to the database")
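
Once a connection is open, query results can be loaded into a Pandas DataFrame for analysis. The following is a minimal sketch under stated assumptions: the helper name query_to_dataframe and the table name employees are hypothetical, and the function relies only on the standard DB-API cursor interface (execute, fetchall, description) that MySQL Connector implements.

```python
import pandas as pd


def query_to_dataframe(connection, sql):
    """Run a SELECT statement and return the rows as a DataFrame.

    `connection` is assumed to be an open DB-API connection, for
    example the object returned by mysql.connector.connect().
    """
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        rows = cursor.fetchall()
        # cursor.description holds one tuple per result column;
        # its first element is the column name.
        columns = [col[0] for col in cursor.description]
    finally:
        cursor.close()
    return pd.DataFrame(rows, columns=columns)


# Hypothetical usage (requires a running MySQL server and an
# existing table; `employees` is only an illustrative name):
# df = query_to_dataframe(connection, "SELECT * FROM employees")
# connection.close()
```

Closing the cursor in a finally block releases it even if the query fails. The connection itself should be closed once all queries are done.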

This setup provides a solid foundation for data analysis using Python,
enabling you to manipulate data with Pandas, visualize it with Matplotlib, and
manage it through MySQL.

DATA MANIPULATION WITH PANDAS


Pandas is an essential library for data manipulation in Python, providing
powerful tools for data analysis through its DataFrame and Series objects.
Understanding basic operations in Pandas is crucial for any data analyst.

READING CSV FILES

One of the most common tasks in data analysis is to read data from CSV files.
Pandas makes this easy with the read_csv() function. For example:

import pandas as pd

data = pd.read_csv('data.csv')

This command reads the data from the specified CSV file and stores it in a
DataFrame named data. The DataFrame is a 2-dimensional labeled data
structure, similar to a spreadsheet, which allows for easy manipulation and
analysis.
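
After loading a file, it is usually worth inspecting what was read. The sketch below uses an in-memory string as a stand-in for data.csv so that it runs without any file on disk; the column names and values are assumptions borrowed from the examples in this section.

```python
import io
import pandas as pd

# Stand-in for the contents of a file such as 'data.csv'.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first rows (up to five)
print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # inferred type of each column
```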

CREATING DATAFRAMES

In addition to reading data from files, you can create DataFrames directly
from dictionaries or lists. For instance:

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

This creates a DataFrame containing names, ages, and cities.

SELECTING AND FILTERING DATA

Pandas allows for powerful data selection and filtering capabilities. You can
select specific columns or rows using indexing. For example, to select the
'Name' column:

names = df['Name']

To filter the DataFrame based on a condition, such as finding all individuals
aged over 30:

filtered_data = df[df['Age'] > 30]
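
Row filtering and column selection can also be combined in a single step with .loc. The sketch below rebuilds the small DataFrame from this section and, as an illustrative variation, uses the condition Age >= 30 rather than Age > 30.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Rows where Age is at least 30, restricted to two columns.
older = df.loc[df['Age'] >= 30, ['Name', 'City']]
print(older)
```

The first argument to .loc selects rows (here a Boolean mask) and the second selects columns by label.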


HANDLING MISSING VALUES

Missing values can pose challenges in data analysis. Pandas provides
functions such as isnull() and dropna() for handling these values. To
check for missing values:

missing = df.isnull().sum()

To remove rows with missing values, you can use:

cleaned_data = df.dropna()
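
Calling dropna() with no arguments removes every row that has any missing value, which can discard more data than intended. A hedged sketch of a narrower alternative, using the subset parameter on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': ['New York', 'Los Angeles', None]
})

# Count of missing entries per column.
print(df.isnull().sum())

# Drop a row only when 'Age' is missing; a missing 'City' is tolerated.
cleaned_age = df.dropna(subset=['Age'])
print(cleaned_age)
```

Here the plain df.dropna() would keep only Alice's row, while the subset version keeps both Alice and Charlie.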

PERFORMING GROUP BY OPERATIONS

Group by operations are essential for aggregating data based on categories.
The groupby() function allows you to group data and apply aggregation
functions. For example, to calculate the average age by city:

average_age = df.groupby('City')['Age'].mean()

This command groups the data by the 'City' column and computes the mean
of the 'Age' column for each group.

These basic operations are fundamental for effective data manipulation in
Pandas, setting the stage for more complex analyses and insights.

PROGRAM 1: BASIC DATAFRAME OPERATIONS


To demonstrate basic DataFrame creation and manipulation using Pandas,
let’s start by creating a sample DataFrame and performing some common
operations. Below is a Python program that illustrates these concepts.

SAMPLE DATA

Let's assume we have the following input data representing employees in a
company:

data = {
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [28, 34, 29, 45],
    'Department': ['HR', 'IT', 'Finance', 'IT']
}

CREATING A DATAFRAME

We will create a DataFrame using the sample data:

import pandas as pd

# Creating DataFrame
df = pd.DataFrame(data)

# Displaying the DataFrame
print("Initial DataFrame:")
print(df)

OUTPUT EXAMPLE

When you run the above code, you'll get the initial DataFrame displayed as
follows:

Initial DataFrame:
   EmployeeID     Name  Age  Department
0         101    Alice   28          HR
1         102      Bob   34          IT
2         103  Charlie   29     Finance
3         104    David   45          IT

SELECTING COLUMNS

Next, let’s select the 'Name' and 'Age' columns:


selected_columns = df[['Name', 'Age']]
print("\nSelected Columns (Name and Age):")
print(selected_columns)

OUTPUT EXAMPLE

The output will display:

Selected Columns (Name and Age):
      Name  Age
0    Alice   28
1      Bob   34
2  Charlie   29
3    David   45

FILTERING ROWS

We can filter employees who are older than 30:

filtered_employees = df[df['Age'] > 30]


print("\nEmployees Older Than 30:")
print(filtered_employees)

OUTPUT EXAMPLE

The output will show:

Employees Older Than 30:
   EmployeeID   Name  Age  Department
1         102    Bob   34          IT
3         104  David   45          IT

ADDING A NEW COLUMN

We can add a new column to indicate if the employee is over 30:


df['Over_30'] = df['Age'] > 30
print("\nDataFrame with New Column 'Over_30':")
print(df)

OUTPUT EXAMPLE

The updated DataFrame will look like this:

DataFrame with New Column 'Over_30':
   EmployeeID     Name  Age  Department  Over_30
0         101    Alice   28          HR    False
1         102      Bob   34          IT     True
2         103  Charlie   29     Finance    False
3         104    David   45          IT     True

CONCLUSION

This program demonstrates basic DataFrame operations in Pandas, including
DataFrame creation, selection of columns, filtering of rows, and the addition
of new columns. Each operation contributes to a more comprehensive
understanding of how to manipulate data effectively using Pandas.

PROGRAM 2: DATA FILTERING


Data filtering is a crucial aspect of data analysis, allowing analysts to extract
meaningful insights from large datasets. In this section, we will develop a
program that filters data within a DataFrame based on specific conditions and
outputs the results. We will utilize the Pandas library for this task, leveraging
its powerful filtering capabilities.

SAMPLE DATA

For our example, let's consider a dataset containing information about
various products in a store. The dataset includes the following columns:
ProductID, ProductName, Category, Price, and Stock.

data = {
    'ProductID': [1, 2, 3, 4, 5],
    'ProductName': ['Laptop', 'Mouse', 'Keyboard',
                    'Monitor', 'Printer'],
    'Category': ['Electronics', 'Accessories',
                 'Accessories', 'Electronics', 'Office'],
    'Price': [1200, 25, 45, 300, 150],
    'Stock': [50, 200, 150, 100, 80]
}

CREATING THE DATAFRAME

We will first create a DataFrame using this sample data:

import pandas as pd

# Creating the DataFrame


df = pd.DataFrame(data)
print("Initial Product DataFrame:")
print(df)

FILTERING DATA

Next, we'll filter the DataFrame to find products that belong to the
Electronics category and have a price greater than $200. This filtering
allows us to focus on higher-end electronic items.

filtered_products = df[(df['Category'] == 'Electronics')
                       & (df['Price'] > 200)]
print("\nFiltered Products (Electronics and Price > 200):")
print(filtered_products)

OUTPUT EXAMPLE

When you run the above code, the output will display the filtered DataFrame:

Filtered Products (Electronics and Price > 200):
   ProductID  ProductName     Category  Price  Stock
0          1       Laptop  Electronics   1200     50
3          4      Monitor  Electronics    300    100

FURTHER FILTERING

Additionally, we can perform more complex filtering, such as finding products
that are either Electronics or Accessories and have stock levels greater
than 100. This will help in identifying products that are readily available for
sale.

further_filtered_products = df[
    ((df['Category'] == 'Electronics') | (df['Category'] == 'Accessories'))
    & (df['Stock'] > 100)
]
print("\nFurther Filtered Products (Electronics or Accessories with Stock > 100):")
print(further_filtered_products)

OUTPUT EXAMPLE

The output will show:

Further Filtered Products (Electronics or Accessories with Stock > 100):
   ProductID  ProductName     Category  Price  Stock
1          2        Mouse  Accessories     25    200
2          3     Keyboard  Accessories     45    150
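
When a column has to match one of several values, chaining == comparisons with | becomes unwieldy. A possible alternative, sketched here on the same product data, is the isin() method; the variable names wanted and in_stock are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductID': [1, 2, 3, 4, 5],
    'ProductName': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Printer'],
    'Category': ['Electronics', 'Accessories', 'Accessories',
                 'Electronics', 'Office'],
    'Price': [1200, 25, 45, 300, 150],
    'Stock': [50, 200, 150, 100, 80]
})

# isin() tests membership in a list of allowed categories.
wanted = ['Electronics', 'Accessories']
in_stock = df[df['Category'].isin(wanted) & (df['Stock'] > 100)]
print(in_stock)
```

This selects the same two rows (Mouse and Keyboard) as the chained comparison, and adding a further category only requires extending the list.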

CONCLUSION

This program illustrates how to filter data in a DataFrame using specific
conditions with Pandas. By utilizing logical operators and conditions, we can
extract precise subsets of data, enabling us to conduct more focused
analyses. Data filtering is an essential skill for any data analyst, as it allows for
the identification of trends and patterns that are crucial for informed
decision-making.

PROGRAM 3: HANDLING MISSING VALUES

Handling missing values is a critical step in data preprocessing, as it can
significantly impact the outcomes of data analysis and modeling. In this
program, we will explore various techniques to fill or manage missing values
in a DataFrame using the Pandas library. We will demonstrate methods such
as forward fill, backward fill, and filling with specific values or statistical
measures.

SAMPLE DATA

Let's create a sample dataset that includes some missing values to illustrate
our techniques. Our dataset will consist of information about students,
including their Name, Age, and Score.

import pandas as pd
import numpy as np

# Sample data with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [23, np.nan, 21, 24, np.nan],
    'Score': [85, 90, np.nan, 70, 95]
}

df = pd.DataFrame(data)
print("Initial DataFrame with Missing Values:")
print(df)

IDENTIFYING MISSING VALUES

Before we can handle missing values, we need to identify their locations. We
can use the isnull() method to check for missing values:

missing_values = df.isnull().sum()
print("\nMissing Values Count:")
print(missing_values)

FILLING MISSING VALUES

1. Forward Fill: This method replaces missing values with the last valid
   observation. Recent versions of Pandas deprecate
   fillna(method='ffill'), so the dedicated ffill() method is used here.

df_ffill = df.ffill()
print("\nDataFrame after Forward Fill:")
print(df_ffill)

2. Backward Fill: This technique fills missing values with the next valid
   observation. In the sample data, the last Age value stays missing
   because no later row supplies a value.

df_bfill = df.bfill()
print("\nDataFrame after Backward Fill:")
print(df_bfill)

3. Fill with a Specific Value: We can also fill missing values with a specific
   constant, such as 0 or any other value relevant to our analysis.

df_fill_zero = df.fillna(0)
print("\nDataFrame after Filling with Zero:")
print(df_fill_zero)

4. Fill with Mean/Median: For numerical columns, filling missing values
   with the mean or median can be a good strategy.

mean_age = df['Age'].mean()
df_fill_mean = df.fillna({'Age': mean_age})
print("\nDataFrame after Filling Missing Age with Mean:")
print(df_fill_mean)
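
The mean is sensitive to outliers, so the median is sometimes a safer fill value. The sketch below fills each numeric column with its own median; it reuses the student data from this program, and the names medians and df_fill_median are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [23, np.nan, 21, 24, np.nan],
    'Score': [85, 90, np.nan, 70, 95]
})

# Median of each numeric column, computed from the non-missing values.
medians = df[['Age', 'Score']].median()

# Passing a Series (or dict) to fillna fills each column with its own value.
df_fill_median = df.fillna(medians)
print(df_fill_median)
```

With this data the missing ages become 23.0 (the median of 23, 21, and 24) and the missing score becomes 87.5 (the median of 85, 90, 70, and 95).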

CONCLUSION OF HANDLING MISSING VALUES

In this program, we demonstrated several techniques for handling missing
values in a DataFrame using Pandas. By filling missing data appropriately, we
can ensure that our analysis remains robust and accurate. Each method has
its use cases, and the choice of technique depends on the context of the data
and the analysis requirements.

PROGRAM 4: GROUP BY OPERATIONS


The groupby() function in Pandas is a powerful tool for aggregating data
based on specific categories. This functionality allows data analysts to extract
meaningful insights by summarizing data across multiple dimensions. In this
section, we will illustrate how to use the groupby() function to perform
aggregations and display the results effectively.

SAMPLE DATA

To demonstrate group by operations, let's create a sample dataset
representing sales transactions within a retail store. The dataset will have the
following columns: TransactionID, Product, Category, SalesAmount, and
Quantity.

import pandas as pd

data = {
    'TransactionID': [1, 2, 3, 4, 5, 6],
    'Product': ['Apple', 'Banana', 'Orange', 'Apple',
                'Banana', 'Orange'],
    'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit',
                 'Fruit', 'Fruit'],
    'SalesAmount': [100, 150, 200, 120, 160, 210],
    'Quantity': [10, 15, 20, 12, 18, 25]
}

df = pd.DataFrame(data)
print("Initial Sales DataFrame:")
print(df)

GROUPING DATA

To analyze total sales and quantities sold for each product, we can group the
data by Product and then apply aggregation functions like sum() to
calculate the total SalesAmount and Quantity.

grouped_data = df.groupby('Product').agg(
    {'SalesAmount': 'sum', 'Quantity': 'sum'}
).reset_index()
print("\nGrouped Sales Data by Product:")
print(grouped_data)

OUTPUT EXAMPLE

The output will display the total sales and quantities for each product:

Grouped Sales Data by Product:
   Product  SalesAmount  Quantity
0    Apple          220        22
1   Banana          310        33
2   Orange          410        45

ADDITIONAL AGGREGATIONS

The groupby() function can also compute multiple aggregation functions
for different columns simultaneously. For instance, we can calculate both the
total and the average of the sales amount and of the quantity sold per product:

additional_grouped_data = df.groupby('Product').agg(
    {'SalesAmount': ['sum', 'mean'], 'Quantity': ['sum', 'mean']}
).reset_index()
print("\nGrouped Sales Data with Multiple Aggregations:")
print(additional_grouped_data)

OUTPUT EXAMPLE

The output will show both the total and average values:

Grouped Sales Data with Multiple Aggregations:
  Product SalesAmount        Quantity
                  sum   mean      sum  mean
0   Apple         220  110.0       22  11.0
1  Banana         310  155.0       33  16.5
2  Orange         410  205.0       45  22.5
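
Passing lists of functions to agg() produces two-level column headers, which can be awkward to work with afterwards. One way to get flat, descriptive column names is named aggregation; the sketch below assumes the same sales data, and the output names total_sales and avg_quantity are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'SalesAmount': [100, 150, 200, 120, 160, 210],
    'Quantity': [10, 15, 20, 12, 18, 25]
})

# Each keyword names an output column: (input column, aggregation).
summary = df.groupby('Product').agg(
    total_sales=('SalesAmount', 'sum'),
    avg_quantity=('Quantity', 'mean')
).reset_index()
print(summary)
```

The result has ordinary single-level columns (Product, total_sales, avg_quantity), so it can be filtered or exported like any other DataFrame.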

FILTERING GROUPED RESULTS

After grouping and aggregating data, you may want to filter the results based
on specific criteria. For example, let’s say we only want to see products where
the total sales are greater than $250:

filtered_grouped_data = grouped_data[grouped_data['SalesAmount'] > 250]
print("\nFiltered Grouped Sales Data (SalesAmount > 250):")
print(filtered_grouped_data)

OUTPUT EXAMPLE

The output will display only the filtered results:

Filtered Grouped Sales Data (SalesAmount > 250):
   Product  SalesAmount  Quantity
1   Banana          310        33
2   Orange          410        45

Using the groupby() function in Pandas provides a flexible and efficient
way to perform data aggregation and analysis. By summarizing data based
on categories, analysts can gain insights that are crucial for decision-making
and strategic planning.

PROGRAM 5: DATA VISUALIZATION WITH MATPLOTLIB

Data visualization is an essential component of data analysis, enabling
analysts to present complex data in a more understandable and visually
appealing manner. One of the most powerful libraries for creating static,
animated, and interactive visualizations in Python is Matplotlib. In this
program, we will explore how to use Matplotlib to generate various types of
plots (specifically line plots, bar charts, and histograms) using a DataFrame.

IMPORTING LIBRARIES

Before we start visualizing data, we need to import the necessary libraries.
Ensure you have both Pandas and Matplotlib installed. If not, you can install
them using pip.

import pandas as pd
import matplotlib.pyplot as plt

SAMPLE DATA

For our visualization examples, let's create a simple DataFrame containing
sales data for different products over a period of time.

data = {
    'Month': ['January', 'February', 'March', 'April',
              'May', 'June'],
    'Sales_A': [200, 300, 250, 400, 350, 450],
    'Sales_B': [150, 250, 300, 200, 500, 600]
}

df = pd.DataFrame(data)
print("Sales Data:")
print(df)

LINE PLOT

Line plots are ideal for visualizing trends over time. We can create a line plot
to show the sales performance of two products (Sales_A and Sales_B) across
the months.

plt.figure(figsize=(10, 5))
plt.plot(df['Month'], df['Sales_A'], marker='o',
label='Product A', color='blue')
plt.plot(df['Month'], df['Sales_B'], marker='o',
label='Product B', color='orange')
plt.title('Monthly Sales for Products A and B')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend()
plt.grid()
plt.show()

BAR CHART

A bar chart is useful for comparing quantities across different categories. We
can create a grouped bar chart to compare the sales of both products side by
side for each month.

# Grouped bar chart for all six months
plt.figure(figsize=(8, 5))
bar_width = 0.35
x = range(len(df['Month']))

# Product B bars are shifted right by one bar width so the pairs sit side by side
plt.bar(x, df['Sales_A'], width=bar_width, label='Product A', color='blue')
plt.bar([p + bar_width for p in x], df['Sales_B'],
        width=bar_width, label='Product B', color='orange')

plt.title('Sales Comparison for January to June')
plt.xlabel('Month')
plt.ylabel('Sales')
# Center each month label between its two bars
plt.xticks([p + bar_width / 2 for p in x], df['Month'])
plt.legend()
plt.show()
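
Computing bar offsets by hand is instructive, but Pandas can also draw grouped bars directly from a DataFrame. The following sketch uses DataFrame.plot.bar on the same sales data; the output file name sales_comparison.png is an assumption, and the figure is saved rather than shown so that the example also runs without a display.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'Sales_A': [200, 300, 250, 400, 350, 450],
    'Sales_B': [150, 250, 300, 200, 500, 600]
})

# DataFrame.plot.bar draws one group of bars per row and
# one bar per listed column, and returns the Matplotlib Axes.
ax = df.plot.bar(x='Month', y=['Sales_A', 'Sales_B'],
                 color=['blue', 'orange'], figsize=(8, 5), rot=0)
ax.set_title('Sales Comparison for January to June')
ax.set_ylabel('Sales')

# Hypothetical output file name.
ax.figure.savefig('sales_comparison.png')
plt.close(ax.figure)
```

Because the call returns a regular Axes object, all the usual Matplotlib customizations (titles, labels, legends) remain available.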

HISTOGRAM

Histograms are useful for understanding the distribution of numerical data.
We can visualize the distribution of sales data for Product A.

plt.figure(figsize=(8, 5))
plt.hist(df['Sales_A'], bins=5, color='blue', alpha=0.7,
edgecolor='black')
plt.title('Distribution of Sales for Product A')
plt.xlabel('Sales Amount')
plt.ylabel('Frequency')
plt.grid()
plt.show()

CONCLUSION

This program demonstrates how to utilize Matplotlib to create a variety of
plots (line plots, bar charts, and histograms) based on DataFrame data.
These visualizations help in understanding trends, comparisons, and
distributions, which are critical for effective data analysis and communication
of findings.

PROGRAM 6: COMBINING DATAFRAMES


Combining DataFrames is a fundamental aspect of data manipulation in
Pandas, allowing analysts to merge or concatenate datasets to derive more
comprehensive insights. There are two primary methods for combining
DataFrames: concatenation and merging. Each method serves a distinct
purpose and has its own use cases.

CONCATENATION

Concatenation is the process of appending DataFrames along a particular
axis, either vertically (stacking rows) or horizontally (stacking columns). The
pd.concat() function is used for this purpose.

Example of Concatenating DataFrames

Consider two DataFrames containing sales data for different quarters:

import pandas as pd

# DataFrames for Q1 and Q2
data_q1 = {
    'Product': ['A', 'B', 'C'],
    'Sales': [150, 200, 250]
}
data_q2 = {
    'Product': ['A', 'B', 'C'],
    'Sales': [180, 220, 270]
}
df_q1 = pd.DataFrame(data_q1)
df_q2 = pd.DataFrame(data_q2)

# Concatenating DataFrames; ignore_index=True renumbers the rows 0 to 5
df_combined = pd.concat([df_q1, df_q2],
                        ignore_index=True)
print("Concatenated DataFrame:")
print(df_combined)

The output will show a single DataFrame containing the sales data from both
quarters, stacked vertically.
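
After a plain vertical concatenation, nothing in the combined rows records which quarter each one came from. One way to preserve that information, sketched here with an assumed column name Quarter, is to label each DataFrame before concatenating.

```python
import pandas as pd

df_q1 = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Sales': [150, 200, 250]})
df_q2 = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Sales': [180, 220, 270]})

# assign() returns a copy with the extra column, leaving the inputs unchanged.
df_labeled = pd.concat(
    [df_q1.assign(Quarter='Q1'), df_q2.assign(Quarter='Q2')],
    ignore_index=True
)
print(df_labeled)

# With the label in place, per-quarter totals are a one-line groupby.
quarter_totals = df_labeled.groupby('Quarter')['Sales'].sum()
print(quarter_totals)
```

For this data the quarterly totals are 600 for Q1 and 670 for Q2.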

MERGING

Merging, on the other hand, is used to combine DataFrames based on
common columns or indices. This is akin to SQL joins, where you can specify
how to align rows from different DataFrames based on shared keys. The
pd.merge() function facilitates this process.

Example of Merging DataFrames

Assume we have two DataFrames: one containing product information and
another containing sales data.

# Product DataFrame
data_products = {
    'ProductID': [1, 2, 3],
    'Product': ['A', 'B', 'C']
}
df_products = pd.DataFrame(data_products)

# Sales DataFrame
data_sales = {
    'ProductID': [1, 2, 1],
    'Sales': [150, 200, 180]
}
df_sales = pd.DataFrame(data_sales)

# Merging DataFrames on 'ProductID'
df_merged = pd.merge(df_products, df_sales,
                     on='ProductID')
print("\nMerged DataFrame:")
print(df_merged)

This will generate a DataFrame that includes product names alongside their
corresponding sales figures, effectively integrating data from both sources.
Because pd.merge() performs an inner join by default, product C, which has no
rows in the sales data, does not appear in the result.
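
The type of join can be changed with the how parameter of pd.merge(). As a minimal sketch on the same two DataFrames, a left join keeps every product, including those without any sales; the variable names df_left and unsold are illustrative.

```python
import pandas as pd

df_products = pd.DataFrame({'ProductID': [1, 2, 3],
                            'Product': ['A', 'B', 'C']})
df_sales = pd.DataFrame({'ProductID': [1, 2, 1],
                         'Sales': [150, 200, 180]})

# how='left' keeps every row of df_products, matched or not.
df_left = pd.merge(df_products, df_sales, on='ProductID', how='left')
print(df_left)

# Products that never appear in df_sales have a missing Sales value.
unsold = df_left[df_left['Sales'].isnull()]
print(unsold['Product'].tolist())
```

Product A appears twice in the result because it has two matching sales rows, and product C appears once with a missing Sales value.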

KEY DIFFERENCES

• Concatenation is primarily used when you want to stack DataFrames
  either vertically or horizontally without considering the relationships
  between them, while merging is utilized to combine DataFrames based
  on common keys, aligning data that is related.

• Concatenation results in a larger DataFrame with additional rows or
  columns, whereas merging produces a new DataFrame that relates
  records based on shared attributes.

Understanding these methods for combining DataFrames is crucial for
effective data analysis, enabling analysts to work with comprehensive
datasets that provide deeper insights into their data.

PROGRAM 7: EXPORTING DATA TO CSV


Exporting data to CSV (Comma-Separated Values) format is a common
requirement in data analysis, allowing users to save manipulated datasets for
further analysis or sharing with others. In this program, we will demonstrate
how to export a Pandas DataFrame to a CSV file after performing some data
manipulations.

SAMPLE DATAFRAME CREATION

Let's begin by creating a sample DataFrame that we will manipulate and then
export. For this example, we will create a simple dataset representing
employee information.

import pandas as pd

# Sample employee data
data = {
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [28, 34, 29, 45],
    'Department': ['HR', 'IT', 'Finance', 'IT']
}

df = pd.DataFrame(data)
print("Initial Employee DataFrame:")
print(df)

DATA MANIPULATION

Before we export the DataFrame, we might want to perform some
manipulations, such as keeping only employees older than 30 and adding a
new column that indicates if they are senior employees.

# Keeping only employees older than 30; .copy() makes filtered_df an
# independent DataFrame, so adding a column below does not trigger a
# SettingWithCopyWarning
filtered_df = df[df['Age'] > 30].copy()

# Adding a new column to indicate senior status (older than 40)
filtered_df['Senior'] = filtered_df['Age'] > 40
print("\nFiltered Employee DataFrame:")
print(filtered_df)

EXPORTING TO CSV

Now that we have our manipulated DataFrame, we can export it to a CSV file
using the to_csv() method provided by Pandas. We will specify the name
of the file and ensure that the index is not included in the output file.

# Exporting the DataFrame to a CSV file
filtered_df.to_csv('filtered_employees.csv', index=False)
print("\nFiltered employee data exported to 'filtered_employees.csv'.")

READING THE EXPORTED CSV

To ensure that our data has been exported correctly, we can read the newly
created CSV file back into a DataFrame and display its contents.

# Reading the exported CSV file
import os

if os.path.exists('filtered_employees.csv'):
    exported_df = pd.read_csv('filtered_employees.csv')
    print("\nData read from 'filtered_employees.csv':")
    print(exported_df)

CONCLUSION

In this program, we demonstrated how to create a DataFrame, perform some
basic manipulations, and export the resulting DataFrame to a CSV file using
Pandas. This process is essential for data analysts who need to save their
work for future analysis or share insights with others. The ability to easily
export data is one of the many powerful features of the Pandas library,
making it an invaluable tool in data analysis workflows.

PROGRAM 8: TIME SERIES ANALYSIS


Time series analysis is a crucial technique used to analyze data points
collected or recorded at specific time intervals. With the rise of data-driven
decision-making, understanding how to manipulate and analyze time series
data has become increasingly important. In this program, we will illustrate
how to perform time series analysis using the Pandas library, including date-
time indexing and basic operations.

IMPORTING LIBRARIES

To get started, we need to import the required libraries. Ensure you have
Pandas installed in your Python environment.

import pandas as pd
import numpy as np

CREATING SAMPLE TIME SERIES DATA

Let's create a sample time series dataset representing daily sales data over a
month. The dataset will consist of dates and corresponding sales figures.

# Create a date range
date_rng = pd.date_range(start='2023-01-01',
                         end='2023-01-31', freq='D')

# Create sample sales data with some random values
np.random.seed(0)  # For reproducibility
sales_data = np.random.randint(100, 500,
                               size=(len(date_rng)))

# Create a DataFrame
df = pd.DataFrame(data={'Date': date_rng, 'Sales': sales_data})
df.set_index('Date', inplace=True)

print("Sample Time Series Data:")
print(df)

DATE-TIME INDEXING

With our DataFrame set up, we can leverage the date-time index to perform
various time series operations. For instance, we can easily access sales data
for specific dates or periods.

Accessing Data by Date

To retrieve sales data for a specific date:

specific_date = df.loc['2023-01-15']
print("\nSales on January 15, 2023:")
print(specific_date)
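
Beyond single dates, a date-time index also supports selecting a range of dates and filtering on calendar properties. The sketch below rebuilds the January sales data and shows both; the variable names mid_month and mondays are illustrative.

```python
import numpy as np
import pandas as pd

date_rng = pd.date_range(start='2023-01-01', end='2023-01-31', freq='D')
np.random.seed(0)
df = pd.DataFrame({'Sales': np.random.randint(100, 500, size=len(date_rng))},
                  index=date_rng)

# Label-based slicing on a DatetimeIndex includes both endpoints.
mid_month = df.loc['2023-01-10':'2023-01-15']
print(mid_month)

# Boolean mask on the index: all Mondays in the month (dayofweek 0).
mondays = df[df.index.dayofweek == 0]
print(mondays)
```

Note that, unlike ordinary Python slices, the date slice returns six rows because the end date is included.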

RESAMPLING DATA

One of the powerful features of time series data is the ability to resample it.
We can aggregate our daily sales data to weekly sales totals.

weekly_sales = df.resample('W').sum()
print("\nWeekly Sales Summary:")
print(weekly_sales)
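
It helps to know how the weekly bins are formed before interpreting the totals. With the 'W' frequency, each bin ends on a Sunday and is labeled with that date. The sketch below counts the days in each bin alongside the sum to make this visible; it rebuilds the same January data.

```python
import numpy as np
import pandas as pd

date_rng = pd.date_range(start='2023-01-01', end='2023-01-31', freq='D')
np.random.seed(0)
df = pd.DataFrame({'Sales': np.random.randint(100, 500, size=len(date_rng))},
                  index=date_rng)

# 'W' bins end on Sunday and are labeled by that end date.
weekly = df['Sales'].resample('W').agg(['sum', 'count'])
print(weekly)
```

Because January 1, 2023 is a Sunday, the first bin contains only that single day, and the last bin (labeled 2023-02-05) contains only January 30 and 31. Partial weeks like these should not be compared directly with full seven-day totals.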

ROLLING STATISTICS

Another useful technique is calculating rolling statistics, such as the rolling
mean, to understand trends over time. Here, we'll compute a 7-day rolling
average of sales.

df['Rolling Mean'] = df['Sales'].rolling(window=7).mean()


print("\nDataFrame with 7-Day Rolling Mean:")
print(df)
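
With window=7, the first six rows of the rolling mean are missing because fewer than seven observations are available. If partial windows are acceptable, the min_periods parameter relaxes this requirement. A small sketch on made-up sales figures (the numbers are illustrative only):

```python
import pandas as pd

sales = pd.Series([100, 120, 90, 110, 130, 105, 115, 125, 95, 140],
                  index=pd.date_range('2023-01-01', periods=10, freq='D'))

# Default: a value appears only once seven observations are available.
strict = sales.rolling(window=7).mean()

# min_periods=1: average whatever is available, up to seven days.
lenient = sales.rolling(window=7, min_periods=1).mean()

print(pd.DataFrame({'Sales': sales, 'strict': strict, 'lenient': lenient}))
```

The lenient version has no gaps at the start, but its early values are averages over fewer days and are therefore noisier.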

PLOTTING TIME SERIES DATA

Finally, visualizing time series data can provide insightful trends and patterns.
Using Matplotlib, we can plot the original sales data along with the rolling
mean.

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Sales'], label='Daily Sales',
         color='blue', marker='o')
plt.plot(df.index, df['Rolling Mean'], label='7-Day Rolling Mean',
         color='orange', linewidth=2)
plt.title('Daily Sales Data with 7-Day Rolling Mean')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid()
plt.show()

CONCLUSION

Through this program, we demonstrated how to analyze time series data
using Pandas, including creating a date-time indexed DataFrame, accessing
specific dates, resampling, calculating rolling statistics, and visualizing trends.
Mastering these techniques provides a solid foundation for further
exploration of time series analysis and its applications in various domains.

PROGRAM 9: ADVANCED DATA VISUALIZATION


Data visualization is a pivotal aspect of data analysis, allowing analysts to
present complex datasets in an easily digestible format. In this section, we
will explore advanced visualization techniques using Matplotlib, specifically
focusing on subplots and styling to create compelling visualizations.

UTILIZING SUBPLOTS

Subplots allow for the creation of multiple plots within a single figure, which
is particularly useful for comparing different datasets or visualizing various
aspects of a single dataset side by side. The plt.subplots() function
provides an efficient way to generate a grid of plots.

Example of Creating Subplots

Let's create a figure with multiple subplots to visualize sales data for two
different products across several months. We will use the same sales data
from previous examples but display it across different plot types.

import pandas as pd
import matplotlib.pyplot as plt

# Sample sales data
data = {
    'Month': ['January', 'February', 'March', 'April',
              'May', 'June'],
    'Sales_A': [200, 300, 250, 400, 350, 450],
    'Sales_B': [150, 250, 300, 200, 500, 600]
}
df = pd.DataFrame(data)

# Creating subplots
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('Sales Data Visualization', fontsize=16)

# Line plot for Product A


axs[0, 0].plot(df['Month'], df['Sales_A'], marker='o',
color='blue', linestyle='-', label='Product A')
axs[0, 0].set_title('Product A Sales')
axs[0, 0].set_xlabel('Month')
axs[0, 0].set_ylabel('Sales')
axs[0, 0].grid()

# Line plot for Product B


axs[0, 1].plot(df['Month'], df['Sales_B'], marker='o',
color='orange', linestyle='-', label='Product B')
axs[0, 1].set_title('Product B Sales')
axs[0, 1].set_xlabel('Month')
axs[0, 1].set_ylabel('Sales')
axs[0, 1].grid()

# Bar plot for comparison


axs[1, 0].bar(df['Month'], df['Sales_A'], width=0.4,
label='Product A', color='blue', alpha=0.7)
axs[1, 0].bar(df['Month'], df['Sales_B'], width=0.4,
label='Product B', color='orange', alpha=0.7,
bottom=df['Sales_A'])
axs[1, 0].set_title('Sales Comparison')
axs[1, 0].set_ylabel('Total Sales')
axs[1, 0].legend()

# Displaying the plots


plt.tight_layout(rect=[0, 0, 1, 0.96]) # Adjusting
layout to accommodate title
plt.show()
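
The two line plots above differ only in the data column, the color, and the
title, so the same panels can be produced in a loop. The following is a minimal
sketch, not part of the original program: it assumes a 1x2 grid and passes
sharey=True so both products are drawn against the same vertical scale. The
variable names (months, sales, colors) are illustrative.

```python
import matplotlib.pyplot as plt

months = ['January', 'February', 'March', 'April', 'May', 'June']
sales = {
    'Product A': [200, 300, 250, 400, 350, 450],
    'Product B': [150, 250, 300, 200, 500, 600],
}
colors = ['blue', 'orange']

# One row, two columns; sharey=True keeps the y-axis limits identical
fig, axs = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

# axs.flat walks over the axes in order, whatever the grid shape
for ax, (name, values), color in zip(axs.flat, sales.items(), colors):
    ax.plot(months, values, marker='o', color=color)
    ax.set_title(f'{name} Sales')
    ax.set_xlabel('Month')
    ax.grid(True)

axs[0].set_ylabel('Sales')
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. A shared y-axis makes the
heights of the two lines directly comparable, which separate scales can hide.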

STYLING THE PLOTS

Styling enhances the readability and aesthetic appeal of visualizations.
Matplotlib offers a variety of customization options, including colors, markers,
grid styles, and labels. Below is an example of how to apply styles to enhance
our plots.

Example of Styling

We will modify our previous plots with specific styles to improve their
presentation:

# Applying styles
# Matplotlib 3.6+ names this style 'seaborn-v0_8-darkgrid';
# older releases call it 'seaborn-darkgrid'
plt.style.use('seaborn-v0_8-darkgrid')

# Creating subplots with styles
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('Styled Sales Data Visualization', fontsize=16)

# Line plot for Product A
axs[0, 0].plot(df['Month'], df['Sales_A'], marker='o',
               color='dodgerblue', linewidth=2, label='Product A')
axs[0, 0].set_title('Product A Sales')
axs[0, 0].set_xlabel('Month')
axs[0, 0].set_ylabel('Sales')
axs[0, 0].grid(True)

# Line plot for Product B
axs[0, 1].plot(df['Month'], df['Sales_B'], marker='s',
               color='coral', linewidth=2, label='Product B')
axs[0, 1].set_title('Product B Sales')
axs[0, 1].set_xlabel('Month')
axs[0, 1].set_ylabel('Sales')
axs[0, 1].grid(True)

# Stacked bar plot for comparison
axs[1, 0].bar(df['Month'], df['Sales_A'], width=0.4,
              label='Product A', color='dodgerblue', alpha=0.7)
axs[1, 0].bar(df['Month'], df['Sales_B'], width=0.4,
              label='Product B', color='coral', alpha=0.7,
              bottom=df['Sales_A'])
axs[1, 0].set_title('Sales Comparison')
axs[1, 0].set_ylabel('Total Sales')
axs[1, 0].legend()

# The fourth panel is unused, so hide its empty axes
axs[1, 1].axis('off')

# Adjusting layout
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
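
Style names differ between Matplotlib releases, so it can help to check which
ones are installed before applying one. The sketch below is an illustration
under that assumption: the candidate list is hypothetical, plt.style.available
lists the installed style names, and plt.style.context() applies a style only
inside the with-block instead of changing it globally.

```python
import matplotlib.pyplot as plt

# Candidate style names, newest naming first; 'default' is the fallback
candidates = ['seaborn-v0_8-darkgrid', 'seaborn-darkgrid', 'ggplot']
chosen = next((s for s in candidates if s in plt.style.available), 'default')

# The style is active only inside this block
with plt.style.context(chosen):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([200, 300, 250, 400, 350, 450], marker='o', linewidth=2)
    ax.set_title(f'Style: {chosen}')
    ax.set_ylabel('Sales')
```

Scoping the style this way leaves later figures in the same script unaffected,
which plt.style.use() does not.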

CONCLUSION

In this program, we covered advanced data visualization techniques using
Matplotlib, focusing on the creation of subplots and the application of styling
to enhance visual appeal. These skills empower data analysts to effectively
communicate insights and trends through visual representation, making their
findings more accessible to diverse audiences.

PROGRAM 10: CORRELATION HEATMAP


Visualizing relationships between variables in a dataset is crucial for
understanding how they interact with one another. One effective way to do
this is by creating a correlation heatmap. In this program, we will utilize the
Seaborn and Matplotlib libraries in Python to generate a heatmap from a
correlation matrix, providing insights into data relationships.

SAMPLE DATAFRAME CREATION

We will start by creating a sample dataset representing various features of a
group of individuals. For this example, let’s assume our dataset consists of
attributes such as age, height, weight, and income.

import pandas as pd

# Creating a sample DataFrame
data = {
    'Age': [25, 30, 35, 40, 45, 50],
    'Height': [165, 170, 175, 180, 185, 190],
    'Weight': [55, 65, 75, 85, 95, 105],
    'Income': [30000, 40000, 50000, 60000, 70000, 80000]
}

df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)

CALCULATING THE CORRELATION MATRIX

Next, we will calculate the correlation matrix using the corr() method from
Pandas. This matrix will reveal how strongly the variables are related to each
other.

# Calculating the correlation matrix
correlation_matrix = df.corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)
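
By default, corr() computes the Pearson correlation coefficient for every pair
of columns. To make that number less of a black box, the following sketch
computes it by hand for one pair (Age and Income) using only the standard
library. The helper name pearson is an assumption for illustration, not a
Pandas function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

age = [25, 30, 35, 40, 45, 50]
income = [30000, 40000, 50000, 60000, 70000, 80000]

r = pearson(age, income)
print(round(r, 6))  # 1.0
```

The result is 1.0 because income in the sample rises by exactly 10000 for every
5 years of age, which is a perfect linear relationship.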

GENERATING THE HEATMAP

Now, we will use Seaborn to create a heatmap from the correlation matrix.
The heatmap() function allows for a visually appealing representation of
the correlation values, making it easier to identify relationships.

import seaborn as sns
import matplotlib.pyplot as plt

# Setting the style of the visualization
sns.set(style='white')

# Creating the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f",
            cmap='coolwarm', cbar=True, square=True)
plt.title('Correlation Heatmap')
plt.show()

INTERPRETING THE HEATMAP

Once the heatmap is generated, each cell in the matrix shows the correlation
coefficient between two variables. A value close to 1 indicates a strong
positive correlation, while a value close to -1 indicates a strong negative
correlation. Values around 0 suggest no linear correlation.

This visualization allows analysts to quickly identify which features are
positively or negatively correlated, guiding further analysis and
decision-making. In our sample data every column increases by a constant step
from row to row, so every pair of variables is perfectly linearly related and
every cell of the heatmap reads 1.00. Age and income, for instance, are
positively correlated: older individuals in the sample have higher incomes.
Real datasets produce a mix of values, and a strong correlation on its own does
not show that one variable causes the other.

CONCLUSION

In this program, we demonstrated how to generate a correlation heatmap
using Seaborn and Matplotlib. By visualizing the correlation matrix, we can
gain valuable insights into the relationships among different variables, aiding
in data-driven decision-making processes. The ability to create such
visualizations enhances the effectiveness of data analysis and presentation.

INTRODUCTION TO MYSQL
MySQL is an open-source relational database management system (RDBMS)
that uses Structured Query Language (SQL) for managing and manipulating
data. Originally developed by MySQL AB and now maintained by Oracle
Corporation, MySQL is widely recognized for its reliability, flexibility, and
ease of use, making it a preferred choice for both small and large-scale
applications. MySQL runs on a wide variety of platforms and can handle large
databases, which is essential for applications that require efficient data
storage and retrieval.

MySQL is commonly used in web applications, data warehousing, e-commerce
platforms, and logging applications. Its ability to handle complex
queries and transactions while ensuring data integrity makes it suitable for
applications that need to maintain consistency, such as banking systems or
online retail platforms. Additionally, MySQL is often employed in conjunction
with PHP and JavaScript for server-side programming, enabling developers to
create dynamic web applications that interact with databases.

One of the significant advantages of MySQL is its ability to integrate
seamlessly with Python through various connectors. The
mysql-connector-python library, for example, allows Python developers
to easily connect to MySQL databases, execute queries, and retrieve results
directly within their Python scripts. This integration is powerful for data
analysis, as it enables users to extract data from MySQL databases for
processing with data manipulation libraries like Pandas or for visualization
with libraries like Matplotlib.
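
As a rough illustration of that workflow, the sketch below wraps a query in a
small function using mysql-connector-python. Everything specific in it is an
assumption: the function name fetch_employees, the credentials, and an
employees table with Name, Salary, and Department columns are placeholders, and
the function only works against a running MySQL server that matches them.

```python
def fetch_employees(department, host='localhost', user='analyst',
                    password='change-me', database='data_analysis_db'):
    """Return the employees of one department as a list of dicts."""
    # Imported here so this module still loads where the driver is absent
    import mysql.connector

    # Placeholder credentials; replace with values for your own server
    conn = mysql.connector.connect(host=host, user=user,
                                   password=password, database=database)
    try:
        # dictionary=True returns each row as {column_name: value}
        cursor = conn.cursor(dictionary=True)
        # %s is the driver's placeholder; the value is passed separately
        cursor.execute(
            'SELECT Name, Salary FROM employees WHERE Department = %s',
            (department,))
        return cursor.fetchall()
    finally:
        # Always release the connection, even if the query fails
        conn.close()
```

A call such as rows = fetch_employees('IT') returns a list of dictionaries,
which pd.DataFrame(rows) turns into a DataFrame for further analysis.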

By utilizing connectors, Python scripts can perform CRUD (Create, Read,
Update, Delete) operations on MySQL databases, facilitating dynamic
data-driven applications. This integration empowers data analysts and
developers to leverage the strengths of both Python and MySQL, enabling them to
build scalable and efficient data solutions that can handle large datasets and
complex queries.

QUERY 1: CREATING A DATABASE


Creating a database in MySQL is a straightforward process that can be
accomplished using a simple SQL statement. The CREATE DATABASE
statement is used to create a new database, which will serve as a container
for your tables and data. Below is an example of how to create a database
named data_analysis_db .

SQL STATEMENT

CREATE DATABASE data_analysis_db;

EXPECTED OUTPUT

To confirm that the database has been created successfully, you can use the
MySQL command line or a graphical interface like MySQL Workbench. After
executing the above command, you can check for the existence of the new
database by running the following command:

SHOW DATABASES;

The expected output should include the newly created database along with
any other existing databases. It will look something like this:

+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| data_analysis_db   |
+--------------------+

This output confirms the successful creation of the data_analysis_db
database, which is now ready to store tables and data for your data analysis
tasks.

IMPORTANT CONSIDERATIONS

When creating a database, ensure that the name you choose follows the
naming conventions and does not conflict with any existing databases.
Additionally, you should have the necessary privileges to create databases in
your MySQL server. If you encounter any errors during the creation process,
check your permissions or syntax to resolve the issue.
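
When the database name comes from a script rather than being typed by hand, it
is worth checking it before building the statement. The sketch below is one
possible approach, not a MySQL feature: the helper create_database_sql is
hypothetical, and the pattern is a deliberately conservative subset of MySQL's
identifier rules (letters, digits, and underscores, no leading digit, at most
64 characters). IF NOT EXISTS avoids an error when the database already exists.

```python
import re

# Conservative subset of MySQL identifier rules (an assumption of this sketch)
_VALID_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9_]{0,63}')

def create_database_sql(name):
    """Build a CREATE DATABASE statement for a validated name."""
    if not _VALID_NAME.fullmatch(name):
        raise ValueError(f'unsafe database name: {name!r}')
    # Backticks quote the identifier; IF NOT EXISTS avoids a duplicate error
    return f'CREATE DATABASE IF NOT EXISTS `{name}`;'

print(create_database_sql('data_analysis_db'))
```

Identifiers cannot be passed as query parameters the way values can, which is
why the name is validated and then placed into the statement text.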

QUERY 2: CREATING A TABLE


After successfully creating a database in MySQL, the next step is to create
tables to store structured data. Each table consists of rows and columns,
where each column represents a specific attribute of the data stored in the
table. In this section, we will write SQL code to create a table within the
previously created database, data_analysis_db , along with sample
output to illustrate the result.

SQL STATEMENT FOR CREATING A TABLE

In this example, we will create a table named employees , which will store
information about employees in an organization. The table will include the
following columns: EmployeeID , Name , Age , Department , and
Salary . Here’s the SQL statement to create this table:

USE data_analysis_db;

CREATE TABLE employees (
    EmployeeID INT PRIMARY KEY,
    Name VARCHAR(100),
    Age INT,
    Department VARCHAR(50),
    Salary DECIMAL(10, 2)
);

EXPLANATION OF THE SQL STATEMENT

• USE data_analysis_db; sets the context to the previously created
  database, ensuring that the new table is created within this database.
• CREATE TABLE employees (...); defines a new table named employees.
• Inside the parentheses, we specify the columns with their respective
  data types:
    ◦ EmployeeID INT PRIMARY KEY: An integer column that uniquely
      identifies each employee. As the primary key it cannot be NULL.
    ◦ Name VARCHAR(100): A variable-length character column to store the
      employee's name, allowing up to 100 characters.
    ◦ Age INT: An integer column representing the employee's age.
    ◦ Department VARCHAR(50): A variable-length character column for the
      employee's department, allowing up to 50 characters.
    ◦ Salary DECIMAL(10, 2): A decimal column for the employee's salary,
      allowing up to 10 digits in total, 2 of which follow the decimal
      point.

EXPECTED OUTPUT

To confirm that the employees table has been created successfully, you can
run the following command to display the structure of the table:

DESCRIBE employees;

The expected output should look like this (recent MySQL 8.0 releases print
int rather than int(11)):

+------------+---------------+------+-----+---------+-------+
| Field      | Type          | Null | Key | Default | Extra |
+------------+---------------+------+-----+---------+-------+
| EmployeeID | int(11)       | NO   | PRI | NULL    |       |
| Name       | varchar(100)  | YES  |     | NULL    |       |
| Age        | int(11)       | YES  |     | NULL    |       |
| Department | varchar(50)   | YES  |     | NULL    |       |
| Salary     | decimal(10,2) | YES  |     | NULL    |       |
+------------+---------------+------+-----+---------+-------+

This output confirms the successful creation of the employees table with
the specified columns and their data types, making it ready for data insertion
and future queries.

Creating tables is a fundamental step in structuring your database effectively,
and understanding how to define tables with appropriate data types is
essential for efficient data management.

QUERY 3: INSERTING DATA


Inserting data into a table is a fundamental operation in SQL that allows you
to populate your database with relevant information. After creating a table,
the next step is to insert records into it using the INSERT INTO statement.
In this section, we will demonstrate how to insert data into the employees
table we created earlier in the data_analysis_db database.

SQL STATEMENT TO INSERT DATA

We will insert multiple records into the employees table. Here is an
example SQL statement that accomplishes this:

INSERT INTO employees (EmployeeID, Name, Age, Department, Salary)
VALUES
    (101, 'Alice', 28, 'HR', 50000.00),
    (102, 'Bob', 34, 'IT', 60000.00),
    (103, 'Charlie', 29, 'Finance', 55000.00),
    (104, 'David', 45, 'IT', 70000.00);

EXPLANATION OF THE SQL STATEMENT

• INSERT INTO employees (EmployeeID, Name, Age, Department, Salary):
  This part specifies the table into which we want to insert data and
  lists the columns that will receive the values.
• VALUES: This keyword is followed by a list of tuples, each representing
  a record to be inserted into the table. Each tuple contains values
  corresponding to the specified columns.

EXPECTED OUTPUT

To verify that the data has been inserted successfully, you can use the
SELECT statement to query the employees table and view the records:

SELECT * FROM employees;

The expected output should look like this:

+------------+---------+-----+------------+----------+
| EmployeeID | Name    | Age | Department | Salary   |
+------------+---------+-----+------------+----------+
|        101 | Alice   |  28 | HR         | 50000.00 |
|        102 | Bob     |  34 | IT         | 60000.00 |
|        103 | Charlie |  29 | Finance    | 55000.00 |
|        104 | David   |  45 | IT         | 70000.00 |
+------------+---------+-----+------------+----------+

This output confirms that the records have been successfully inserted into the
employees table. Each row corresponds to an individual employee, with
their respective attributes accurately represented in the table.

IMPORTANT CONSIDERATIONS

When inserting data, ensure that you adhere to the constraints defined in the
table schema. For example, EmployeeID must be unique because it is the
primary key. Additionally, make sure that the data types of the values being
inserted match those defined for each column. If there are any violations of
these constraints, MySQL will return an error indicating the issue.
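
The primary-key rule can be seen in action from Python. The sketch below uses
the standard-library sqlite3 module as a stand-in so that it runs without a
MySQL server; this is an assumption made for portability, and with a MySQL
driver the placeholders would be %s instead of ? and the exception class would
differ. The extra row for 'Eve' is hypothetical and exists only to trigger the
error.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL connection
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE employees (
        EmployeeID INT PRIMARY KEY,
        Name VARCHAR(100),
        Age INT,
        Department VARCHAR(50),
        Salary DECIMAL(10, 2)
    )
""")

rows = [
    (101, 'Alice', 28, 'HR', 50000.00),
    (102, 'Bob', 34, 'IT', 60000.00),
    (103, 'Charlie', 29, 'Finance', 55000.00),
    (104, 'David', 45, 'IT', 70000.00),
]
# Placeholders keep the values out of the SQL text
conn.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', rows)
conn.commit()

# Reusing EmployeeID 101 violates the primary-key uniqueness constraint
try:
    conn.execute(
        "INSERT INTO employees VALUES (101, 'Eve', 31, 'IT', 52000.00)")
    duplicate_rejected = False
except sqlite3.IntegrityError as exc:
    duplicate_rejected = True
    print('Rejected:', exc)

# The table still holds only the four original rows
count = conn.execute('SELECT COUNT(*) FROM employees').fetchone()[0]
print(count)
```

Catching the integrity error lets a loading script report the offending row
and continue, instead of stopping at the first duplicate.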

Inserting data correctly is crucial for maintaining the integrity and usability of
your database, as it allows for accurate data retrieval and analysis in the
future.

QUERY 4: SELECTING DATA

Selecting data from a table is a fundamental operation in SQL that allows you
to retrieve information stored within a database. The SELECT statement is
used to query the database and fetch specific data from one or more tables.
In this section, we will present a SQL query to select all data from the
employees table that we created earlier in the data_analysis_db
database, along with the expected output.

SQL STATEMENT

To select all data from the employees table, you can use the following SQL
query:

SELECT * FROM employees;

EXPLANATION OF THE SQL STATEMENT

• SELECT *: This part of the statement indicates that you want to
  retrieve all columns from the specified table. The asterisk (*) is a
  wildcard that stands for "all columns."
• FROM employees;: This specifies the table from which the data will be
  selected.

EXPECTED OUTPUT

When you execute the above SQL statement, it will return all records stored in
the employees table. The expected output should look like this:

+------------+---------+-----+------------+----------+
| EmployeeID | Name    | Age | Department | Salary   |
+------------+---------+-----+------------+----------+
|        101 | Alice   |  28 | HR         | 50000.00 |
|        102 | Bob     |  34 | IT         | 60000.00 |
|        103 | Charlie |  29 | Finance    | 55000.00 |
|        104 | David   |  45 | IT         | 70000.00 |
+------------+---------+-----+------------+----------+

This output displays all the rows and columns present in the employees
table, providing a comprehensive view of the stored employee data. Each row
corresponds to an employee, with their respective attributes such as
EmployeeID, Name, Age, Department, and Salary.

ADDITIONAL NOTES

Using the SELECT statement is a powerful way to retrieve data for analysis,
reporting, or application use. You can modify this query to include specific
columns by listing them instead of using the asterisk, and you can also apply
conditions using the WHERE clause to filter the results based on specific
criteria. For example, if you wanted to select only employees in the IT
department, you could use the following query:

SELECT * FROM employees WHERE Department = 'IT';

The flexibility of the SELECT statement makes it an essential tool for
interacting with relational databases.

QUERY 5: FILTERING DATA


Filtering data in SQL allows you to retrieve specific records that meet certain
criteria. The SELECT statement can be combined with the WHERE clause to
filter results based on one or more conditions. In this section, we will write a
SQL query to filter records from the employees table based on specific
criteria, along with the expected output.

SQL STATEMENT

For this example, let's filter the employees who belong to the 'IT' department
and have a salary greater than $60,000. The SQL query would look like this:

SELECT * FROM employees
WHERE Department = 'IT' AND Salary > 60000;

EXPLANATION OF THE SQL STATEMENT

• SELECT *: This part of the statement indicates that you want to
  retrieve all columns from the specified table.
• FROM employees: This specifies the table from which to select the
  data.
• WHERE Department = 'IT' AND Salary > 60000: This part applies the
  filtering criteria. It retrieves only those records where the
  Department is 'IT' and the Salary is greater than $60,000. The AND
  operator ensures that both conditions must be true for a record to
  be included in the results.

EXPECTED OUTPUT

When you execute the above SQL statement, the expected output should
return records that meet the specified conditions. Assuming the following
data in the employees table:

+------------+---------+-----+------------+----------+
| EmployeeID | Name    | Age | Department | Salary   |
+------------+---------+-----+------------+----------+
|        101 | Alice   |  28 | HR         | 50000.00 |
|        102 | Bob     |  34 | IT         | 60000.00 |
|        103 | Charlie |  29 | Finance    | 55000.00 |
|        104 | David   |  45 | IT         | 70000.00 |
+------------+---------+-----+------------+----------+

The output for the filtering query would be:

+------------+-------+-----+------------+----------+
| EmployeeID | Name  | Age | Department | Salary   |
+------------+-------+-----+------------+----------+
|        104 | David |  45 | IT         | 70000.00 |
+------------+-------+-----+------------+----------+

This output indicates that only one employee, David, meets the criteria of
being in the 'IT' department with a salary higher than $60,000.

IMPORTANT CONSIDERATIONS

When filtering data, you can use various operators, such as = , > , < , >= ,
<= , and <> (not equal) to define your conditions. Additionally, you can
combine multiple conditions using logical operators like AND , OR , and
NOT to create more complex filters. Properly filtering your data is crucial for
making informed decisions based on specific subsets of your dataset,
allowing you to focus on relevant records.
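
When the filter values come from a program rather than being written into the
query, they should be passed as parameters. The sketch below shows the idea
with the standard-library sqlite3 module standing in for MySQL, an assumption
made so the example runs anywhere; a MySQL driver would use %s placeholders.
The function name find_employees is illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL connection
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employees (EmployeeID INT PRIMARY KEY, '
             'Name TEXT, Age INT, Department TEXT, Salary REAL)')
conn.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', [
    (101, 'Alice', 28, 'HR', 50000.00),
    (102, 'Bob', 34, 'IT', 60000.00),
    (103, 'Charlie', 29, 'Finance', 55000.00),
    (104, 'David', 45, 'IT', 70000.00),
])

def find_employees(conn, department, min_salary):
    """Names and salaries in a department earning more than min_salary."""
    query = ('SELECT Name, Salary FROM employees '
             'WHERE Department = ? AND Salary > ?')
    # The driver substitutes the values, so they are never parsed as SQL
    return conn.execute(query, (department, min_salary)).fetchall()

print(find_employees(conn, 'IT', 60000))  # [('David', 70000.0)]
```

Bob is excluded because his salary equals 60000 and the condition uses a
strict greater-than comparison.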

ADDITIONAL QUERIES 6-20


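In this section, we will explore additional SQL queries that cover various
operations such as updating records, deleting records, joining tables, using
aggregate functions, and implementing complex filtering. Each query includes
the SQL code along with an explanation of the expected output. Unless stated
otherwise, the expected outputs assume that the employees table still holds
the four rows inserted in Query 3. The UPDATE and DELETE examples change the
data, so re-create those rows before running later queries if you want
matching results.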

QUERY 6: UPDATING RECORDS

To update existing records in a table, the UPDATE statement is used. For
example, if we want to increase the salary of all employees in the 'IT'
department by 10%, the SQL statement would look like this:

UPDATE employees
SET Salary = Salary * 1.10
WHERE Department = 'IT';

Expected Output:
After executing this command, the salaries of the two employees in the 'IT'
department will be updated: Bob's salary rises from $60,000 to $66,000 and
David's from $70,000 to $77,000. Rows in other departments are unchanged.
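
From Python, an update is usually run inside a transaction so that it is
either applied completely or not at all. The sketch below is a minimal
illustration that uses the standard-library sqlite3 module as a stand-in for
MySQL (an assumption so the example runs without a server) and a reduced
three-column version of the employees table. With a MySQL driver you would
typically call conn.commit() explicitly after the update.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL connection
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE employees (Name TEXT, Department TEXT, Salary REAL)')
conn.executemany('INSERT INTO employees VALUES (?, ?, ?)', [
    ('Alice', 'HR', 50000.0), ('Bob', 'IT', 60000.0),
    ('Charlie', 'Finance', 55000.0), ('David', 'IT', 70000.0),
])
conn.commit()

# 'with conn' commits on success and rolls back if an exception is raised
with conn:
    cursor = conn.execute(
        'UPDATE employees SET Salary = Salary * 1.10 WHERE Department = ?',
        ('IT',))
    updated = cursor.rowcount  # number of rows the UPDATE changed

# Read the salaries back, rounded to two decimal places
salaries = dict(conn.execute('SELECT Name, ROUND(Salary, 2) FROM employees'))
print(updated, salaries['Bob'], salaries['David'])
```

Checking the affected-row count is a cheap safeguard: a value that is much
larger than expected usually means the WHERE clause was wrong or missing.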

QUERY 7: DELETING RECORDS

To delete records from a table, the DELETE statement is utilized. If we want
to remove an employee named Alice from the employees table, the SQL
statement would be:

DELETE FROM employees
WHERE Name = 'Alice';

Expected Output:
After executing this command, Alice’s record will be removed from the
employees table, and a subsequent SELECT * FROM employees; will
show only Bob, Charlie, and David.

QUERY 8: JOINING TABLES

Joining tables allows you to combine rows from two or more tables based on
a related column. For example, if we have another table named
departments that contains department details, we can join it with the
employees table:

CREATE TABLE departments (
    DepartmentID INT PRIMARY KEY,
    DepartmentName VARCHAR(50)
);

INSERT INTO departments (DepartmentID, DepartmentName)
VALUES
    (1, 'HR'),
    (2, 'IT'),
    (3, 'Finance');

SELECT e.Name, e.Salary, d.DepartmentName
FROM employees e
JOIN departments d ON e.Department = d.DepartmentName;

Expected Output:
This query will return a list of employee names, their salaries, and the names
of their departments. For the sample data that is one row per employee: Alice
(HR), Bob (IT), Charlie (Finance), and David (IT).
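
Once both tables are loaded into Pandas, the same inner join can be expressed
with DataFrame.merge. The sketch below rebuilds the sample rows in memory
instead of reading them from MySQL, which is an assumption made to keep it
self-contained.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'Finance', 'IT'],
    'Salary': [50000.0, 60000.0, 55000.0, 70000.0],
})
departments = pd.DataFrame({
    'DepartmentID': [1, 2, 3],
    'DepartmentName': ['HR', 'IT', 'Finance'],
})

# how='inner' keeps only rows with a match in both frames, like SQL JOIN
joined = employees.merge(departments, how='inner',
                         left_on='Department', right_on='DepartmentName')
print(joined[['Name', 'Salary', 'DepartmentName']])
```

left_on and right_on play the role of the ON clause, naming the matching
column in each frame.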

QUERY 9: USING AGGREGATE FUNCTIONS (COUNT)

Aggregate functions allow you to perform calculations on a set of values. To
count the number of employees in each department, you can use the
COUNT() function:

SELECT Department, COUNT(*) AS NumberOfEmployees
FROM employees
GROUP BY Department;

Expected Output:
This query will return the number of employees in each department, like so:

+------------+-------------------+
| Department | NumberOfEmployees |
+------------+-------------------+
| HR         |                 1 |
| IT         |                 2 |
| Finance    |                 1 |
+------------+-------------------+

QUERY 10: USING AGGREGATE FUNCTIONS (AVG)

To find the average salary of employees in each department, the AVG()
function can be used:

SELECT Department, AVG(Salary) AS AverageSalary
FROM employees
GROUP BY Department;

Expected Output:
This will return the average salary for employees in each department. For the
sample data the averages are 50,000 for HR, 65,000 for IT, and 55,000 for
Finance.
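
The Pandas counterpart of GROUP BY with COUNT() and AVG() is groupby followed
by agg. The sketch below recreates the sample rows in a DataFrame rather than
querying MySQL, which is an assumption made to keep it self-contained. The
output column names are chosen to match the SQL aliases.

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'Finance', 'IT'],
    'Salary': [50000.0, 60000.0, 55000.0, 70000.0],
})

# One row per department, with a count and a mean of the Salary column
summary = (employees.groupby('Department')['Salary']
           .agg(NumberOfEmployees='count', AverageSalary='mean'))
print(summary)
```

Doing the aggregation in SQL moves less data out of the database, while doing
it in Pandas is convenient when the rows are already in memory.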

QUERY 11: COMPLEX FILTERING

To filter employees based on multiple criteria, such as those older than 30
years and earning more than $50,000, you can use the following SQL query:

SELECT * FROM employees
WHERE Age > 30 AND Salary > 50000;

Expected Output:
This will return only the employees who meet both criteria. For the sample
data those are Bob (34, $60,000) and David (45, $70,000).

QUERY 12: USING LIKE FOR PATTERN MATCHING

To find employees whose names start with the letter 'D', you can utilize the
LIKE operator:

SELECT * FROM employees
WHERE Name LIKE 'D%';

Expected Output:
This will return David's record, as he is the only employee whose name starts
with 'D'.

QUERY 13: USING IN FOR MULTIPLE VALUES

To filter employees who work in either 'HR' or 'Finance', you can use the IN
operator:

SELECT * FROM employees
WHERE Department IN ('HR', 'Finance');

Expected Output:
This will return records for Alice and Charlie.

QUERY 14: USING ORDER BY

To sort the employees by their salary in descending order, you can use the
ORDER BY clause:

SELECT * FROM employees
ORDER BY Salary DESC;

Expected Output:
This will return the list of employees sorted by salary from highest to lowest:
David, Bob, Charlie, and then Alice for the sample data.

QUERY 15: USING HAVING

The HAVING clause is used to filter results after aggregation. For example, to
find departments with more than one employee, you would write:

SELECT Department, COUNT(*) AS NumberOfEmployees
FROM employees
GROUP BY Department
HAVING COUNT(*) > 1;

Expected Output:
This will show any department that has more than one employee. For the sample
data only IT qualifies, with a count of 2.

QUERY 16: USING SUBQUERIES

Subqueries allow you to nest queries. For example, to find employees whose
salary is above the average salary of the entire table:

SELECT * FROM employees
WHERE Salary > (SELECT AVG(Salary) FROM employees);

Expected Output:
This will return records of employees earning more than the average salary.
The sample average is $58,750, so the result contains Bob and David.

QUERY 17: USING UNION

To combine results from two different queries, you can use the UNION
operator. For example, to select employees from the 'IT' department and
employees with a salary greater than $60,000:

SELECT Name FROM employees WHERE Department = 'IT'
UNION
SELECT Name FROM employees WHERE Salary > 60000;

Expected Output:
This will return a unique list of names from both queries. UNION removes
duplicate rows, so David, who satisfies both conditions, appears only once
alongside Bob.

QUERY 18: USING CASE FOR CONDITIONAL LOGIC

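To create a derived column that categorizes employees based on their
salaries, you can use the CASE statement:

SELECT Name, Salary,
    CASE
        WHEN Salary < 60000 THEN 'Below Average'
        WHEN Salary BETWEEN 60000 AND 70000 THEN 'Average'
        ELSE 'Above Average'
    END AS SalaryCategory
FROM employees;

Expected Output:
This will categorize employees based on their salary levels. The labels follow
the fixed thresholds in the query rather than a computed average. For the
sample data, Alice and Charlie are 'Below Average' and Bob and David are
'Average', because BETWEEN includes both endpoints.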
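
The same branching logic can be written as a small Python function, which is
handy when the categorization has to be applied after the data has been
fetched. This is an illustrative equivalent, not something MySQL runs; the
function name salary_category is an assumption.

```python
def salary_category(salary):
    """Mirror the CASE expression: the first matching branch wins."""
    if salary < 60000:
        return 'Below Average'
    if 60000 <= salary <= 70000:  # BETWEEN includes both endpoints
        return 'Average'
    return 'Above Average'

salaries = {'Alice': 50000, 'Bob': 60000, 'Charlie': 55000, 'David': 70000}
categories = {name: salary_category(s) for name, s in salaries.items()}
print(categories)
```

As in SQL, the order of the branches matters: a salary is tested against each
condition in turn and takes the label of the first one that holds.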

QUERY 19: USING GROUP_CONCAT

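To list employee names in each department as a single string, you can use
GROUP_CONCAT:

SELECT Department, GROUP_CONCAT(Name) AS Employees
FROM employees
GROUP BY Department;

Expected Output:
This will return departments with a concatenated list of employee names,
separated by commas by default. For the sample data, IT lists both Bob and
David. The order of names within a group is not guaranteed unless an ORDER BY
is added inside GROUP_CONCAT.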

QUERY 20: DROPPING A TABLE

To remove a table from the database entirely, you can use the DROP TABLE
statement. For example, to delete the employees table:

DROP TABLE employees;

Expected Output:
After executing this command, the employees table will be permanently
removed from the database.

These queries provide a comprehensive overview of common SQL operations
that are essential for data manipulation and analysis in MySQL.

CONCLUSION
Learning Python for data analysis and SQL for database management equips
students with essential skills that are increasingly vital in today's data-driven
landscape. Python, with its versatile libraries like Pandas and Matplotlib,
allows for efficient data manipulation, analysis, and visualization. These
capabilities enable students to transform raw data into actionable insights,
fostering a deeper understanding of complex datasets. As they become
proficient in Python, students gain the ability to automate tasks, perform
statistical analyses, and create compelling visual narratives that can influence
decision-making processes across various domains.

On the other hand, SQL provides the foundational knowledge necessary for
managing and querying relational databases. Understanding how to
effectively use SQL empowers students to interact with large datasets,
ensuring data integrity while performing operations such as data retrieval,
insertion, updates, and deletions. This skill is particularly beneficial for
students aspiring to work in roles that require database management, such
as data analysts, data scientists, and software developers. Furthermore, the
ability to extract relevant information from databases using SQL enhances the
overall data analysis process, bridging the gap between data storage and
actionable insights.

Together, proficiency in Python and SQL not only prepares students for a
variety of career opportunities but also cultivates critical thinking and
problem-solving skills. As they navigate the complexities of data analysis and
database management, students develop a toolkit that is essential for
contributing to data-driven decision-making in organizations. This
combination of skills positions them as valuable assets in an increasingly
competitive job market, where the demand for data literacy continues to
grow.
