EX.
NO – 1
DATE:
DOWNLOAD, INSTALL AND EXPLORE THE FEATURES OF NUMPY, SCIPY,
JUPYTER, STATSMODELS AND PANDAS PACKAGES
AIM:
To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
PROCEDURE:
Jupyter Notebook is an interactive browser-based platform for scientific computing and is widely
used in data science. In addition to providing an interactive coding platform, Jupyter Notebook
supports both code and text cells. The text cells allow Markdown formatting, so plain text, images
and LaTeX math equations can be used to explain a project's workflow.
For example, the following image shows how both Markdown and code can be written by specifying
the cell type.
Markdown and Code Cells in Jupyter Notebook
To run a cell, press the Run [▶] button or press Shift + Enter. The headings and images are
rendered after the cells are run.
Jupyter Notebook Cells
Installation Using the Anaconda Distribution
It’s recommended to use the Anaconda distribution of Python. In addition to Python, it comes
with several useful data science packages pre-installed. The installation also includes Jupyter
tools like Jupyter Notebook and JupyterLab.
The installation involves the following steps.
Step 1: Head over to the official website of Anaconda, navigate
to anaconda.com/products/individual and download the installer corresponding to your
operating system.
Installing the Anaconda Distribution
Step 2: Now, run the installer and follow the prompts on your screen to complete the installation.
The installation will typically take a few minutes.
Step 3: Once installation is completed, launch Anaconda Navigator. From the navigator, click
on the Launch option in the Jupyter Notebook tab, as shown below:
Launching Jupyter Notebook from Anaconda Navigator
Alternatively, the Jupyter Notebook shortcut created by the installer can be used to launch it, as illustrated below.
Jupyter Notebook can also be launched from the Anaconda Prompt by running the command jupyter notebook.
NUMPY:
▪ introduces objects for multidimensional arrays and matrices, as well as functions
that make it easy to perform advanced mathematical and statistical operations on
those objects
▪ provides vectorization of mathematical operations on arrays and matrices, which
significantly improves performance (see the sketch below)
▪ many other Python libraries are built on NumPy
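As a quick illustration of vectorization (a sketch added for illustration, not part of the prescribed exercise), the following compares an explicit Python loop with the equivalent vectorized NumPy expression; both compute the element-wise square of an array, but the vectorized form runs in optimized compiled code:
import numpy as np
a = np.arange(100_000)
# explicit Python loop: interpreted, one element at a time
squares_loop = np.array([x * x for x in a])
# vectorized operation: the whole array at once, in compiled code
squares_vec = a * a
print(np.array_equal(squares_loop, squares_vec))  # True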
SciPy:
▪ collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more
▪ part of SciPy Stack
▪ built on NumPy
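A minimal sketch of one such algorithm, solving a small linear system with scipy.linalg (the matrix and right-hand side are arbitrary illustrative values):
import numpy as np
from scipy import linalg
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
# solve the linear system A x = b
x = linalg.solve(A, b)
print(x)  # [2. 3.]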
Pandas:
▪ adds data structures and tools designed to work with table-like data (the Series
and DataFrame objects, similar to data frames in R)
▪ provides tools for data manipulation: reshaping, merging, sorting, slicing,
aggregation etc.
▪ allows handling missing data
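A small sketch of these tools on a toy DataFrame (the column names and values are made up for illustration):
import numpy as np
import pandas as pd
# table-like data with one missing value
df = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [85.0, np.nan, 72.0]})
print(df.dropna())                               # drop rows with missing data
print(df.fillna({'score': df['score'].mean()}))  # or fill them in
print(df.sort_values('score'))                   # sort rows by a column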
Statsmodels:
statsmodels is a Python package that provides a complement to SciPy for statistical computations,
including descriptive statistics and estimation and inference for statistical models.
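A minimal sketch of estimation and inference with statsmodels, fitting an ordinary least squares model on synthetic data (the data is made up for illustration):
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)
# add an intercept column and fit ordinary least squares
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.params)     # estimated intercept and slope
print(model.summary())  # full inference table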
RESULT:
Thus the NumPy, SciPy, Jupyter, Statsmodels and Pandas packages were downloaded, installed
and their features explored successfully.
EX. NO – 1.a
DATE:
PERFORM BASIC DATA ANALYTICS COMMUNICATION
PROCESS WITH NUMPY
AIM
To write a program to construct and perform the basic data analytics communication
process with NumPy and other packages.
PROCEDURE
STEP 1: Start
STEP 2: Import the NumPy package for the program.
STEP 3: Perform the array operations using the NumPy package.
STEP 4: Perform the array creation operations using the NumPy package.
STEP 5: Perform the array indexing operations using the NumPy package.
STEP 6: Perform the operations on a single array using the NumPy package.
STEP 7: Perform the unary operations using the NumPy package.
STEP 8: Perform the binary operations using the NumPy package.
STEP 9: Perform the universal functions (ufunc) using the NumPy package.
STEP 10: Perform the array sorting using the NumPy package.
STEP 11: End
SOURCE CODE
A) ARRAY IN NUMPY
import numpy as np
# Creating array object
arr = np.array( [[ 1, 2, 3], [ 4, 2, 5]] )
print(arr)
# Printing type of arr object
print("Array is of type: ", type(arr))
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
OUTPUT
B) ARRAY CREATION
# array creation techniques
import numpy as np
# Creating array from list with type float
a = np.array([[1, 2, 4], [5, 8, 7]], dtype = 'float')
print ("Array created using passed list:\n", a)
# Creating array from tuple
b = np.array((1 , 3, 2))
print ("\nArray created using passed tuple:\n", b)
# Creating a 3X4 array with all zeros
c = np.zeros((3, 4))
print ("\nAn array initialized with all zeros:\n", c)
# Create a constant value array of complex type
d = np.full((3, 3), 6, dtype = 'complex')
print ("\nAn array initialized with all 6s.""Array type is complex:\n", d)
# Create an array with random values
e = np.random.random((2, 2))
print ("\nA random array:\n", e)
# Create a sequence of integers
# from 0 to 30 with steps of 5
f = np.arange(0, 30, 5)
print ("\nA sequential array with steps of 5:\n", f)
# Create a sequence of 10 values in range 0 to 5
g = np.linspace(0, 5, 10)
print ("\nA sequential array with 10 values between""0 and 5:\n", g)
# Reshaping 3X4 array to 2X2X3 array
arr = np.array([[1, 2, 3, 4], [5, 2, 4, 2], [1, 2, 0, 1]])
newarr = arr.reshape(2, 2, 3)
print ("\nOriginal array:\n", arr)
print ("Reshaped array:\n", newarr)
# Flatten array
arr = np.array([[1, 2, 3], [4, 5, 6]])
flarr = arr.flatten()
print ("\nOriginal array:\n", arr)
print ("Fattened array:\n", flarr)
OUTPUT
C) ARRAY INDEXING
# indexing in numpy
import numpy as np
# An exemplar array
arr = np.array([[-1, 2, 0, 4], [4, -0.5, 6, 0], [2.6, 0, 7, 8], [3, -7, 4, 2.0]])
# Slicing array
temp = arr[:2, ::2]
print ("Array with first 2 rows and alternate columns(0 and 2):\n", temp)
# Integer array indexing example
temp = arr[[0, 1, 2, 3], [3, 2, 1, 0]]
print ("\nElements at indices (0, 3), (1, 2), (2, 1),""(3, 0):\n", temp)
# boolean array indexing example
cond = arr > 0
# cond is a boolean array
temp = arr[cond]
print ("\nElements greater than 0:\n", temp)
OUTPUT
D) OPERATIONS ON SINGLE ARRAY
# basic operations on single array
import numpy as np
a = np.array([1, 2, 5, 3])
# add 1 to every element
print ("Adding 1 to every element:", a+1)
# subtract 3 from each element
print ("Subtracting 3 from each element:", a-3)
# multiply each element by 10
print ("Multiplying each element by 10:", a*10)
# square each element
print ("Squaring each element:", a**2)
# modify existing array
a *= 2
print ("Doubled each element of original array:", a)
# transpose of array
a = np.array([[1, 2, 3], [3, 4, 5], [9, 6, 0]])
print ("\nOriginal array:\n", a)
print ("Transpose of array:\n", a.T)
OUTPUT
E) UNARY OPERATORS
# unary operators in numpy
import numpy as np
arr = np.array([[1, 5, 6], [4, 7, 2], [3, 1, 9]])
# maximum element of array
print ("Largest element is:", arr.max())
print ("Row-wise maximum elements:", arr.max(axis = 1))
# minimum element of array
print ("Column-wise minimum elements:", arr.min(axis = 0))
# sum of array elements
print ("Sum of all array elements:", arr.sum())
# cumulative sum along each row
print ("Cumulative sum along each row:\n", arr.cumsum(axis = 1))
OUTPUT
F) BINARY OPERATORS
# binary operators in Numpy
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[4, 3], [2, 1]])
# add arrays
print ("Array sum:\n", a + b)
# multiply arrays (elementwise multiplication)
print ("Array multiplication:\n", a*b)
# matrix multiplication
print ("Matrix multiplication:\n", a.dot(b))
OUTPUT
G) UNIVERSAL FUNCTIONS (ufunc)
# universal functions in numpy
import numpy as np
# create an array of sine values
a = np.array([0, np.pi/2, np.pi])
print ("Sine values of array elements:", np.sin(a))
# exponential values
a = np.array([0, 1, 2, 3])
print ("Exponent of array elements:", np.exp(a))
# square root of array values
print ("Square root of array elements:", np.sqrt(a))
OUTPUT
H) SORTING ARRAY
# Python program to demonstrate sorting in numpy
import numpy as np
a = np.array([[1, 4, 2], [3, 4, 6], [0, -1, 5]])
# sorted array
print ("Array elements in sorted order:\n", np.sort(a, axis = None))
# sort array row-wise
print ("Row-wise sorted array:\n", np.sort(a, axis = 1))
# specify sort algorithm
print ("Column wise sort by applying merge- sort:\n", np.sort(a, axis = 0, kind =
'mergesort'))
# Example to show sorting of structured array
# set alias names for dtypes
dtypes = [('name', 'S10'), ('grad_year', int), ('cgpa', float)]
# Values to be put in array
values = [('Hrithik', 2009, 8.5), ('Ajay', 2008, 8.7), ('Pankaj', 2008, 7.9), ('Aakash', 2009, 9.0)]
# Creating array
arr = np.array(values, dtype = dtypes)
print ("\nArray sorted by names:\n", np.sort(arr, order = 'name'))
print ("Array sorted by graduation year and then cgpa:\n", np.sort(arr, order =
['grad_year', 'cgpa']))
OUTPUT
RESULT
Thus the basic data analytics communication process with NumPy was constructed and
performed successfully.
EX. NO – 1.b
DATE:
PERFORM INTEGRATION USING SCIPY SUB PACKAGE
Aim:
To write a program to construct and demonstrate integration
operations using SciPy subpackages.
PROCEDURE
STEP 1: Start
STEP 2: Import the SciPy package for the program.
STEP 3: Perform the integration operations.
STEP 4: Perform the single integration operation.
STEP 5: Perform the double integration operation.
STEP 6: End
SOURCE CODE:
Single Integration:
from scipy import integrate
# take f(x) function as f
f = lambda x : (x**2)/2
#single integration with a = 0 & b = 1
integration = integrate.quad(f, 0 , 1)
print(integration)
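Note: integrate.quad returns a tuple (value, estimated absolute error). For f(x) = x²/2 the exact
integral over [0, 1] is 1/6 ≈ 0.1667, so the printed value should be very close to that.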
OUTPUT:
Double Integration:
from scipy import integrate
import numpy as np
#import square root function from math lib
from math import sqrt
# set function f(x, y)
f = lambda x, y : 64 * x * y
# lower limit of the inner integral
p = lambda x : 0
# upper limit of the inner integral
q = lambda y : sqrt(1 - 2*y**2)
# perform double integration
integration = integrate.dblquad(f , 0 , 2/4, p, q)
print(integration)
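Note: integrate.dblquad(f, a, b, p, q) integrates f over a region where the outer variable runs
from a to b and the inner variable runs from p to q (the limit arguments may be functions of the
outer variable); the first argument of f is treated as the inner integration variable. The result is
again a (value, error estimate) tuple.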
OUTPUT:
RESULT:
Thus the integration operations using SciPy subpackages were constructed and demonstrated
successfully.
EX. NO – 1.c
DATE:
APPLY QUANTITATIVE TECHNIQUE USING
APPROPRIATE PACKAGES IN PYTHON
AIM:
To write a program to construct and demonstrate the quantitative
techniques using appropriate packages in Python.
PROCEDURE
STEP 1: Start
STEP 2: Import the pandas package file for our program.
STEP 3: Download the iris_csv.csv file and save it in the system.
STEP 4: Read the data (.csv) file
STEP 5: Perform the shape function
STEP 6: Perform the info() function
STEP 7: Perform the head() function
STEP 8: Perform the tail() function
STEP 9: Perform the mean() function
STEP 10: Perform the median() function
STEP 11: Perform the min() function
STEP 12: Perform the max() function
STEP 13: Perform the count() function
STEP 14: Perform the std() function
STEP 15: Perform the corr() function
STEP 16: Perform the describe() function
STEP 17: End
SOURCE CODE:
import numpy as np
import pandas as pd
df = pd.read_csv("C:/Users/mani4/Documents/Python
Scripts/iris_csv.csv")
# Prints number of rows and columns in
dataframe
df.shape
# Index, Datatype and Memory
information
df.info()
# Prints first n rows of the
DataFrame df.head()
# Prints last n rows of the
DataFrame
df.tail()
# Returns the mean of all
columns df.mean()
# Returns the median of each column
df.median()
# Returns the lowest value in each
column df.min()
# Returns the highest value in each column
df.max()
# Returns the number of non-null values in each DataFrame column
df.count()
# Returns the standard deviation of each
column df.std()
# Returns the correlation between columns in a DataFrame
df.corr()
# Summary statistics for numerical
columns df.describe()
RESULT:
Thus the quantitative techniques using appropriate packages in Python were constructed
and demonstrated successfully.
Ex2. Working with Numpy arrays
Aim:
Working with different instructions used in Numpy array
Procedure
✓ Perform the following functions:
✓ To print the array: print(a)
✓ To print the shape of the array: print(a.shape)
✓ ndim returns the number of dimensions of the array: print(a.ndim)
✓ The data type object (dtype) informs us about the layout of the array:
print(a.dtype.name)
✓ itemsize returns the size (in bytes) of each element of a NumPy array:
print(a.itemsize)
✓ size is the total number of elements in the ndarray: print(a.size)
Program
import numpy as np
a = np.arange(15).reshape(3, 5)
# To print the array
print(a)
# To print the shape of the array
print(a.shape)
# ndim returns the number of dimensions of the array
print(a.ndim)
# The data type object (dtype) informs us about the layout of the array
print(a.dtype.name)
# itemsize returns the size (in bytes) of each element
print(a.itemsize)
# size is the total number of elements in the ndarray
print(a.size)
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 1)
print(x)
# Slice elements from index 1 to index 5 (end index exclusive)
print(arr[1:5])
# Get the third and fourth elements of the array and add them
print(arr[2] + arr[3])
# Write a NumPy program to convert values between Centigrade and
# Fahrenheit degrees. Values are stored in a NumPy array.
import numpy as np
fvalues = [0, 12, 45.21, 34, 99.91, 32]
F = np.array(fvalues)
print("Values in Fahrenheit degrees:")
print(F)
print("Values in Centigrade degrees:")
print(np.round((5*F/9 - 5*32/9),2))
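The task also asks for the reverse conversion; a minimal sketch, reusing the Centigrade values computed above:
C = np.round(5 * F / 9 - 5 * 32 / 9, 2)  # Centigrade values from above
print("Back to Fahrenheit degrees:")
print(np.round(9 * C / 5 + 32, 2))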
Ex3 : Working with Pandas data frames
Aim
To work with Pandas data frames
Procedure
1. Create a dataset with name, city, age and py-score.
2. Use .head() to show the first few items and .tail() to show the last few
items.
3. Get the DataFrame’s row labels with .index and its column labels
with .columns
4. Get the data types for each column of a Pandas DataFrame with .dtypes
5. Check the amount of memory used by each column.
6. Use the accessor .loc[] to get rows or columns by their labels, and the
accessor .iloc[] to retrieve a row or column by its integer index.
7. Create a new Series object that represents a new candidate and append it
to the DataFrame.
Program
import pandas as pd
data = {
'name': ['Muthu', 'Anand', 'Ramkumar', 'Roja', 'Robin', 'Rajan', 'Joel'],
'city': ['Chennai', 'Madurai', 'Tirunelveli', 'Saidapet', 'Tambaram', 'Irukattukottai', 'Central Station'],
'age': [41, 28, 33, 34, 38, 31, 37],
'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
row_labels = [101, 102, 103, 104, 105, 106, 107]
# df is a variable that holds the reference to the Pandas DataFrame
df = pd.DataFrame(data=data, index=row_labels)
df
# We can use .head() to show the first few items and .tail() to show the last few items
df.head(n=2)
df.tail(n=2)
# You can get the DataFrame's row labels with .index and its column labels with .columns
df.index
df.columns
# We can get the data types for each column of a Pandas DataFrame with .dtypes
df.dtypes
# You can even check the amount of memory used by each column with .memory_usage()
df.memory_usage()
# In addition to the accessor .loc[], which you can use to get rows or columns by
# their labels, Pandas offers the accessor .iloc[], which retrieves a row or column
# by its integer index
df.loc[101]
df.iloc[0]
# We can start by creating a new Series object that represents the new candidate
john = pd.Series(data=['Jovan', 'Medavakkam', 34, 79], index=df.columns, name=17)
john
df = df.append(john)
df
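Note: DataFrame.append() was removed in pandas 2.0; on recent pandas versions the same row can be added with pd.concat, for example:
df = pd.concat([df, john.to_frame().T])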
Ex 4a. Reading data from text files
Aim
To write a Python program to read data from text files
Procedure
✓ Create a text file with some data and save it as filename.txt.
✓ Open the file using the open() function in the required access mode.
✓ Use functions such as read(), write() and readlines() as needed.
Program
# Program to show various ways to read and
# write data in a file.
file1 = open("myfile.txt","w")
L = ["This is Delhi \n","This is Paris \n","This is London \n"]
# \n is placed to indicate EOL (End of Line)
file1.write("Hello \n")
file1.writelines(L)
file1.close() #to change file access modes
file1 = open("myfile.txt","r+")
print("Output of Read function is ")
print(file1.read())
print()
# seek(n) takes the file handle to the nth
# byte from the beginning.
file1.seek(0)
print( "Output of Readline function is ")
print(file1.readline())
print()
file1.seek(0)
# To show difference between read and readline
print("Output of Read(9) function is ")
print(file1.read(9))
print()
file1.seek(0)
print("Output of Readline(9) function is ")
print(file1.readline(9))
file1.seek(0)
# readlines function
print("Output of Readlines function is ")
print(file1.readlines())
print()
file1.close()
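As a side note, a common idiom is to open files with a with statement, which closes the file automatically even if an error occurs; a minimal sketch equivalent to the read portion above:
with open("myfile.txt", "r") as file1:
    print(file1.read())
# file1 is closed automatically when the with block ends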
Ex 4b. Reading data from excel files and exploring various commands
Aim
To write a Python program to read data from Excel files
Procedure
• First of all, create an Excel file with some 10 records (10 rows) and 5
columns with numerical data and save it in filename.xlsx format.
• Reading data from excel files into pandas using Python.
• Exploring the data from excel files in Pandas.
• Using functions to manipulate and reshape the data in Pandas.
To view 5 rows from the top and from the bottom of the data, use head() and tail().
The shape attribute can be used to view the number of rows and columns.
If any column contains numerical data, we can sort that column using
the sort_values() method in pandas.
Suppose our data is mostly numerical. We can get the statistical information like
mean, max, min, etc. about the data frame using the describe() method.
Program
import pandas as pd
data = pd.read_excel (r'd:\mark.xlsx')
data
df = pd.DataFrame(data, columns= ['Name','CGPA'])
print (df)
data.head()
data.tail()
data.shape
sorted_column = data.sort_values(['Name'], ascending = False)
sorted_column['Name'].head(5)
data.describe()
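Note: reading .xlsx files with pd.read_excel() requires an Excel engine such as openpyxl to be
installed (pip install openpyxl).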
Ex 4c. Exploring various commands for doing descriptive analytics on the
Iris data set.
Aim
To explore various commands for doing descriptive analytics on the Iris data
set.
Procedure
✓ To understand the idea behind Descriptive Statistics.
✓ Load the packages we will need and also the `iris` dataset.
✓ load_iris() loads in an object containing the iris dataset, which I stored in
`iris_obj`.
✓ Basic statistics.
✓ The number of rows in the dataset can be obtained via count().
✓ Mean for every numeric column
✓ Median for every numeric column
✓ Variance is a measure of dispersion, roughly the “average” squared
distance of a data point from the mean (see the formulas after this list).
✓ The standard deviation is the square root of the variance and interpreted
as the “average” distance a data point is from the mean.
✓ The maximum and minimum values.
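For reference, the sample versions of the two dispersion measures above, with mean x̄ over n
observations (matching the ddof = 1 default used by pandas' var() and std()):
s² = Σ(xᵢ − x̄)² / (n − 1)
s = √s²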
Program Code
import pandas as pd
from pandas import DataFrame
from sklearn.datasets import load_iris
# sklearn.datasets includes common example datasets
# A function to load in the iris dataset
iris_obj = load_iris()
# Dataset preview
iris_obj.data
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names,
                 index=pd.Index([i for i in range(iris_obj.data.shape[0])])).join(
    DataFrame(iris_obj.target, columns=pd.Index(["species"]),
              index=pd.Index([i for i in range(iris_obj.target.shape[0])])))
iris # prints iris data
Commands
iris_obj.feature_names
iris.count()
iris.mean()
iris.median()
iris.var()
iris.std()
iris.max()
iris.min()
iris.describe()
Result
Exploring various commands for doing descriptive analytics on the Iris data set
successfully executed.
5. a. Use the diabetes data set from UCI and Pima Indians
Diabetes data set for performing the following:
Univariate analysis: Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis.
Aim:
To analyse the various univariate measures such as Frequency, Mean, Median,
Mode, Variance, Standard Deviation, Skewness and Kurtosis on a dataset
such as the Pima Indians Diabetes dataset.
Procedure
• Download the Pima Indians Diabetes dataset, save it in any drive and
load it for processing.
• The mean() function can be used to calculate mean/average of a
given list of numbers.
• The median() method calculates the median (middle value) of the
given data set.
• The mode of a set of data values is the value that appears most
often.
• The var() method calculates the variance for each column.
• Standard deviation std() is a number that describes how spread out
the values are.
• The skew() method calculates the skew for each column. Skewness
refers to a distortion or asymmetry that deviates from the
symmetrical bell curve, or normal distribution, in a set of data.
• Kurtosis is also a statistical term and an important characteristic of a
frequency distribution. It determines whether a distribution is heavy-tailed
relative to the normal distribution and provides information about the
shape of the frequency distribution.
Program:
import pandas as pd
from scipy.stats import kurtosis
import pylab as p
df = pd.read_csv (r'd:\\diabetes.csv')
print (df)
df1 = pd.DataFrame(df, columns= ['Age','Glucose'])
print (df1)
df1.mean()
df1.median()
df1.mode()
print(df1.var())
df1.std()
print(df1.skew())
print(kurtosis(df, axis=0, bias=True))
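The aim also lists Frequency, which the listing above does not compute; a one-line sketch for a frequency table of a column (reusing df1):
print(df1['Age'].value_counts())  # frequency of each distinct Age value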
Dataset download link:
https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/Pima%20Indians%20Diabetes%20Dataset.ipynb
Result:
The various univariate functions like Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis on dataset Pima
Indian diabetes dataset are successfully executed.
5 b. Linear Regression and Logistic Regression with the Diabetes Dataset
Using Python Machine Learning
Aim
In this experiment we use the diabetes dataset from sklearn and implement linear
regression and logistic regression over it.
Procedure
Load sklearn Libraries.
Load Data
Load the diabetes dataset
Split Dataset
Create the linear regression and logistic regression models
Make predictions using the testing set
Find the coefficients, mean squared error and coefficient of determination
Program
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#To calculate accuracy measures and confusion matrix
from sklearn import metrics
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# Create Logistic regression object
Logistic_model = LogisticRegression()
Logistic_model.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
% r2_score(diabetes_y_test, diabetes_y_pred))
y_predict = Logistic_model.predict(diabetes_X_train)
#print("Y predict/hat ", y_predict)
y_predict
Output
Coefficients:
[938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
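Note: the sklearn diabetes target is a continuous disease-progression measure rather than a class
label, so linear regression is the natural model for it; when LogisticRegression is fitted on this
target, scikit-learn treats each distinct target value as its own class.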
5 c. Use the diabetes data set from UCI and Pima Indians Diabetes data set
for performing the following: Multiple Regression
Aim
Multiple regression is like linear regression, but with more than one independent
value, meaning that we try to predict a value based on two or more variables.
Procedure
The Pandas module allows us to read csv files and return a DataFrame object.
Then make a list of the independent values and call this variable X.
Put the dependent values in a variable called y.
From the sklearn module we will use the LinearRegression() method to create a
linear regression object.
This object has a method called fit() that takes the independent and dependent
values as parameters and fills the regression object with data that describes the
relationship.
We now have a regression object that is ready to predict Age values based on a person's
Glucose and BloodPressure.
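The fitted model has the form Age ≈ b0 + b1·Glucose + b2·BloodPressure; after fit(), the
intercept b0 is available as regr.intercept_ and the coefficients b1 and b2 as regr.coef_, so the
prediction for Glucose = 150 and BloodPressure = 13 is simply b0 + b1·150 + b2·13.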
Program
import pandas as pd
from sklearn import linear_model
df = pd.read_csv (r'd:\\diabetes.csv')
print (df)
X = df[['Glucose', 'BloodPressure']]
y = df['Age']
regr = linear_model.LinearRegression()
regr.fit(X, y)
predictedage = regr.predict([[150, 13]])
print(predictedage)
Output
[28.77214401]
5 d. Also compare the results of the above analysis for the two data sets.
Aim
In this program, we can compare the results of the two different data sets.
Procedure
Step 1: Prepare the datasets to be compared
Step 2: Create the two DataFrames
Based on the above data, you can then create the following two DataFrames
Step 3: Compare the values between the two Pandas DataFrames
In this step, you’ll need to import the NumPy package.
Let’s say that you have the following data stored in a CSV file called car1.csv
While you have the data below stored in a second CSV file called car2.csv
Program
import pandas as pd
import numpy as np
data_1 = pd.read_csv(r'd:\car1.csv')
df1 = pd.DataFrame(data_1)
data_2 = pd.read_csv(r'd:\car2.csv')
df2 = pd.DataFrame(data_2)
df1['amount1'] = df2['amount1']
df1['prices_match'] = np.where(df1['amount'] == df2['amount1'], 'True', 'False')
df1['price_diff'] = np.where(df1['amount'] == df2['amount1'], 0, df1['amount'] - df2['amount1'])
print(df1)
Output
Model City Year amount amount1 prices_match price_diff
0 Maruti Chennai 2022 600000 600000 True 0
1 Hyndai Chennai 2022 700000 700000 True 0
2 Ford Chennai 2022 800000 850000 False -50000
3 Kia Chennai 2022 900000 900000 True 0
4 XL6 Chennai 2022 1000000 1000000 True 0
5 Tata Chennai 2022 1100000 1150000 False -50000
6 Audi Chennai 2022 1200000 1200000 True 0
7 Ertiga Chennai 2022 1300000 1300000 True 0
Dataset 1: car1.csv
Dataset 2: car2.csv
6 a). Apply and explore various plotting functions on UCI data sets. Density
and contour plots
Aim
To apply and explore various plotting functions like Density and contour
plots on datasets.
Procedure
There are three Matplotlib functions that can be helpful for this
task: plt.contour for contour plots, plt.contourf for filled contour plots,
and plt.imshow for showing images
A contour plot can be created with the plt.contour function. It takes three
arguments: a grid of x values, a grid of y values, and a grid of z values.
The x and y values represent positions on the plot, and the z values will be
represented by the contour levels.
Perhaps the most straightforward way to prepare such data is to use
the np.meshgrid function, which builds two-dimensional grids from one-
dimensional arrays.
Next, draw a standard line-only contour plot; the lines can be color-coded
by specifying a colormap with the cmap argument.
Additionally, we'll add a plt.colorbar() command, which automatically creates an
additional axis with labeled color information for the plot.
Program
%matplotlib inline
import matplotlib.pyplot as plt
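# Note: newer Matplotlib releases renamed this style to 'seaborn-v0_8-white'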
plt.style.use('seaborn-white')
import numpy as np
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black');
Output
plt.contour(X, Y, Z, 20, cmap='RdGy');
Output
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
Output
Result
Various plotting functions like Density and contour plots on datasets are
successfully executed.
6 b). Apply and explore various plotting functions like Correlation and
scatter plots on UCI data sets
Aim
To apply and explore various plotting functions like Correlation and
scatter plots on datasets.
Procedure
✓ Load the diabetes dataset with pandas and list its columns.
✓ Draw a scatter plot of two variables using seaborn's scatterplot().
✓ Fit and plot a regression line with lmplot(), optionally split by a hue variable.
✓ Compute the Pearson correlation between two columns with scipy.stats.pearsonr().
✓ Compute the correlation matrix with corr() and visualize it with a heatmap.
Program
import pandas as pd
con = pd.read_csv('D:/diabetes.csv')
con
list(con.columns)
import seaborn as sns
sns.scatterplot(x="Pregnancies", y="Age", data=con);
Output
sns.lmplot(x="Pregnancies", y="Age", data=con);
Output
sns.lmplot(x="Pregnancies", y="Age", hue="Outcome", data=con);
Output
from scipy import stats
stats.pearsonr(con['Age'], con['Outcome'])
Output
(0.23835598302719774, 2.209975460664566e-11)
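Here stats.pearsonr returns a tuple: the Pearson correlation coefficient (about 0.24, a weak
positive correlation between Age and Outcome) and the two-sided p-value for the null
hypothesis of no correlation.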
cormat = con.corr()
round(cormat,2)
sns.heatmap(cormat);
Output
Result
Various plotting functions like Correlation and scatter plots on datasets
are successfully executed.
6 c. Apply and explore histograms and three dimensional plotting functions on
UCI data sets
Aim
To apply and explore histograms and three dimensional plotting functions
on UCI data sets
Procedure
✓ Download CSV file and upload to explore.
✓ A histogram is used to represent data grouped into ranges (bins).
✓ To create a histogram, the first step is to create bins of the ranges, then distribute
the whole range of the values into a series of intervals, and count the values
which fall into each of the intervals.
✓ Bins are clearly identified as consecutive, non-overlapping intervals of
variables. The matplotlib.pyplot.hist() function is used to compute and create a
histogram of x (see the sketch after this list).
✓ The first one is a standard import statement for plotting using matplotlib,
which you would see for 2D plotting as well.
✓ The second import of the Axes3D class is required for enabling 3D
projections. It is, otherwise, not used anywhere else.
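A minimal standalone sketch of the binning idea described above (the values are synthetic, generated only for illustration):
import numpy as np
import matplotlib.pyplot as plt
values = np.random.default_rng(0).normal(120, 30, 200)  # synthetic glucose-like values
plt.hist(values, bins=10)  # 10 consecutive, non-overlapping bins
plt.show()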
Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # To visualize
from mpl_toolkits.mplot3d import Axes3D
data = pd.read_csv('d:\\diabetes.csv')
data
data['Glucose'].plot(kind='hist')
Output
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111, projection='3d')
Output
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = data['Age'].values
y = data['Glucose'].values
z = data['Outcome'].values
ax.set_xlabel("Age (Year)")
ax.set_ylabel("Glucose (Reading)")
ax.set_zlabel("Outcome (0 or 1)")
ax.scatter(x, y, z, c='r', marker='o')
plt.show()
Output
Result
The histograms and three dimensional plotting functions on UCI data sets are
successfully executed.