Basics of Pandas

The document provides a comprehensive overview of using the pandas library in Python for data manipulation, including reading CSV files, filtering rows, selecting columns, and handling date types. It also covers various functions such as df.head(), df.tail(), and df.describe() to analyze dataframes. Additionally, it explains how to manage data types and handle missing values in datasets.

Uploaded by

iamsrijan47

1.

import pandas as pd
df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10.csv")
df    # evaluating df on its own displays the dataframe
ID  Name             Companies      Profit    Growth
1   Lamtone          IT Services    5274553   30%
2   Stripfind        Finance        23797493  20%
3   Canecorporation  Health         0         7%
4   Mattouch         IT Services    0         26%
5   Techdrill        Insurance      0         8%
6   Techline         Health         123455    23%
7   Cityace          Health         3005116   6%
8   Kayelectronics   Health         5573830   4%
9   Ganzlax          IT Services    452893    18%
10  Trantraxlax      Govt Services  5453060   7%

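df['Name']
Selects a single column and returns it as a Series.

df[df['Age'] > 30]
Filters rows: keeps only the rows where the condition is True.
('Age' is a generic example column; Fortune_10.csv has no such column.)

df['Age_plus_5'] = df['Age'] + 5
Adds a new column computed from an existing one.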
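
The three operations above can be tried on a small frame. This is a minimal sketch; the names and ages are made-up illustration data, not part of Fortune_10.csv.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the three operations.
people = pd.DataFrame({"Name": ["Asha", "Ben", "Chen"], "Age": [25, 35, 41]})

names = people["Name"]                     # select one column -> Series
over_30 = people[people["Age"] > 30]       # keep rows where the condition is True
people["Age_plus_5"] = people["Age"] + 5   # add a column computed from another

print(names)
print(over_30)
print(people)
```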

read_csv also accepts a URL in place of a local file path.


2. df=pd.read_csv(<name of the file>, sep='\t', names=['col1', ...]) reads a
tab-separated (TSV) file. names supplies the column names when the file has no
header row.
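
A minimal sketch of reading tab-separated data. The text is held in memory with io.StringIO so the example runs without a file; the three column names are chosen here for illustration.

```python
import io
import pandas as pd

# Two tab-separated lines with no header row (illustration data).
tsv_text = "1\tLamtone\t30%\n2\tStripfind\t20%\n"

# sep="\t" splits on tabs; names= provides the column names,
# so the first line is treated as data rather than as a header.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t", names=["ID", "Name", "Growth"])
print(df_tsv)
```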

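3.df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10.csv",
index_col='Name')
This uses the Name column as the row index instead of the default 0, 1, 2, ...
4.df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10.csv", header=1)
header=1 takes the column names from line 1 of the file (counting from 0), so
the first data row becomes the header and the original header line is dropped: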

   1   Lamtone          IT Services    5274553   30%
0  2   Stripfind        Finance        23797493  20%
1  3   Canecorporation  Health         0         7%
2  4   Mattouch         IT Services    0         26%
3  5   Techdrill        Insurance      0         8%
4  6   Techline         Health         123455    23%
5  7   Cityace          Health         3005116   6%
6  8   Kayelectronics   Health         5573830   4%
7  9   Ganzlax          IT Services    452893    18%
8  10  Trantraxlax      Govt Services  5453060   7%
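
The effect of index_col and header can be checked on an in-memory copy of the first rows. This is a sketch: the CSV text is retyped from the table in these notes, and the comma separator is an assumption about the original file.

```python
import io
import pandas as pd

# First three rows of the Fortune data, retyped as comma-separated text.
csv_text = """ID,Name,Companies,Profit,Growth
1,Lamtone,IT Services,5274553,30%
2,Stripfind,Finance,23797493,20%
3,Canecorporation,Health,0,7%
"""

# index_col="Name": the Name column becomes the row index.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Name")
print(by_name.loc["Stripfind", "Profit"])

# header=1: line 1 of the file (the Lamtone row) supplies the column names.
shifted = pd.read_csv(io.StringIO(csv_text), header=1)
print(list(shifted.columns))
```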

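4. usecols=[1,4] reads only the columns at positions 1 and 4 (counting from 0),
here Name and Growth.

To extract columns 1 to 4 (homework):
usecols=list(range(1,5))
5. squeeze=True returned a Series instead of a one-column dataframe. This
parameter was deprecated in pandas 1.4 and removed in 2.0; call
.squeeze("columns") on the result of read_csv instead.
6. skiprows=[2,4] skips lines 2 and 4 of the file. Line numbers count from 0 and
include the header line, so skiprows=[1,3] in the example below drops the
Lamtone and Canecorporation rows.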
df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10.csv", usecols =
[1,4],skiprows=[1,3])
df
Name Growth
0 Stripfind 20%
1 Mattouch 26%
2 Techdrill 8%
3 Techline 23%
4 Cityace 6%
5 Kayelectronics 4%
6 Ganzlax 18%
7 Trantraxlax 7%
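
A minimal sketch combining usecols, skiprows, and the replacement for squeeze. The CSV text is retyped from the table in these notes and held in memory; the comma separator is an assumption about the original file.

```python
import io
import pandas as pd

csv_text = """ID,Name,Companies,Profit,Growth
1,Lamtone,IT Services,5274553,30%
2,Stripfind,Finance,23797493,20%
3,Canecorporation,Health,0,7%
4,Mattouch,IT Services,0,26%
5,Techdrill,Insurance,0,8%
"""

# Columns at positions 1 and 4; file lines 1 and 3 (counting the header as 0) skipped.
subset = pd.read_csv(io.StringIO(csv_text), usecols=[1, 4], skiprows=[1, 3])
print(subset)

# Columns 1 to 4 inclusive: range(1, 5) stops before 5.
cols_1_to_4 = pd.read_csv(io.StringIO(csv_text), usecols=list(range(1, 5)))

# Replacement for squeeze=True: read one column, then squeeze it to a Series.
names_only = pd.read_csv(io.StringIO(csv_text), usecols=["Name"]).squeeze("columns")
print(type(names_only))
```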

7. df.head(10) returns the first 10 rows of the dataframe.

8. To read only the first few rows of a file: pd.read_csv('fortune.csv',nrows=5)
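
The difference between the two is where the cut happens: nrows limits what is read from the file, while head trims a dataframe that has already been loaded. A small sketch on in-memory text (retyped from the table in these notes, comma separator assumed):

```python
import io
import pandas as pd

csv_text = """ID,Name,Companies,Profit,Growth
1,Lamtone,IT Services,5274553,30%
2,Stripfind,Finance,23797493,20%
3,Canecorporation,Health,0,7%
4,Mattouch,IT Services,0,26%
5,Techdrill,Insurance,0,8%
"""

first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)  # parses only 2 data rows
full = pd.read_csv(io.StringIO(csv_text))                # parses everything
top_two = full.head(2)                                   # then keeps the first 2
top_ten = full.head(10)                                  # asks for more rows than exist

print(first_two)
print(len(top_ten))
```

Asking head for more rows than the dataframe holds simply returns all of them.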

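9.df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10.csv", usecols =
[1,4],skiprows=[1,3], dtype={'Profit':float})
df
dtype={'Profit':float} asks for the Profit column to be read as float. Profit is
not among the columns selected by usecols here, so the output is the same as
before: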
Name Growth
0 Stripfind 20%
1 Mattouch 26%
2 Techdrill 8%
3 Techline 23%
4 Cityace 6%
5 Kayelectronics 4%
6 Ganzlax 18%
7 Trantraxlax 7%

Python has four built-in collection types: list, set, tuple, and dictionary.

A dictionary stores key-value pairs: {key: value}. The dtype argument above is a
dictionary that maps a column name to a type.
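
A short sketch of how these collection types show up as read_csv arguments: a list for usecols and a dictionary for dtype. The CSV text is retyped from the table in these notes; the variable names are chosen for illustration.

```python
import io
import pandas as pd

csv_text = """ID,Name,Companies,Profit,Growth
1,Lamtone,IT Services,5274553,30%
2,Stripfind,Finance,23797493,20%
"""

wanted_columns = ["Name", "Profit"]   # list: an ordered collection of column names
column_types = {"Profit": float}      # dictionary: key = column name, value = type

df = pd.read_csv(io.StringIO(csv_text), usecols=wanted_columns, dtype=column_types)
print(df.dtypes)
```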

10. By default, dates are read as strings (dtype object). To work with them as
dates, the column's dtype must be a datetime; otherwise date operations are not
available.
Either pass parse_dates to read_csv or convert the column afterwards:
parse_dates=['Date']
df['Date'] = pd.to_datetime(df['Date'])
df.info()
df.info() confirms the new dtype. To extract a part of the date such as the
month, use df['Date'].dt.month.

import pandas as pd
df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10_new.csv",
parse_dates=['Date'])
df.info()
df['Date'] = pd.to_datetime(df['Date'].astype(str).str.strip(), errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company 144 non-null object
1 Date 144 non-null object
2 rank 144 non-null object
3 rank_change 142 non-null float64
4 revenue 144 non-null float64
5 profit 144 non-null object
6 num. of employees 143 non-null float64
7 sector 144 non-null object
8 city 144 non-null object
9 state 144 non-null object
10 prev_rank 144 non-null object
11 CEO 144 non-null object
12 Website 144 non-null object
13 Ticker 137 non-null object
14 Market Cap 138 non-null float64
dtypes: float64(4), object(11)
memory usage: 117.3+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 company 144 non-null object
1 Date 6 non-null datetime64[ns]
2 rank 144 non-null object
3 rank_change 142 non-null float64
4 revenue 144 non-null float64
5 profit 144 non-null object
6 num. of employees 143 non-null float64
7 sector 144 non-null object
8 city 144 non-null object
9 state 144 non-null object
10 prev_rank 144 non-null object
11 CEO 144 non-null object
12 Website 144 non-null object
13 Ticker 137 non-null object
14 Market Cap 138 non-null float64

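df['Date'] = pd.to_datetime(df['Date'],errors='coerce') also works. With
errors='coerce', values that cannot be parsed become NaT (missing), which is why
Date shows only 6 non-null entries after the conversion.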
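
The conversion can be sketched on a tiny in-memory file. The company names and dates below are made up for illustration; one value is deliberately not a date to show what errors='coerce' does.

```python
import io
import pandas as pd

# Hypothetical data: two valid dates and one unparseable value.
csv_text = """company,Date
Acme,2024-01-15
Globex,2024-02-20
Initech,not a date
"""

df = pd.read_csv(io.StringIO(csv_text))

# Strip stray whitespace, then convert; unparseable values become NaT.
df["Date"] = pd.to_datetime(df["Date"].astype(str).str.strip(), errors="coerce")

# The .dt accessor exposes date parts once the column is a datetime.
df["Month"] = df["Date"].dt.month

df.info()
print(df)
```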

11. df.info() gives the metadata of the dataframe: the number of entries, the
columns, the non-null count and dtype of each column, and the memory usage.
12.df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10.csv",
na_values=['True'])
df
na_values lists additional strings to treat as NA/NaN, on top of the defaults.
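
A minimal sketch of na_values. The notes use 'True' as the extra marker; this example uses the hypothetical marker "missing" in a numeric column so the effect is easy to see.

```python
import io
import pandas as pd

# Illustration data: "missing" stands in for an absent profit figure.
csv_text = """Name,Profit
Lamtone,5274553
Stripfind,missing
"""

df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
print(df)
print(df["Profit"].isna())
```

Because the marker is read as NaN, the Profit column is parsed as numeric rather than as text.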
//
pd.read_csv(
filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
*,
sep: 'str | None | lib.NoDefault' = <no_default>,
delimiter: 'str | None | lib.NoDefault' = None,
header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
index_col: 'IndexLabel | Literal[False] | None' = None,
usecols: 'UsecolsArgType' = None,
dtype: 'DtypeArg | None' = None,
engine: 'CSVEngine | None' = None,
converters: 'Mapping[Hashable, Callable] | None' = None,
true_values: 'list | None' = None,
false_values: 'list | None' = None,
skipinitialspace: 'bool' = False,
skiprows: 'list[int] | int | Callable[[Hashable], bool] | None' = None,
skipfooter: 'int' = 0,
nrows: 'int | None' = None,
na_values: 'Hashable | Iterable[Hashable] | Mapping[Hashable,
Iterable[Hashable]] | None' = None,
keep_default_na: 'bool' = True,
na_filter: 'bool' = True,
verbose: 'bool | lib.NoDefault' = <no_default>,
skip_blank_lines: 'bool' = True,
parse_dates: 'bool | Sequence[Hashable] | None' = None,
infer_datetime_format: 'bool | lib.NoDefault' = <no_default>,
keep_date_col: 'bool | lib.NoDefault' = <no_default>,
date_parser: 'Callable | lib.NoDefault' = <no_default>,
date_format: 'str | dict[Hashable, str] | None' = None,
dayfirst: 'bool' = False,
cache_dates: 'bool' = True,
iterator: 'bool' = False,
chunksize: 'int | None' = None,
compression: 'CompressionOptions' = 'infer',
thousands: 'str | None' = None,
decimal: 'str' = '.',
lineterminator: 'str | None' = None,
quotechar: 'str' = '"',
quoting: 'int' = 0,
doublequote: 'bool' = True,
escapechar: 'str | None' = None,
comment: 'str | None' = None,
encoding: 'str | None' = None,
encoding_errors: 'str | None' = 'strict',
dialect: 'str | csv.Dialect | None' = None,
on_bad_lines: 'str' = 'error',
delim_whitespace: 'bool | lib.NoDefault' = <no_default>,
low_memory: 'bool' = True,
memory_map: 'bool' = False,
float_precision: "Literal['high', 'legacy'] | None" = None,
storage_options: 'StorageOptions | None' = None,
dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>,
) -> 'DataFrame | TextFileReader'
//

12.
import pandas as pd
df1=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/IPL_dataset.csv")
df1.info()
type(df1)

13.
df.head()
df.tail()
These return the first and last rows of the dataframe (5 rows by default).
df.shape gives the size of the df as a tuple: (rows, columns).
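
A quick sketch on a ten-row frame built in memory (the column name n is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

print(df.head())    # first 5 rows by default
print(df.tail(3))   # last 3 rows
print(df.shape)     # (rows, columns); an attribute, so no parentheses
```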

14. df.describe() summarises each numeric column: count, mean, standard
deviation, min, max, and the 25th, 50th and 75th percentiles.
By default it analyses only the numerical columns of the dataframe, so object
columns are ignored.
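
A minimal sketch using four rows from the Fortune table. It also shows include="all", which asks describe to cover the non-numeric columns as well.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Lamtone", "Stripfind", "Canecorporation", "Mattouch"],
    "Profit": [5274553, 23797493, 0, 0],
})

summary = df.describe()            # numeric columns only: Name is left out
print(summary)

everything = df.describe(include="all")   # adds text columns (count, unique, top, freq)
print(everything)
```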

15. loc and iloc return the rows that match an index value: loc selects by index
label, iloc by integer position. A single column is selected by name:

df1['winner']
0 Sunrisers Hyderabad
1 Rising Pune Supergiant
2 Kolkata Knight Riders
3 Kings XI Punjab
4 Royal Challengers Bangalore
...
631 Royal Challengers Bangalore
632 Royal Challengers Bangalore
633 Sunrisers Hyderabad
634 Sunrisers Hyderabad
635 Sunrisers Hyderabad
Name: winner, Length: 636, dtype: object

16.df1[['team1','team2','winner']]
Passing a list of column names selects several columns and returns a dataframe.

     team1                        team2                        winner
0    Sunrisers Hyderabad          Royal Challengers Bangalore  Sunrisers Hyderabad
1    Mumbai Indians               Rising Pune Supergiant       Rising Pune Supergiant
2    Gujarat Lions                Kolkata Knight Riders        Kolkata Knight Riders
3    Rising Pune Supergiant       Kings XI Punjab              Kings XI Punjab
4    Royal Challengers Bangalore  Delhi Daredevils             Royal Challengers Bangalore
...  ...                          ...                          ...
631  Delhi Daredevils             Royal Challengers Bangalore  Royal Challengers Bangalore
632  Gujarat Lions                Royal Challengers Bangalore  Royal Challengers Bangalore
633  Sunrisers Hyderabad          Kolkata Knight Riders        Sunrisers Hyderabad
634  Gujarat Lions                Sunrisers Hyderabad          Sunrisers Hyderabad
635  Sunrisers Hyderabad          Royal Challengers Bangalore  Sunrisers Hyderabad

17.df1.iloc[2:10:3]
Selects rows by position from 2 up to (but not including) 10 in steps of 3,
i.e. rows 2, 5 and 8.
type(df)
pandas.core.frame.DataFrame
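
The difference between loc and iloc only shows when the index labels differ from the positions. A sketch on a made-up frame whose index starts at 100 (the letters stand in for team names):

```python
import pandas as pd

# Hypothetical data: 12 rows labelled 100..111, so labels != positions.
df = pd.DataFrame({"winner": list("ABCDEFGHIJKL")}, index=range(100, 112))

picked = df.iloc[2:10:3]           # positions 2, 5, 8 (stop 10 is excluded)
print(picked)

by_label = df.loc[102, "winner"]       # row with index label 102
by_position = df.iloc[2]["winner"]     # row at position 2 (the same row here)
print(by_label, by_position)
```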

df=pd.read_csv("C:/Users/MSCLAB-32/Desktop/Pandas/Fortune_10.csv",
dtype={'Profit':float})
df
   ID  Name             Companies      Profit      Growth
0  1   Lamtone          IT Services    5274553.0   30%
1  2   Stripfind        Finance        23797493.0  20%
2  3   Canecorporation  Health         0.0         7%
3  4   Mattouch         IT Services    0.0         26%
4  5   Techdrill        Insurance      0.0         8%
5  6   Techline         Health         123455.0    23%
6  7   Cityace          Health         3005116.0   6%
7  8   Kayelectronics   Health         5573830.0   4%
8  9   Ganzlax          IT Services    452893.0    18%
9  10  Trantraxlax      Govt Services  5453060.0   7%
