PANDAS
You can create a Pandas series from an array-like
object using the following command: pd.Series(data, dtype=dtype)
To create a dataframe from a dictionary, you can run
the following command: pd.DataFrame(dictionary_name)
You can also provide lists or arrays to create dataframes. In that case, you should
specify the column names as shown below; otherwise, Pandas labels the columns 0, 1, 2, ...
pd.DataFrame(list_or_array, columns = ['column_1', 'column_2'])
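The three constructors above can be sketched as follows. This is a minimal illustration; the variable names and sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# A series from a list, with an explicit dtype (sample values are hypothetical)
prices = pd.Series([10, 20, 30], dtype="float64")

# A dataframe from a dictionary: the keys become the column names
from_dict = pd.DataFrame({"city": ["Pune", "Delhi"], "sales": [250, 400]})

# A dataframe from a 2-D array: the column names are passed explicitly
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                          columns=["column_1", "column_2"])
```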
You can use the following command to load data into a dataframe from a csv
file:
pd.read_csv(filepath, sep=',', header='infer')
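As a small sketch of read_csv, the example below reads from an in-memory string instead of a file so that it runs without any data on disk. The CSV contents are hypothetical; with a real file, you would pass its path in place of the StringIO object.

```python
import io
import pandas as pd

# Hypothetical CSV contents; the first line is used as the header row
csv_text = "name,score\nAsha,81\nRavi,67\n"

# io.StringIO stands in for a file path here
scores = pd.read_csv(io.StringIO(csv_text), sep=",", header="infer")
```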
Use the following code to change the row indices (the list must contain one label per
row):
dataframe_name.index = list_of_row_labels
To set the index while loading the data from a file,
you can use the parameter 'index_col':
pd.read_csv(filepath, index_col = column_number)
For the column headers, you can specify the column names using the following
code:
dataframe_name.columns = list_of_column_names
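The index and header operations above might look like this in practice. The data and labels are hypothetical, and an in-memory string is used in place of a file path.

```python
import io
import pandas as pd

# Hypothetical CSV contents
csv_text = "id,name,score\n101,Asha,81\n102,Ravi,67\n"

# index_col=0 uses the first column ("id") as the row index
by_id = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Replace the row labels and column headers after loading
relabelled = by_id.copy()
relabelled.index = ["first", "second"]
relabelled.columns = ["student", "marks"]
```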
While working with Pandas, the dataframes may hold large volumes of data. It
would be inefficient to display the entire dataframe every time you want to
inspect it. Instead, you can use the following code to view a limited number of
entries (the first five rows by default, or the first n rows with head(n)):
dataframe_name.head()
dataframe.info(): This method prints information about the dataframe, which
includes the index data type and column data types, the count of non-null values and
the memory used.
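dataframe.describe(): This method produces descriptive statistics for the
dataframe, that is, the count, the central tendency (mean and median), the dispersion
(standard deviation, minimum, maximum and quartiles), etc. For a dataframe with mixed
data types, it summarises only the numeric columns by default; pass include='all' to
also report the count, number of unique values, most frequent value and its frequency
for non-numeric columns.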
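A short sketch of the three inspection methods on a small, made-up dataframe:

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column
df = pd.DataFrame({"item": list("abcdefgh"), "qty": list(range(1, 9))})

first_rows = df.head()     # first five rows by default
first_three = df.head(3)   # first three rows

df.info()                  # prints dtypes, non-null counts and memory usage

summary = df.describe()                    # numeric columns only by default
full_summary = df.describe(include="all")  # also covers the text column
```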
The selection of rows in dataframes is similar to the indexing you saw in NumPy
arrays.
The syntax df[start_index:end_index] will subset the rows by position when the
indices are integers: it returns the rows from start_index up to, but not
including, end_index.
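For instance, on a hypothetical five-row dataframe with the default integer index, the slice below picks the rows at positions 1, 2 and 3:

```python
import pandas as pd

# Hypothetical data with the default integer index 0..4
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Rows at positions 1, 2 and 3; position 4 is excluded
subset = df[1:4]
```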
You can select one or more columns from a dataframe using the following
commands:
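df['column'] or df.column: It returns a series (the df.column form works only when
the name is a valid Python identifier that does not clash with a dataframe attribute)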
df[['col_x', 'col_y']]: It returns a dataframe
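The difference between the two return types can be seen in a small sketch; the column names and values are placeholders.

```python
import pandas as pd

# Hypothetical three-column dataframe
df = pd.DataFrame({"col_x": [1, 2], "col_y": [3, 4], "col_z": [5, 6]})

one_column = df["col_x"]             # single label: a series
same_column = df.col_x               # attribute access: the same series
two_columns = df[["col_x", "col_y"]] # list of labels: a dataframe
```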
You can use the loc indexer to extract rows and columns from a dataframe
based on their labels:
dataframe.loc[[list_of_row_labels], [list_of_column_labels]]
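A minimal sketch of label-based selection, assuming a dataframe whose row labels are names (all labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical marks table indexed by student name
df = pd.DataFrame({"maths": [90, 75, 60], "physics": [85, 70, 65]},
                  index=["asha", "ravi", "meena"])

# Lists of labels return a dataframe
picked = df.loc[["asha", "meena"], ["physics"]]

# Single labels return a single value
single = df.loc["ravi", "maths"]
```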
You can use the following code to rename row labels and columns. It returns a new
dataframe and leaves the original unchanged:
dataframe.rename(index={row_index: "new_name"}, columns={column_name: "new_name"})
You can use the following code to set a multilevel index in a dataframe:
dataframe.set_index([column_1, column_2])
To obtain data from such dataframes, you have to provide the row details as a
tuple inside a list. You can go through the code provided below for reference:
dataframe.loc[[(label_1, sub_label_1), (label_1, sub_label_2)],
[column_label_1, column_label_2]]
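The multilevel index and the tuple-based lookup can be sketched together as follows; the sales figures and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2022, 2023, 2022, 2023],
    "units": [5, 7, 3, 9],
    "revenue": [50, 70, 30, 90],
})

# Two columns become the two levels of the row index
indexed = sales.set_index(["region", "year"])

# Each row is addressed by a (level_1, level_2) tuple inside a list
rows = indexed.loc[[("north", 2022), ("south", 2023)], ["units", "revenue"]]
```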
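You can use the following command to create pivot tables in Pandas. Note that
pivot() only reshapes the data and does not aggregate: it raises an error if the
same index/column pair occurs more than once.
df.pivot(columns='grouping_variable_col', values='value_col',
index='grouping_variable_row')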
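As an illustration, the sketch below reshapes a hypothetical long-format table, in which each day/store pair appears exactly once, into a wide table:

```python
import pandas as pd

# Hypothetical long-format data: one row per (day, store) pair
long_df = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "store": ["a", "b", "a", "b"],
    "sales": [1, 2, 3, 4],
})

# Days become the rows, stores become the columns, sales fill the cells
wide_df = long_df.pivot(index="day", columns="store", values="sales")
```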
Using the pivot_table() function, you can specify the aggregate function
that Pandas should apply to the value columns. It could be the
same or different for each column in the dataframe.
df.pivot_table(values=value_columns, index=grouping_variable_row,
aggfunc={'value_1': 'mean', 'value_2': ['min', 'max', 'mean']})
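A small sketch of per-column aggregation, using invented order data. The aggregate functions are given by name as strings; when a list of functions is supplied, the resulting columns form a two-level index of (value column, function).

```python
import pandas as pd

# Hypothetical order data
orders = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "value_1": [10, 20, 30, 50],
    "value_2": [1, 4, 2, 8],
})

# Mean of value_1, and both min and max of value_2, per region
table = orders.pivot_table(
    values=["value_1", "value_2"],
    index="region",
    aggfunc={"value_1": "mean", "value_2": ["min", "max"]},
)
```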
You can use the following command to merge two dataframes:
dataframe_1.merge(dataframe_2, on = ['column_1', 'column_2'], how = '____')
The how parameter in the code above specifies the type of merge to be performed:
left: This keeps every key from the first dataframe and adds the matching entries
from the second; unmatched values are filled with NaN.
right: This keeps every key from the second dataframe and adds the matching entries
from the first; unmatched values are filled with NaN.
outer: This takes the union of the keys from both dataframes.
inner: This takes the intersection of the keys from both dataframes (the default).
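The four merge types can be compared on two small, hypothetical dataframes that share only some of their keys:

```python
import pandas as pd

# Hypothetical frames: keys 2 and 3 appear in both
left_df = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right_df = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = left_df.merge(right_df, on="key", how="inner")       # keys 2, 3
outer = left_df.merge(right_df, on="key", how="outer")       # keys 1, 2, 3, 4
left_join = left_df.merge(right_df, on="key", how="left")    # keys 1, 2, 3
right_join = left_df.merge(right_df, on="key", how="right")  # keys 2, 3, 4
```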
You can add columns or rows from one dataframe to another using the
concat() function (axis = 0 stacks the rows, axis = 1 places the columns side by side):
pd.concat([dataframe_1, dataframe_2], axis = _)
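Both directions can be sketched as follows. The frames are hypothetical, and ignore_index=True is an optional extra used here to renumber the stacked rows.

```python
import pandas as pd

# Hypothetical frames
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
c = pd.DataFrame({"y": [5, 6]})

# axis=0 appends the rows of b below the rows of a
stacked_rows = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1 places the columns of c next to the columns of a
side_by_side = pd.concat([a, c], axis=1)
```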