Python Crash Course
Instructor: Li Yang
Python
• This lecture covers some basic manipulation of Python
• Python is an interpreted high-level general-purpose programming
language. Its design philosophy emphasizes code readability with its
use of significant indentation.
• We use Databricks notebooks, but you can also use other …
Databricks Notebook
• Notebooks use .ipynb files and also .py files
• You can’t open a .ipynb or .py file directly; you need a notebook to
open it
• Click on Workspace (left column)
• Right-click and choose the Import option
• Browse your local computer and upload the .ipynb or .py file
Databricks File System (DBFS)
• First, type %fs at the start of a notebook cell
• Basic manipulations (commonly used ones; see the sketch after this list):
• ls command: lists the contents of a directory
• cp command: copy a file or directory
• mkdirs command: create a directory if it doesn’t exist
• mv command: move a file or directory
• rm command: remove a file or directory
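A minimal sketch of the same operations through the Python API dbutils.fs, which is predefined in Databricks notebooks; the paths here (e.g. /FileStore/tables/demo.csv) are hypothetical placeholders:

dbutils.fs.ls("/FileStore/tables/")          # ls: list the contents of a directory
dbutils.fs.mkdirs("/FileStore/archive/")     # mkdirs: create a directory if it doesn't exist
dbutils.fs.cp("/FileStore/tables/demo.csv", "/FileStore/archive/demo.csv")  # cp: copy a file
dbutils.fs.mv("/FileStore/archive/demo.csv", "/FileStore/archive/old.csv")  # mv: move/rename
dbutils.fs.rm("/FileStore/archive/old.csv")  # rm: remove a file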
Read data from Databricks
• If your data is not on Databricks, you need to upload it first:
• Click the Data button in the left column
• Create a table
• Upload the file
• Create a table with the UI and choose a cluster for the table preview
• Preview the table. Then open a notebook and type %fs ls /FileStore/tables/
to check the data on Databricks
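The same check can be run in Python; display and dbutils are predefined in Databricks notebooks:

display(dbutils.fs.ls("/FileStore/tables/"))   # confirm the uploaded file is listed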
Dataframe
DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet or SQL table, or a dict of
Series objects. It is generally the most commonly used data object in Python.
• The first step of data processing: read the data and build a dataframe for further use.
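As a minimal sketch (the column names and values are made up for illustration), a small Spark dataframe can also be built by hand in a notebook, where spark is the predefined SparkSession:

# Build a tiny dataframe from Python tuples; columns may have different types
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)
display(df)   # renders it like a small spreadsheet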
Read data from Databricks
• If your data is on databricks already:
• Open or create a new notebook (we choose Python as our default language)
• Read data from DBFS to dataframe:
df = spark.read.format(file_type)\
    .option("multiline", "true")\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(file_location)
• Remark: common readable file types are text, csv, json, parquet, and orc.
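For example, assuming a hypothetical CSV uploaded to /FileStore/tables/people.csv, the generic snippet above specializes to the following (parentheses replace the backslashes so the comments fit):

file_type = "csv"
file_location = "/FileStore/tables/people.csv"   # hypothetical upload path
df = (spark.read.format(file_type)
      .option("multiline", "true")     # allow fields that span multiple lines
      .option("inferSchema", "true")   # let Spark guess each column's data type
      .option("header", "true")        # treat the first row as column names
      .load(file_location))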
Write dataframe as file to Databricks
• Open or create a new notebook (we choose Python as our default
language)
• Write the dataframe from a notebook to DBFS:
df.write.format(file_type)\
    .option("header", "true")\
    .option("delimiter", ",")\
    .save(file_name)
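A concrete sketch with a hypothetical output path; note that Spark saves a directory of part files rather than a single file, and mode("overwrite") replaces any existing output:

(df.write.format("csv")
   .option("header", "true")      # write column names as the first row
   .option("delimiter", ",")      # field separator
   .mode("overwrite")             # replace the output if it already exists
   .save("/FileStore/tables/people_out"))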
Simple manipulations on dataframe
• df.cache(): cache the dataframe for quick reuse
• display(df): render the dataframe as a table
• df.printSchema(): details of the columns: names, data types,
nullability
• df.columns: names of columns
• df.describe().show(): summary of some statistics of each numeric column
• type(object): the data type of the object (see the sketch after this list)
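Run together on the hypothetical dataframe from earlier, these calls look like this (the printed output depends on your data):

df.cache()            # keep the data in memory after the first computation
display(df)           # interactive table view of the dataframe
df.printSchema()      # column names, data types, nullability
print(df.columns)     # e.g. ['name', 'age']
df.describe().show()  # count, mean, stddev, min, max per numeric column
print(type(df))       # <class 'pyspark.sql.dataframe.DataFrame'>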
Simple manipulations on dataframe
• df[["column1", …, "columnN"]]: create a new dataframe with N columns from a
dataframe
• df.withColumnRenamed("old column name", "new column name"): create a new
dataframe replacing an old column name with a new name
• df.withColumn("column name", df["column name"].cast("data type")): create a
new dataframe assigning the data type for the column
• df.withColumn("new column name", lit(default value)): create a new dataframe
adding a new column with a default value
• df.drop("column name"): create a new dataframe dropping the column with that
name
• df.createOrReplaceTempView("SQL temporary view name"): create a SQL
temporary view for later use (see the sketch below)
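A hedged sketch chaining these calls on the hypothetical name/age dataframe from earlier; note that lit must be imported from pyspark.sql.functions:

from pyspark.sql.functions import lit

df2 = df[["name", "age"]]                                # keep two columns
df2 = df2.withColumnRenamed("name", "full_name")         # rename a column
df2 = df2.withColumn("age", df2["age"].cast("integer"))  # change a column's type
df2 = df2.withColumn("country", lit("US"))               # add a constant column
df2 = df2.drop("country")                                # drop it again
df2.createOrReplaceTempView("people")                    # register for SQL queries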
Simple SQL manipulations
• Basic structure of SQL:
• Structure 1: return a new table
select column1, column2, … or * (all the columns)
from table (temporary view)
under conditions
• Structure 2: return values computed from a table
select fun(column1), fun(column2), … or fun(*) (all the columns)
from table (temporary view)
under conditions
• Remark: a column name cannot include spaces. For example, the column name new age is
not allowed. You need to change it to new_age or newage …
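Both structures, run from Python with spark.sql against the hypothetical people view registered above:

# Structure 1: return a new table
display(spark.sql("SELECT full_name, age FROM people"))

# Structure 2: return values computed from a table
display(spark.sql("SELECT COUNT(*) AS n, AVG(age) AS mean_age FROM people"))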
Simple SQL: where
• Basic structure of SQL:
• Structure 1: return a new table
select column1, column2, … or * (all the columns)
from table (temporary view)
where conditions
• Conditions (see the sketch below):
• condition on a column:
• column >, <, =, <> value
• column in (string1, …, stringN)
• column like pattern
• condition1 and (or) condition2
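Hedged examples of each condition type, again on the hypothetical people view:

display(spark.sql("SELECT * FROM people WHERE age > 18"))                       # comparison
display(spark.sql("SELECT * FROM people WHERE full_name IN ('Alice', 'Bob')"))  # membership
display(spark.sql("SELECT * FROM people WHERE full_name LIKE 'A%'"))            # pattern match
display(spark.sql("SELECT * FROM people WHERE age > 18 AND age < 65"))          # combined conditions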