
4220 5 (Python)

The document provides an overview of a Python crash course covering basic Python and DataFrame manipulation in Databricks notebooks. It discusses how to import and open .ipynb and .py files in a notebook, perform basic file operations with DBFS commands, load and save data to/from Databricks, and perform simple manipulations and SQL queries on DataFrames.


Python Crash Course

Instructor: Li Yang
Python

• This lecture covers some basic manipulations in Python

• Python is an interpreted, high-level, general-purpose programming
language. Its design philosophy emphasizes code readability through its
use of significant indentation.

• We use Databricks notebooks, but you can also use other …


Databricks Notebook

• Notebooks use .ipynb files and also .py files

• You can’t open an .ipynb or .py file directly; you need a notebook to
open it

• Click on Workspace (left column)

• Right-click your mouse and choose the Import option

• Browse your local computer and upload the .ipynb or .py file
Databricks File system (DBFS)

• First, type %fs at the top of your notebook cell

• Basic manipulations (commonly used ones):

• ls command: lists the contents of a directory

• cp command: copy a file or directory

• mkdirs command: create a directory if it doesn’t exist

• mv command: move a file or directory

• rm command: remove a file or directory
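
The DBFS commands above mirror standard file operations. As a rough local-filesystem analogy in plain Python (using pathlib/shutil on a temporary directory, not DBFS itself):

```python
import shutil
import tempfile
from pathlib import Path

# Local analogues of the DBFS commands: mkdirs, cp, mv, ls, rm
base = Path(tempfile.mkdtemp())

(base / "dir").mkdir(parents=True, exist_ok=True)   # mkdirs: create if missing
src = base / "a.txt"
src.write_text("hello")

shutil.copy(src, base / "dir" / "a.txt")            # cp: copy a file
shutil.move(str(base / "dir" / "a.txt"),
            str(base / "b.txt"))                    # mv: move a file
listing = sorted(p.name for p in base.iterdir())    # ls: list directory contents
src.unlink()                                        # rm: remove a file
```

In a Databricks notebook, the same operations are also available from Python via dbutils.fs (ls, cp, mv, rm, mkdirs) without the %fs magic.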


Read data from Databricks

• If your data is not on Databricks, you need to upload it first:

• Click the Data button in the left column

• Create a table

• Upload the file

• Create a table with the UI and choose a cluster for the table

• Preview the table. Then open a notebook and type %fs ls FileStore/tables/

to check the data on Databricks


Dataframe

A DataFrame is a 2-dimensional labeled data structure with columns of potentially
different types. You can think of it like a spreadsheet, a SQL table, or a dict of
Series objects. It is generally the most commonly used data object in Python.

• Data processing’s first step: read data and build a DataFrame for further use.
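
As a quick illustration of the “dict of Series” idea, here is a minimal sketch using pandas (the library that description comes from; Spark DataFrames share the same concept, and the column names and values here are made up):

```python
import pandas as pd

# Build a DataFrame from a dict of column lists — each key becomes a
# labeled column, and columns may have different types
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],   # string column
    "age": [34, 28, 45],                 # integer column
})

cols = list(df.columns)        # column labels, like a spreadsheet header row
total_age = int(df["age"].sum())
```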

Read data from Databricks

• If your data is on Databricks already:

• Open or create a new notebook (we choose Python as our default language)

• Read data from DBFS to a DataFrame:

df = spark.read.format(file_type)\
    .option("multiline", "true")\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .load(file_location)

• Remark: readable file types include txt, csv, json, parquet, and orc.
Write DataFrame as a file to Databricks

• Open or create a new notebook (we choose Python as our default
language)

• Write the DataFrame from a notebook to DBFS:

df.write.format(file_type)\
    .option("header", "true")\
    .option("delimiter", ",")\
    .save(file_name)
Simple manipulations on dataframe

• df.cache(): cache the DataFrame for quick reuse

• display(df): display the DataFrame

• df.printSchema(): details of the columns: names, data types,
nullability

• df.columns: names of the columns

• df.describe().show(): summary statistics of each numeric column

• type(object): the data type of the object


Simple manipulations on dataframe

• df[["column1", …, "columnN"]]: create a new DataFrame with N columns from a
DataFrame

• df.withColumnRenamed("old column name", "new column name"): create a new
DataFrame replacing an old column name with a new name

• df.withColumn("column name", df["column name"].cast("data type")): create a
new DataFrame assigning the data type for the column

• df.withColumn("new column name", lit(default value)): create a new DataFrame
adding a new column with a default value

• df.drop("column name"): create a new DataFrame dropping the column with that
name

• df.createOrReplaceTempView("SQL temporary view name"): create a SQL
temporary view for use
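
The rename / cast / add-default / drop pattern above also exists in pandas, which makes it easy to try locally. A minimal sketch by analogy (pandas method names, not the Spark API itself; the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"old_name": ["1", "2"], "extra": [10, 20]})

out = (
    df.rename(columns={"old_name": "new_name"})  # ~ df.withColumnRenamed(...)
      .astype({"new_name": "int64"})             # ~ .cast("data type")
      .assign(flag=0)                            # ~ .withColumn(..., lit(0))
      .drop(columns=["extra"])                   # ~ df.drop("column name")
)
```

As in Spark, each step returns a new DataFrame rather than modifying the original, which is why the calls chain cleanly.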
Simple SQL manipulations

• Basic structure of SQL:

• Structure 1: return a new table

select column1, column2, … or * (all the columns)

from table (temporary view)

under conditions

• Structure 2: return values computed from a table

select fun(column1), fun(column2), … or fun(*) (all the columns)

from table (temporary view)

under conditions

• Remark: a column name cannot include spaces. For example, the column name new age is not allowed.
You need to change it to new_age or newage …
Simple SQL: where

• Basic structure of SQL:

• Structure 1: return a new table

select column1, column2, … or * (all the columns)

from table (temporary view)

where conditions

• Conditions:

• condition on a column:

• column >, <, =, <> number

• column in (string1, …, stringN)

• column like string

• condition1 and (or) condition2
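
The same select / from / where structure works in any SQL engine, so it can be tried without a cluster. A runnable sketch using Python’s built-in sqlite3 module, with an in-memory table standing in for a Spark temporary view (the table name, columns, and rows here are made up):

```python
import sqlite3

# In-memory database with a small table to query against
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("alice", 34), ("bob", 28), ("carol", 45)],
)

# Comparison condition combined with a LIKE pattern via AND
rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 AND name LIKE 'c%'"
).fetchall()

# IN condition combined with a comparison
rows2 = conn.execute(
    "SELECT name FROM people WHERE name IN ('alice', 'bob') AND age > 30"
).fetchall()
```

In Databricks the same query text would run via spark.sql(...) against a view created with df.createOrReplaceTempView(...).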
