KEMBAR78
SnowFlake Notes | PDF | Databases | Data Type
100% found this document useful (1 vote)
5K views40 pages

SnowFlake Notes

Snowflake offers 4 user roles with varying levels of access. It uses schemas and warehouses similarly to a traditional data warehouse. Stages are used to temporarily store external data before loading it. Snowflake is fully managed, secure, and scalable in the cloud. It uses a shared-nothing architecture with Amazon S3 for storage and auto-scales compute clusters.

Uploaded by

Shivani Chauhan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
5K views40 pages

SnowFlake Notes

Snowflake offers 4 user roles with varying levels of access. It uses schemas and warehouses similarly to a traditional data warehouse. Stages are used to temporarily store external data before loading it. Snowflake is fully managed, secure, and scalable in the cloud. It uses a shared-nothing architecture with Amazon S3 for storage and auto-scales compute clusters.

Uploaded by

Shivani Chauhan
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 40

SnowFlake Notes

SnowFlake Web UI - Essentials Data Warehousing

User Roles
● There are 4 user roles available in Snowflake.
○ SYSADMIN - default role
○ PUBLIC - with same options as SYSADMIN
○ ACCOUNTADMIN - additional options for accounts and notifications
○ SECURITYADMIN - additional options for accounts

Points to consider
● RDBMS: Databases with tables made up of rows and columns. These tables are
situated inside schemas.

● Schemas: They are organisational buckets. One schema can have multiple
tables.

● A user might have access to one schema but not another schema in the same
database. Also, a user might have access to one object in a schema but not
others.

● We access a table by: “DatabaseName”.”SchemaName”.”TableName”

● View: A saved SELECT statement that we can use as it were a table. It is a named
definition of a query.
2

Warehouse
● Warehouse is a collection of computing power which is used for loading,
unloading and querying data. It does NOT store/contain data. It’s not a place.

Real World Traditional Data Snowflake


Warehouse Term Warehouse Term Alternate Term

Goods Data Storage

Workers Servers Compute

● Each warehouse has a single cluster of servers. The number of servers in that
cluster vary based on the warehouse’s size designation. Choosing the warehouse
size is the same as choosing the number of servers in the cluster.
● All Snowflake warehouses have access to all the data, all the time.
● Smaller warehouses cost less while running for the same amount of time as a
larger warehouse. Use ‘Auto Suspend’ to control costs.
● Scaling Up - Moving to a larger warehouse
● Scaling Down - Moving to a smaller warehouse

Elastic Data Warehousing


● Optimizes data warehouse usage.
● Snowflake Standard and Premier Editions do NOT include this option.
● Has two additional components:
○ Multi-Cluster Options
○ Scaling Policy Options
● A Scaling policy helps Snowflake decide when to move from a single cluster to
more than one and when to snap back to a single cluster.

Staging
● Stages or Staging Areas are places to put things temporarily before moving
them to a more stable location. Eg.- Cloud folders or Directories
● There are two types of stages:
○ Internal Stages - act mostly like directories inside a Snowflake Account’s
local storage.
○ External Stages - act mostly like a secure gateway between Cloud Storage
Services and Snowflake Services.
● We can view the stages by: ‘list @stage_name’

Data Type Trends


3

● Unstructured: Images, PDFs, Videos


● Quasi-Structured: Clickstream
● Semi-Structured: XML, JSON, Parquet, AVRO, ORC
● Structured: RDBMS databases, CSV files

Training Sessions

Database
Collection of organised data for querying and transactional processing

Data warehouse
Collection of organised data from multiple sources for analytical processing rather than
transactional processing

Features
● Snowflake is a cloud data warehouse that runs entirely on cloud infrastructure
and cannot be run on private cloud or hosted infrastructure.
● It is available on AWS, Azure and Google Cloud.
● No PK/FK constraints since it is not a relational database.
● SQL commands: DDL, DML, Stored Procedures (JS), UDF, SQL functions
(** Stored procedures are prepared codes which can be saved and reused)
● Views - Materialized and Non-Materialized
● ACID Transactions - Concurrency and Workload Separation
● Database and Object Cloning
● Database failover and replication facility between accounts
● Continuous data protection - Time Travel and Fail Safe

Integration Support
● Self-Service BI: Tableau, QlikView
● Big Data Tools: Kafka, Spark, Databricks
● JDBC/ODBC Drivers: for connecting Java applications with the SQL Databases
● Languages: Python, Go, Node.js
4

Unique Offerings
● Scalability - both Storage and Compute
● User Experience - No indexing, no performance tuning, no partitioning, no
physical storage design
● Tunable pay per use - adjust costs

Storage and Compute


● Database: It is the storage layer. Create database, schemas, tables and load data.
Data is stored in Amazon S3 buckets.
● Virtual Data Warehouse: It can be thought of as a virtual machine. It is a
collection of computing power.
● As we run queries, the compute cost gets attached to the database, otherwise
only storage cost is attached.

Size of Data Warehouse


● The various sizes available are:
○ X-Small: Single Node
○ Small: Two Nodes
○ Medium: Four Nodes
○ Large: Eight Nodes
○ X-Large: Sixteen Nodes
● They can be thought of as Amazon EC2 instances.
● We can use a single node for DDL queries, four nodes for Loading data, 2XL for
Processing tasks or we can run multiple concurrent queries in a single 2XL.
● If all the compute power gets utilised, the queries are queued while using a single
cluster, but if we create a multi-cluster VW, it enables auto-scaling and it
launches another VW and shut it down once used.
5

Architecture

It has 3 layers - Service layer, Compute layer and Storage layer. Service layer has a fixed
cost while Compute and Storage layer uses Pay per usage method.

Snowflake Architecture is a hybrid of Shared-Disk and Shared-Nothing Architecture. It


uses shared S3 storage and processes the queries in multiple parallel processing
compute clusters.
6

Snowflake Objects

Why Snowflake?
It uses Amazon S3 as its Storage Layer.

Challenges with S3:


7

● High I/O Latency


● High CPU Overhead
● Object Storage - Overwrite only (Cannot append or update)

Features of S3:

● High Availability
● Durability
● Uses API to read data parts (range based)

S3 does not read in unnecessary volumes, it reads data in parts. It creates micro
partitions of uncompressed data, reorganises it into columnar format (column values
are stored together) and compresses them individually. It then adds headers to it which
contains offsets (metadata) for easy access and stores in S3.
8

Tables

Permanent Temporary Transient External

● Persists until ● Persists ● Persists until ● Persists


dropped and tied to dropped until
● Designed for a session ● Multiple user removed
data that (think of ● Used for data ● Snowflake
requires the single user) that needs to over an
highest level ● Used for persist, but external
of data transitory does not need data table
protection data (eg - the same level ● Data
and recovery ETL/ELT) of data accessed
● Default table retention as a via an
type permanent external
table. stage
● Read-only

Time Travel - Upto Time Travel - 0 to Time Travel - 0 to 1 No time travel


90 days with 1 day day
enterprise

Has fail safe period No fail safe No fail safe No fail safe

Create Like.. option creates another table with the same metadata, while clone option
creates a copy of the entire table.
Similar to the other table types (transient and permanent), temporary tables belong to a
specified database and schema; however, because they are session-based, they aren’t
9

bound by the same uniqueness requirements. This means you can create temporary
and non-temporary tables with the same name within the same schema.

However, note that the temporary table takes precedence in the session over any other
table with the same name in the same schema.

To create a temporary table, simply specify the TEMPORARY keyword (or TEMP
abbreviation) in CREATE TABLE. For example:

create temporary table mytemptable (id number, creation_date date);

Transient tables are similar to permanent tables with the key difference that they
do not have a Fail-safe period. As a result, transient tables are specifically
designed for transitory data that needs to be maintained beyond each session (in
contrast to temporary tables), but does not need the same level of data
protection and recovery provided by permanent tables.
To create a transient table, schema, database, simply specify the TRANSIENT keyword
when creating the object. For example, to create a transient table:

create transient table mytranstable (id number, creation_date date);

The Time Travel retention period for a table can be specified when the table is created
or any time afterwards. Within the retention period, all Time Travel operations can be
performed on data in the table (e.g. queries) and the table itself (e.g. cloning and
restoration). The Fail-safe period is not configurable for any table type.
Because transient tables do not have a Fail-safe period, they provide a good option for
managing the cost of very large tables used to store transitory data. It is recommended
to use transient tables only for data that does not need to be protected against failures
or data that can be reconstructed outside of Snowflake.
External tables are read-only, therefore no DML operations can be performed on them;
however, external tables can be used for query and join operations. Views can be
created against external tables. Querying data stored external to the database is likely
10

to be slower than querying native database tables; however, materialized views based
on external tables can improve query performance.

Views
A view is a named definition of a query. Snowflake supports two types of views:

● Non-Materialised View: A non materialised view’s results are created by


executing the query at the time that the view is referenced in a query. The results
are not stored for future use. It is dynamic and is the default view.
● Materialised View: Results are stored as though the results were a table. It
allows faster access, but requires storage space and active maintenance, both of
which require additional costs. It is static. It is created by specifying the
materialized keyword. It can query only a single table, Joins (including self-joins)
are not supported. A materialized view cannot query a materialized view, a non
materialized view or a UDTF. A materialized view cannot include UDFs, window
functions, Having, Order By, Limit clause, Group By keys that are not within the
Select list.

Virtual Warehouses
● A warehouse is a cluster of compute resources. It provides required resources
such as CPU, memory, and temporary storage to perform operations.
● The larger the warehouse, the more resources it has.
● While running, a virtual warehouse consumes Snowflake credits.

● Auto Suspend: By default, auto-suspend is enabled. Snowflake automatically


suspends the warehouse if it is inactive for the specified period of time.
11

● Auto Resume: By default, auto-resume is enabled. Snowflake automatically


resumes the warehouse when any statement that requires a warehouse is
submitted and the warehouse is the current warehouse for the session.
● Multi-Cluster Warehouse:
○ A virtual warehouse consists of a single cluster of servers that determines
the total resources available to the warehouse for executing queries.
○ With multi-cluster warehouses, Snowflake supports allocating, either
statically or dynamically, a larger pool of resources to each warehouse.
○ A multicluster warehouse is defined by specifying: Maximum number of
server clusters (greater than 1, upto 10), Minimum number of server
clusters, equal to or less than the maximum.
○ In the maximised feature, the minimum cluster equals the maximum
cluster.
○ In the auto-scaling feature, the minimum cluster is less than the maximum
cluster.
● Scaling Up: Used when we want to run same complex queries on more servers
○ Adding more servers to the cluster
○ Improves query performance by processing more data
○ Runs more complex queries
○ It is a manual process
● Scaling Out: When we are running a lot of complex queries on a small compute
cluster, the queries get queued, we can then scale out to run concurrent queries.
○ Spin up more compute clusters
○ Supports more users
○ More queries simultaneously
● Scaling Policy: Standard policy starts another cluster as soon as queuing starts.
In Economy policy, it examines whether there is enough load to keep the
additional cluster busy for 6 minutes, then only it starts that cluster.

Access Control
● Determines who can access what and perform which operations on those
specified objects.
● Securable object: Entity to which access is granted.
● Role: Entity to which privileges are granted. Roles are then assigned to users.
● Privilege: level of access to an object.
● User - Snowflake user
● Snowflake has a combination of:
○ Discretionary Access Control (DAC) - owner has the privileges for the
objects, who can then grant access to others
○ Role-based Access Control (RBAC) - privileges are assigned to roles, and
roles are then assigned to users
● Roles:
○ Entities to which privileges on securable objects can be granted and
revoked
12

○ Assigned to users to allow them to perform actions required for business


functions in their organisation
○ A user can be assigned multiple roles
○ Types - System defined roles & Custom Roles
○ System Defined Roles:
■ AccountAdmin: manages all aspect of the account
■ SysAdmin: creates and manages databases and warehouses
■ SecurityAdmin: manages security aspects of the account
■ UserAdmin: creates and manages users and roles
■ Public: automatically available to every user in the account
○ When creating your own roles, recommendation is to create a hierarchy of
roles.
○ If the Public has certain privileges, all other roles will have them. However,
the reverse is untrue.
○ Also, if there are certain privileges with Sysadmin, it doesn’t mean
securityadmin will have the same privileges. Privileges with useradmin can
be accessed by Security Admin and account admin and not sysadmin.

● Enforcement Model:
○ Every session has a ‘current role’, regardless of access mechanism.
○ Here’s how role is determined:
■ If specified during connection
■ If not specified, default role
■ If not specified and absence of default role, then Public
○ We can override the current role using the USE ROLE <Role_Name>
command.
○ There are no Root users (with full access) in Snowflake.
13

Data Types
Snowflake supports most SQL data types:

● Numeric Data Types - Number, Decimal, Numeric, Integer, Float, Double


● String & Binary Data Types - Varchar, Char, String, Text, Binary
● Logical Data Types - Boolean
● Date & Time Data Types - Date, Datetime, Time, Timestamp
● Semi-Structured Data Types - Variant, Object, Array
● Geospatial Data Types - Geography

Structured Data Types have a predefined schema while semi-structured data types lack
a predefined structure.

Variant Data Type:

● Can store values of any other data type including array and object
● Max size is 16MB compressed
● Any value can be implicitly given to the variant type
● Variant columns in a relational table are stored as separate physical columns
● Non-native data types such as dates and timestamps are stored as strings when
loading into a variant column

Object Data Type:

● Used to represent collections of key-value pairs, where the Key is non-empty


string, where Value is a value of Variant type

Array Data Type:

● Used to represent dense or sparse arrays of arbitrary size, where index is a non-
negative integer (upto 2^31-1), and values have variant data type

Semi-Structured Functions:

● JSON & XML Parsing:


○ CHECK_JSON
○ CHECK_XML
○ PARSE_JSON
○ PARSE_XML
● Array/Object Creation and Manipulation:
○ ARRAY_CONSTRUCT
○ ARRAY_CONTAINS
○ ARRAY_SIZE
○ OBJECT_CONSTRUCTS
● Extraction:
○ FLATTEN
14

○ OBJECT_KEYS
● Conversion/Casting
● Type Predicates

Flatten:

● Flattens (explodes) compound values into multiple rows.


● Flatten is a table function that takes a Variant, Object or Array column and
produces a lateral view.
● It can be used to convert semi-structured data to a relational representation.
● Syntax:
FLATTEN( INPUT => <expr> [ , PATH => <constant_expr> ]

[ , OUTER => TRUE | FALSE ]

[ , RECURSIVE => TRUE | FALSE ]


[ , MODE => 'OBJECT' | 'ARRAY' | 'BOTH' ] )

The returned rows consist of a fixed set of columns:


SEQ | KEY | PATH | INDEX | VALUE | THIS

Data Movement
● Copy Command
● File Formats
● Stages
○ Internal
○ External
● Loading/Unloading from local file system
● Loading/Unloading from external location

Copy Command
Used to load data from staged files on internal or external locations to an existing table
or vice versa.

Two types:

● Copy into <table> - Used to load data from internal/external stage to table
● Copy into <location> - Used to unload data from table into internal/external stage

https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
15

File Format
● A named file format object provides a convenient means to store all of the
format information required for loading data files into tables.
● We can set a wide variety of parameters to set attributes such as Compression,
File delimiters, Skipping headers, Date/Time format, Encoding, Null Handling, etc
for the data stored in staged files before ingestion.
● By specifying a named file format object(or individual file format options) for the
stage, it isn’t necessary to later specify the same file format options in the COPY
command used to load data from the stage.
CREATE [ OR REPLACE ] FILE FORMAT [ IF NOT EXISTS ] <name>

TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [


formatTypeOptions ]
[ COMMENT = '<string_literal>' ]

https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html

Stages
An intermediary space where we can upload the files so that we can use the COPY
command to load or unload files.

Types:

● Internal Stage
○ User Stage
○ Table Stage
○ Named Internal stage
■ Temporary
■ Permanent
● External Stage

Internal Stage:

User Stage Table Stage Named Internal


Stage

What Each user has a Each table has a Internal stages are
stage allocated to stage allocated to it named database
them by default for by default for objects that we can
storing files. storing files. use in-place of user
and table stage.
16

They are
recommended
stages to load the
tables.

When to use Convenient if our Convenient if our The internal named


files will only be files need to be stages can be
accessible by a accessible to accessed by
single user, but multiple users only multiple users and
need to be copied and need to be can be used to load
into multiple tables. copied into a single multiple tables.
table. Table stage
name is the same
as our table name.

Supports File No No Yes


Format Option

How to reference list @~ list @%table_name list @stage_name

Loading data from Local File System


17

● For user and table stages, data can be loaded via snow cli only, but for named
stages, we can use console too, but using console we load only files upto 50 MB.

Implementation:

USE DATABASE GEM_DB;

CREATE OR REPLACE TABLE patient_internal (id INT,

pname VARCHAR,

drug VARCHAR,

gender VARCHAR,

age INT);

CREATE STAGE patient_stage File_Format = (type = 'CSV' field_delimiter = ',' skip_header = 1);

CLI - put file:///home/shivani/Documents/patient.csv @patient_stage;

copy into PATIENT_INTERNAL from @patient_stage;

select * from PATIENT_INTERNAL;

Unloading data to Local File System:


18

External Stages

● Storage locations outside the snowflake environment in another cloud storage location.
● External stages store the files in an external location that is referenced by the stage. An
external stage specifies location and credential information, if required.
● Cloud storage services supported are: Amazon S3 buckets, Google cloud storage
buckets and microsoft azure containers.
CREATE [ OR REPLACE ] [ TEMPORARY ] STAGE [ IF NOT EXISTS ]
<external_stage_name>

externalStageParams

[ FILE_FORMAT = ( { FORMAT_NAME = '<file_format_name>' | TYPE = { CSV | JSON


| AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] ) } ]

[ COPY_OPTIONS = ( copyOptions ) ]
[ COMMENT = '<string_literal>' ]

Loading data from external stage:


19

Bulk vs Continuous Loading

Bulk Loading Continuous Loading

What Enables loading batches of Designed to load small volumes


data from files already of data (i.e. micro-batches) and
available on cloud storage, or incrementally make them
staging data files from a local available for analysis. Snowpipe
machine to an internal (i.e. loads data within minutes after
Snowflake) cloud storage files are added to a stage and
location before loading the submitted for ingestion.
data into tables using COPY
command.

Compute Relies on user-provided virtual Uses compute resources


Resource warehouses, which are provided by Snowflake (i.e a
specified in the copy serverless compute model).
command.

Simple ● Column reordering The COPY statement in a pipe


Transformations ● Column omission definition supports the same
● Casts COPY transformation options as
● Truncating text strings when bulk loading data.
that exceed the target
column length.

Query Profile
It provides execution details for a query. It provides graphical representation of the
main components of the processing plan for the query, with statistics for each
component.

It has following elements:

● Steps: If the query was processed in multiple steps, you can toggle between each
step.
● Operator Tree: The middle pane displays a graphical representation of all
operator nodes for the selected step, including the relationship between each
operator node.
● Node List: The middle pane includes a collapsable list of operator nodes by
execution time.
● Overview: The right pane displays an overview of the query profile. The display
changes to operator details when an operator node is selected.
20

Caching

Types of Cache:

● Result Cache: If the user fires the same query that has already been run,
snowflake will return the same result. But the following conditions need to be
met for the reuse of result set:
○ The user must have necessary privileges on the tables that are used in the
result query.
○ The new query should match with the previous one.
○ The data has not changed
● Metadata Cache:
○ Metadata about Tables and Micro-partitions are collected and managed
by snowflake automatically.
○ Snowflake doesn’t use compute to provide range values like Min, Max,
Number of Distinct Values, NULL count, and Row count and clustering
information.
● Warehouse Cache:
○ Every warehouse has attached SSD storage.
○ This has been used to store micro-partitions that are pulled from the
storage layer. While processing the query, the SSD is scanned first and
then data is pulled from storage level.
-------------Caching---------------

USE DATABASE SNOWFLAKE_SAMPLE_DATA;

USE SCHEMA TPCH_SF10000;


21

SELECT R_NAME, N_NAME,

SUM(O_TOTALPRICE) AS TOTALPRICE,

COUNT(*) AS COUNT_ROWS

FROM ORDERS O JOIN CUSTOMER C

ON C_CUSTKEY = O_CUSTKEY

JOIN NATION N

ON N_NATIONKEY = C_NATIONKEY

JOIN REGION R

ON R.R_REGIONKEY = N_REGIONKEY

WHERE O.O_ORDERDATE = '1998-07-21'

GROUP BY R_NAME, N_NAME

ORDER BY R_NAME, N_NAME;

------------Result Cache-----------

ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-----------WAREHOUSE CACHE---------

ALTER WAREHOUSE COMPUTE_WH SUSPEND;

-----------METADATA CACHE----------

SELECT MIN(O_TOTALPRICE) FROM ORDERS;

Micro-Partitions
● Traditional data warehouses rely on static partitioning of large tables to achieve
acceptable performance and enable better scaling.
● Static partitioning has a number of well-known limitations, such as maintenance
overhead and data skew, which can result in disproportionately-sized partitions.
● In Snowflake, all data in tables is automatically divided into micro-partitions,
which are contiguous units of storage.
● Snowflake is columnar-based and horizontally partitioned, meaning a row of
data is stored in the same micro-partition.
22

● Each micro-partition contains between 50 MB and 500 MB of uncompressed


data.
● This size and structure allows for extremely granular pruning of very large tables,
which can comprise of millions, or hundreds of millions, of micro-partitions.

Benefits:

● Snowflake micro-partitions are derived automatically.


● Micro-partitions are small in size (50 to 500 MB, before compression), which
enables extremely efficient DML and fine-grained pruning for faster access.
● Micro-partitions can overlap in their range of values, which, combined with their
uniformly small size, helps prevent skew.
● Columns are stored independently within micro-partitions, often referred to as
columnar storage. This enables efficient scanning of individual columns; only the
columns referenced by the query are scanned.
● Columns are compressed individually within micro-partitions.

Clustering
● As data is loaded/inserted into a table, clustering metadata is collected for each
micropartition, created during the process. Snowflake then leverages this
clustering information to avoid unnecessary scanning of micro-partitions.
● To improve the clustering of underlying table micro-partitions, you can always
manually sort rows on key table columns and reinsert them into the table;
however, performing these tasks can be cumbersome and expensive.
23

● Instead, Snowflake supports automating these tasks, by designating one or more


table columns/expressions as a clustering key for the table. A table with a
clustering key defined is considered to be clustered.
● A clustering key is a subset of columns in a table (or expressions on a table) that
are explicitly designed to co-locate the data in the table in the same micro-
partitions.
● This is useful for very large tables where the ordering was not ideal (at the time
the data was inserted/loaded) or extensive DML has caused the table’s natural
clustering to degrade.

Benefits:
24

● Improved scan efficiency in queries by skipping data that does not match
filtering predicates.
● After a key has been defined on a table, no additional administration is required,
unless you choose to drop or modify the key. All future maintenance on the rows
in the table (to ensure optimal clustering) is done by snowflake automatically.
● Typically, queries benefit from clustering when the queries are filtered or sorted
on the clustering key for the table.

Note:

● Clustering can substantially improve the performance and reduce the cost of
some queries, but compute resources are used to perform clustering, which
consumes credits.
● Clustering keys are not intended for all the tables. The size as well as query
performance of a table should dictate whether to define a clustering key for the
table.
● In general, tables in the multi-terabyte (TB) range will experience the most benefit
from clustering, particularly if DML is performed regularly/continually on these
tables.

Pricing
Snowflake’s pricing model includes only two items: storage and compute resources.

All charges are usage-based.

The charge for storage is per terabyte, compressed, per month. For the US, snowflake’s
storage costs can begin at a flat rate of $23/TB, average compressed amount, per
month, accrued daily.

The charge for compute is based on the processing units, which we refer to as credits.
Compute charges are billed on actual usage, per second. For the US, compute costs
$0.00056 per second, per credit, for Snowflake On demand Standard Edition.

A Snowflake Credit is a unit of measure, and it is consumed only when a customer is


using resources, such as when a virtual warehouse is running, the cloud services layer is
performing work, or serverless features are used.

Pricing for Compute/Warehouse resourcing

● There is one-to-one correspondence between the number of servers in a


warehouse cluster and the number of credits billed for each full hour that the
warehouse runs.
● Warehouses are only billed for credit usage when they are running. When a
warehouse is suspended, it does not accrue any credit usage.
25

● Credits are billed per second, with a 60 second (1 min) minimum. Even if we use
it for the first 30 seconds, for the first minute, we need to pay for the full minute,
for the next minute, it is billed per second.
● Each time a warehouse is started or resized to a larger size, the warehouse is
billed for 1 minute’s worth of usage.
● After 1 minute, all subsequent billing is per-second.
● Stopping and restarting a warehouse within the first minute does not change the
amount billed; the minimum billing charge is 1 minute.

T
26

Pricing for Data Storage

Usage for data storage is calculated on the daily average amount of data (in bytes)
stored in the system for:

● Files staged in snowflake locations (i.e. user and table stages or internal named
stages) for bulk data loading/unloading (can be stored compressed or
uncompressed).
● Database tables including historical data or Time travel (always compressed by
Snowflake)
● Fail safe for database tables (always compresses by Snowflake)
● Clones of database tables that reference data deleted in the table that owns the
clones.

Pricing for Cloud Service Usage

Usage for cloud services is charged only if the daily consumption of cloud services
exceeds 10% of the daily compute usage.

If total compute credits consumed in a day is 100 and total cloud service credits used is
12, then total credits billed is 100 + 2 = 102.

Time Travel & Fail Safe


Snowflake Time Travel enables accessing historical data (i.e data that has been
changed or deleted) at any point within a defined period.
27

Using Time Travel, you can perform the following actions within a defined period of
time:

● Query data in the past that has since been updated or deleted.
● Create clones of entire tables, schemas, and databases at or before specific
points in the past.
● Restore tables, schemas, databases that have been dropped.

Once the time period has elapsed, the data is moved to Snowflake Fail Safe and these
actions can no longer be performed.
Time Travel SQL Extensions:
To support Time Travel, the following SQL extensions have been implemented:

● AT | BEFORE clause which can be specified in SELECT statements and CREATE …


CLONE commands (immediately after the object name).
● The clause uses one of the following parameters to pinpoint the exact historical
data you wish to access:
● TIMESTAMP
● OFFSET (time difference in seconds from the present time)
● STATEMENT (identifier for statement, e.g. query ID)
28

Fail Safe:

● A means to recover your lost data in the event of a system failure or other
catastrophic event, e.g. a hardware failure or security breach
● Done by Snowflake (not self-service)
● Not to be used as a mode to access historical data
● Recovery may take several hours or few days
● Fail safe Period starts immediately after the time travel retention period ends.

Users with the ACCOUNT ADMIN role can set DATA_RETENTION_TIME_IN_DAYS to 0 at


the account level, which means that all databases (and subsequently all schemas and
tables) created in the account have no retention period by default; however, this default
can be overridden at any time for any database, schema, or table.
29

Currently, when a database is dropped, the data retention period for child schemas or
tables, if explicitly set to be different from the retention of the database, is not honored.
The child schemas or tables are retained for the same period of time as the database.

To honor the data retention period for these child objects (schemas or tables), drop
them explicitly before you drop the database or schema.

If an object with the same name already exists, UNDROP fails. You must rename the
existing object, which then enables you to restore the previous version of the object.

Data Share & Marketplace


Traditional Data Share:

Provider

● Extract the data from Source System


● Encrypt the data
● Share the data to respective data consumers using FTP/API/Cloud Storage

Consumer

● Handling increase data size


● Decrypt sensitive data
● Changing file formats and schema
● Cleaning data

Snowflake:

Provider

● Share live data with ease


● No duplicate data copies
● Eliminate data movement
● Provide controlled, customized views

Consumer:

● Always up to data
● No ETL
● No storage
● Pay only for what you use

Share:
30

Share the data from one account and use the shared data from another account.

No actual data is copied or transferred between accounts.

All sharing is accomplished through snowflake’s unique services layer and metadata
store.

Shared data does not take up any storage in a consumer account and thus it does not
contribute to the consumer’s monthly data storage charges.

Shares are named Snowflake objects that encapsulate all of the information required to
share a database. Each share consists of:

● The privileges that grant access to the databases, and the schema containing the
objects to share.
● The privileges that grant access to the specific objects in the database.
● The consumer accounts with which the database and its objects are shared.

Data Providers:

● Any snowflake account that creates data shares and makes them available to
other snowflake accounts to consume. Providers can share with one or more
snowflake accounts.
● For each database you share, Snowflake supports using grants to provide
granular access control to selected objects in the database.
● Snowflake does not place any hard limits on the number of shares you can create
or the number of accounts you can add to a share.

Data Consumers:

● Any account that chooses to create a database from a share made available by a
data provider.
● Once customers add a shared database to your account, you can access and
query the objects in the database just as you would with any other database in
your account.
● Snowflake does not place any hard limits on the number of shares you can
consume from data providers; however, you can only create one database per
share.

Reader Accounts:

Data sharing is only supported between snowflake accounts. If the consumer doesn’t
have a Snowflake account and/or is not ready to become a licensed Snowflake
consumer, to facilitate data sharing with these consumers, Snowflake supports
providers by creating reader accounts.
31

Similar to standard snowflake accounts, the provider account uses shares to share
databases with reader accounts; a reader account can only consume data from the
provider account that created it.

● Data sharing is only supported between accounts in the same snowflake region.
● Snowflake allows sharing the data between different regions when the data is
replicated in consumers' regions.
● Shared databases are read-only
● Following actions are not supported:
○ Creating a clone of a shared database or any schemas/tables in the
database.
○ Time travel for a shared database or any schemas/tables in the database.
○ Editing the comments for a shared database.
● Shared database and all the objects in the database cannot be re-shared with
other accounts.

Marketplace

Utilizes snowflake secure data sharing to connect providers of data to customers.

You can discover and access a variety of third-party data and have those datasets
available directly in your snowflake account to query without transformation and join it
with your own data. If you need to use several different vendors for data sourcing, the
Data Marketplace gives you one single location from where to get the data.

Implementation:
Creating internal stages and loading data:
USE DATABASE GEM_DB;

CREATE OR REPLACE TABLE patient_internal (id INT,

pname VARCHAR,

drug VARCHAR,

gender VARCHAR,

age INT);

CREATE STAGE patient_stage File_Format = (type = 'CSV' field_delimiter = ',' skip_header = 1);

(Using CLI) put file:///home/shivani/Documents/patient.csv @patient_stage;


32

copy into PATIENT_INTERNAL from @patient_stage;

select * from PATIENT_INTERNAL;

Creating tables and stages on Amazon S3 Bucket:


Creating File Formats:

create or replace file format my_csv_format

type = csv

record_delimiter = '\n'

field_delimiter = ','

skip_header = 1

null_if = ('NULL', 'null')

empty_field_as_null = true
FIELD_OPTIONALLY_ENCLOSED_BY = '0x22' --- for rows with commas in
values

create or replace table Product (id INT,

productName varchar,

customerName Varchar,

param1 int,

param2 int,

productType varchar);

create or replace stage snowstage

url = 's3://mysnowbucket16/product.csv'

credentials=(

AWS_KEY_ID='********',
AWS_SECRET_KEY='********');

Copy into Product from @snowstage file_format='my_csv_format';

select * from PRODUCT;


33

Loading JSON file into internal table from AWS S3:


CREATE OR REPLACE TABLE Quiz_JSON (js variant);

Copy into QUIZ_JSON From @snowstagejson File_format=(Type='JSON'


STRIP_OUTER_ARRAY = TRUE);

Select js:quiz:sport from QUIZ_JSON;

Creating Warehouses using SQL command:


CREATE OR REPLACE warehouse small_wh with

warehouse_size = "X-SMALL"

auto_suspend = 180

auto_resume = true

initially_suspended = true;

USE WAREHOUSE small_wh;

Creating Non Materialized Views:


SELECT productname, customername FROM GEM_DB.PUBLIC.PRODUCT;

Create View VW_Prod as

SELECT productname, customername FROM GEM_DB.PUBLIC.PRODUCT;

Select * From VW_Prod Where customername = 'Barry French';

---- With non-materialized view, running the select statements will require the lot of time
again and again, since it does not store any cache as output -----

Creating Materialized Views:


Create Materialized View MV_Prod as

SELECT productname, customername FROM GEM_DB.PUBLIC.PRODUCT;

Select * From MV_Prod Where customername = 'Barry French';


34

-------- Materialized views store caches and so, it takes less time to run queries when they
are re-run. --------

Creating UDFs using SQL:


CREATE OR REPLACE FUNCTION mul (v1 FLOAT, v2 FLOAT)

RETURNS FLOAT

AS '

CASE

WHEN v2 || v1 IS NULL THEN 0

ELSE

v1 * v2

END

';

Select mul(7, 8);

Creating Javascript UDFs:


CREATE OR REPLACE FUNCTION validateID(ID FLOAT)

RETURNS VARCHAR

LANGUAGE Javascript

AS $$

try {

if (ID<0) {

throw "ID cannot be negative";

else {

return "Validated ID";

}
35

catch (err){

return "Error: " + err;

$$;

Create Table Student(id int);

Insert into Student(id) Values (44), (-44);

Select id, validateID(id) From Student Order By id;

Flatten Implementation
select * from table(flatten(input => parse_json('[1, ,77]'))) f;

------Effect of Path------

select * from table(flatten(input => parse_json('{"a":1, "b":


[77,88]}'), outer => true)) f;

select * from table(flatten(input => parse_json('{"a":1, "b":


[77,88]}'), path => 'b')) f;

------Effect of Outer------

select * from table(flatten(input => parse_json('[]'))) f;

select * from table(flatten(input => parse_json('[]'), outer => true))


f;

------Effect of Recursive------

select * from table(flatten(input => parse_json('{"a":1, "b":[77,88],


"c": {"d":"X"}}'))) f;

select * from table(flatten(input => parse_json('{"a":1, "b":[77,88],


"c": {"d":"X"}}'), recursive => true )) f;
36

------Effect of Mode------

select * from table(flatten(input => parse_json('{"a":1, "b":[77,88],


"c": {"d":"X"}}'), recursive => true, mode => 'object' )) f;

------Practical Example that explodes an array nested within another


array-------

create or replace table persons as

select column1 as id, parse_json(column2) as c

from values

(12712555,

'{ name: { first: "John", last: "Smith"},

contact: [

{ business:[

{ type: "phone", content:"555-1234" },

{ type: "email", content:"j.smith@company.com" } ] } ] }'),

(98127771,

'{ name: { first: "Jane", last: "Doe"},

contact: [

{ business:[

{ type: "phone", content:"555-1236" },

{ type: "email", content:"j.doe@company.com" } ] } ] }') v;

Select * from Persons;

-- Note the multiple instances of LATERAL FLATTEN in the FROM clause


of the following query.
37

-- Each LATERAL view is based on the previous one to refer to


elements in

-- multiple levels of arrays.

select id as "ID",

f.value as "Contact",

f1.value:type as "Type",

f1.value:content as "Details"

from persons p,

lateral flatten(input => p.c, path => 'contact') f,

lateral flatten(input => f.value:business) f1;

select distinct id as "ID",

f.this:first as "First",

f.this:last as "Last"

from persons p,

lateral flatten(input => p.c, path => 'name') f;

Time Travel
create database new_db;

create schema new_sc;

create table user_tt(id number(10), name varchar(20));

insert into user_tt values(1, 'user1'), (2, 'user2'), (3, 'user3');

-----current data------

select * from user_tt;


38

-------updated the data------

update user_tt set name = 'new_user1' where id = 1;

-----current data------

select * from user_tt;

-------old data 2 min before--------

select * from user_tt before(offset => -60*2);

select * from user_tt before(timestamp => 'Tue, 17 Aug 2021 11:35:00


+0530'::timestamp_tz);

select * from user_tt at(statement => '019e52a5-0000-16fe-0001-


6e5600019a7e');

------cloning table------

create table new_user clone user_tt

before(offset => -60*2);

select * from new_user;

------cloning schema------

create schema new_schema clone new_sc

before(offset => -60*5);

------cloning db--------

create database new_db_clone clone new_db


39

before(offset => -60*5);

truncate table user_tt;

drop table new_user;

undrop table new_user;

drop schema new_schema;

undrop schema new_schema;

select * from new_schema.user_tt;

drop database new_db_clone;

undrop database new_db_clone;

select * from new_db_clone.new_sc.user_tt;


40

You might also like