Snowflake
Each micro-partition generally contains 50 MB to 500 MB of uncompressed data. However, the stored size is smaller as Snowflake data is always stored with compression. Within each micro-partition, data is stored in a columnar format.
Types of Stages:
So in this lecture, I would like to go through a high-level view of Snowflake stages, what their types are, and how they are used.
Before I go into those details, I would like to familiarize you with this diagram here.
So, in the center of the screen, we have the Snowflake instance.
So this is basically the trial account that you would have created.
On the bottom, I have represented the on-premises systems. Often, when you have a cloud-based data warehouse, there is a lot of data coming from on-premises systems as well, so I have represented those here.
Additionally, there could be data coming from your cloud storage: S3 buckets and Azure Blob Storage, as well as Google Cloud Storage.
What I am going to do now is overlay the types of stages in Snowflake on this diagram, and hopefully you will be able to piece together where each type of stage is used.
Snowflake stages can be classified into two high-level types: external stages and internal stages.
An external stage refers to a storage location that is outside of Snowflake. External stages are commonly used to refer to storage locations on cloud platforms such as an AWS S3 bucket, Azure Blob Storage, or Google Cloud Storage.
Once an external stage has been defined, the COPY command can then be used to load data from that external stage into a Snowflake table.
Now, it is worth noting that an external stage is only a reference to the cloud storage. The stage itself does not store anything; it merely points to the cloud storage location.
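As a rough sketch (the stage name, bucket URL, and credentials below are placeholders, not from the lecture), defining and using an external stage looks roughly like this:
CREATE STAGE my_s3_stage                 -- named external stage
  URL = 's3://my-bucket/data/'           -- hypothetical bucket path
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
COPY INTO my_table                       -- load from the external stage into a table
  FROM @my_s3_stage;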
Next, we have the internal stages. Unlike external stages, internal stages store the data themselves.
There are three types of internal stages. First, there are named internal stages, which are stage objects that can be created, dropped, and modified as required. Once data has been uploaded into an internal stage, it can then be loaded into a table using the COPY command.
Another type of internal stage is the table stage, which is automatically created for each table and can be used to load data into that table.
And finally we have the user stage.
So each user gets a personal stage as well, which is created as soon as the user is created.
User stages are also a subtype of an internal stage.
Now, regardless of which internal stage you use, you must first upload the data into that stage before
running the copy command.
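A minimal sketch of that flow (the stage, table, and file paths are illustrative; PUT is run from a client such as SnowSQL):
CREATE STAGE my_int_stage;                      -- named internal stage
PUT file:///tmp/airports.csv @my_int_stage;     -- upload the local file into the stage
COPY INTO airports FROM @my_int_stage;          -- then load it into the table
-- Table stages and user stages are referenced with special prefixes:
PUT file:///tmp/airports.csv @%airports;        -- @%<table> is the table stage
PUT file:///tmp/airports.csv @~;                -- @~ is the current user's stage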
Now, in the coming lectures, we will look into the details of each type of stage and load data using each type of stage object.
LIST @LU_Airport_CSV_Stage;
Data ingestion is the overall process of bringing data from different sources into Snowflake (or any
system).
It covers the end-to-end pipeline: extracting data, moving it, staging it, and then loading it.
Sources can be files (CSV, JSON, Parquet), databases, APIs, streaming data, etc.
👉 Think of ingestion as “getting data into Snowflake, regardless of how or from where.”
Is data ingestion the same as data loading? Not exactly:
Data Loading → A step inside ingestion, where staged data is actually written into Snowflake tables (using COPY INTO). For example (with an illustrative target table name):
COPY INTO orders FROM @my_stage/orders.csv;
✅ In short:
Data loading = one step in ingestion (moving staged data into a table).
Q1: Is the data in a Snowflake internal stage stored in an encrypted format?
✅ Yes.
Snowflake uses AES-256 strong encryption for files stored in internal stages.
The encryption is fully managed by Snowflake (you don’t need to configure keys manually unless
you want customer-managed keys with Tri-Secret Secure).
Q2: Is the data encrypted prior to being transferred to a Snowflake internal stage?
✅ Yes.
When you run the PUT command from SnowSQL, the data is encrypted in transit using TLS
(Transport Layer Security) before it reaches Snowflake.
Once the file arrives in the stage, it is automatically compressed (gzip) and encrypted at rest.
Yes. All data in Snowflake is stored in an encrypted manner. So, data in tables and internal stages is
encrypted automatically.
Yes. SnowSQL encrypts data before performing the PUT operation, so the data is encrypted well before it is
uploaded.
Snowflake Objects
Account
User
Role
Virtual Warehouse
Resource Monitor
Integration
Database
Schema
Table
View
Stored Procedure
Stage
File Format
Pipe
Sequence
View Databases
To view all databases, a user can use the following SQL. It returns the details of all databases.
SHOW DATABASES
To clone a database, a user can use the following SQL; here TEST_DB_3 is the name of the new database, while DEMO_DB is the database being cloned −
CREATE DATABASE TEST_DB_3 CLONE "DEMO_DB"
Edit/Modify/Alter Warehouse
To alter/modify the warehouse, use the following query and run it −
ALTER WAREHOUSE "TEST_WH"
SET WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 1200
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 1
SCALING_POLICY = 'STANDARD'
COMMENT = ' '
View Warehouses
To view all listed warehouses, a user can use the following SQL. It returns the details of all listed warehouses.
SHOW WAREHOUSES
A schema is a grouping of database objects like tables, views, etc. Each schema belongs to a single database.
Database.Schema is the namespace in Snowflake. While performing any operation, the namespace must be provided, either directly in the query or by setting it up in Snowflake's UI.
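For instance (illustrative names), the namespace can be supplied either way:
USE DATABASE demo_db;                       -- set the database context
USE SCHEMA demo_db.public;                  -- set the schema context
SELECT * FROM customers;                    -- resolved as demo_db.public.customers
SELECT * FROM demo_db.public.customers;     -- or fully qualify the name in the query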
Table Types:-
Snowflake categorizes tables into different types based on their use and nature. There are four types of tables −
Permanent Table
Permanent tables are created in the database.
These tables are designed to store the data that requires highest level of data protection and
recovery.
Time travel is possible in these tables for up to 90 days, i.e., someone can retrieve the data as it was up to 90 days back.
Temporary Table
Temporary tables, as the name indicates, exist for a shorter duration.
If a user needs a temporary table for their subsequent queries and analytics within a session, the table is automatically dropped once the session ends.
Transient Table
These tables persist until the users drop or delete them.
They are used where data persistence is required but long-term data retention is not needed. For example, a website might keep the details of guest visitors and the details of visitors who also registered in two different tables; once a visitor registers, storing the details in two different tables might not be required.
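A quick sketch of creating each of these types (table and column names are illustrative):
CREATE TABLE orders (id NUMBER, amount NUMBER);                 -- permanent (default)
CREATE TEMPORARY TABLE orders_tmp (id NUMBER, amount NUMBER);   -- dropped when the session ends
CREATE TRANSIENT TABLE guest_visits (id NUMBER, page STRING);   -- persists, but without fail-safe and with limited time travel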
External Table
Here, the word "removed" is used because external tables reference data that lives outside of Snowflake, so they cannot be dropped or deleted in the usual sense; they are removed.
It can be visualized as Snowflake sitting over an external data lake, i.e., the data lake remains the main source and Snowflake points to it so the data can be used as per the user's need.
The data cannot be accessed directly. It can be accessed in Snowflake via an external stage.
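A hedged sketch of an external table definition (the stage path, file format, and column expression are illustrative and will vary by setup):
CREATE EXTERNAL TABLE ext_orders (
  order_id NUMBER AS (VALUE:c1::NUMBER)    -- columns are expressions over the VALUE variant
)
WITH LOCATION = @my_ext_stage/orders/      -- external stage pointing at the data lake
FILE_FORMAT = (TYPE = CSV)
AUTO_REFRESH = FALSE;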
View Types
There are three main categorized views in Snowflake −
Standard View
It is the default view type.
The underlying DDL is available to any role that has access to the view.
Secure View
Secure View means it can be accessed only by authorized users.
Authorized users with the proper role can access these views and execute queries against them.
For secure views, the Snowflake query optimizer bypasses optimizations used for regular views.
Materialized View
A materialized view is more like a table.
These views store the result from the main source using filter conditions. For example, a company has records of all employees who are active, inactive, or deceased since the company was founded. Now, if a user needs the details of active employees only, the main table can be queried and the result stored as a materialized view for further analytics.
A materialized view is auto-refreshed, i.e., whenever the main table gets additional/new employee records, the materialized view is refreshed as well.
Total costs for materialized views are based on "data storage + compute + serverless services."
Compute charges per materialized view are calculated based on the volume of data changes.
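Roughly, creating each kind looks like this (names and the filter are illustrative):
CREATE VIEW active_emps_v AS
  SELECT * FROM employees WHERE status = 'ACTIVE';        -- standard view
CREATE SECURE VIEW active_emps_sv AS
  SELECT * FROM employees WHERE status = 'ACTIVE';        -- accessible only to authorized roles; definition hidden
CREATE MATERIALIZED VIEW active_emps_mv AS
  SELECT * FROM employees WHERE status = 'ACTIVE';        -- result is stored and auto-refreshed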
SHOW TABLES
Key Concepts
Storage Layer: This is where all data is stored in a compressed, columnar format, optimized for
cloud storage.
Compute Layer (Virtual Warehouses): These are clusters of computing resources that perform all
data processing tasks. They can be resized based on workload requirements.
Cloud Services Layer: Manages metadata, query optimization, security, collecting and maintaining
statistics, and other services that ensure smooth operation.
Snowgrid: A cross-cloud system that connects regions and platforms (AWS, Azure, GCP) for
seamless collaboration, governance, and data replication globally.
Snowflake Cortex AI: Provides instant access to advanced AI tools, including large language models,
directly within the Snowflake AI Data Cloud.
Explore Snowflake's Storage Layer
The Storage Layer in Snowflake handles all data storage tasks, ensuring that data is efficiently organized,
compressed, and encrypted. This layer supports structured, semi-structured, and unstructured data,
providing flexibility for diverse data types.
🔑 Key Insights:
Data Compression and Storage Formats: Snowflake uses advanced compression algorithms to
reduce storage costs and improve performance. Data is stored in a columnar format, which
optimizes query speed and storage efficiency.
Data Encryption: All data in Snowflake is encrypted both at rest and in transit, ensuring high levels
of security.
A financial services (FinServ) company needs to store large volumes of transactional data securely. By using
Snowflake’s Storage Layer, they can efficiently compress and encrypt the data, ensuring both cost-
effectiveness and security.
Best Practice: Always compress data before storage to optimize costs and enhance performance.
🧠 Questions to Consider:
1. How does Snowflake’s use of data compression benefit data storage and cost management?
2. Why is data encryption important, and how does Snowflake implement it?
The Compute Layer is where all data processing in Snowflake occurs. This layer consists of virtual
warehouses—independent clusters of compute resources that can be scaled up or down based on the
workload.
🔑 Key Insights:
Scaling and Managing Virtual Warehouses: Virtual warehouses can be resized, suspended, or
resumed to optimize cost and performance.
Workload Isolation: Each virtual warehouse operates independently, ensuring that the performance
of one does not affect another. This isolation is crucial for maintaining consistent performance
across different workloads.
Best Practice: Regularly monitor warehouse usage and adjust configurations based on actual
workload requirements.
🧠 Questions to Consider:
1. What are the cost implications of running multiple virtual warehouses simultaneously?
The Cloud Services Layer coordinates all Snowflake operations, handling activities such as query
optimization, metadata management, and security enforcement. This layer ensures that the platform
operates efficiently and securely.
🔑 Key Insights:
Query Optimization: Snowflake’s cloud services optimize queries to ensure fast and efficient data
retrieval.
Security Management: This layer manages all security protocols, including user authentication and
access control, to protect data integrity and privacy.
A data science team needs to run complex queries involving multiple datasets. Snowflake’s query
optimization feature ensures these queries are executed quickly, saving time and computing resources.
🧠 Questions to Consider:
1. How does the Cloud Services Layer contribute to overall performance in Snowflake?
2. What are some of the security management functions performed by the Cloud Services Layer?
Database Storage
Query Processing
Cloud Services
Database Storage
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized,
compressed, columnar format. Snowflake stores this optimized data in cloud storage.
Snowflake manages all aspects of how this data is stored — the organization, file size, structure,
compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data
objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible
through SQL query operations run using Snowflake.
Query Processing
Query execution is performed in the processing layer. Snowflake processes queries using “virtual
warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes
allocated by Snowflake from a cloud provider.
Each virtual warehouse is an independent compute cluster that does not share compute resources with
other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other
virtual warehouses.
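As an illustration of that isolation (warehouse names and sizes are hypothetical), separate workloads are usually given separate warehouses:
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;
-- Heavy ETL running on etl_wh does not slow down dashboards running on bi_wh.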
Cloud Services
The cloud services layer is a collection of services that coordinate activities across Snowflake. These
services tie together all of the different components of Snowflake in order to process user requests, from
login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake
from the cloud provider.
Services managed in this layer include:
Authentication
Infrastructure management
Metadata management
Access control
Connecting to Snowflake
A web-based user interface from which all aspects of managing and using Snowflake can be
accessed.
Command line clients (e.g. SnowSQL) which can also access all aspects of managing and using
Snowflake.
ODBC and JDBC drivers that can be used by other applications (e.g. Tableau) to connect to
Snowflake.
Native connectors (e.g. Python, Spark) that can be used to develop applications for connecting to
Snowflake.
Third-party connectors that can be used to connect applications such as ETL tools (e.g. Informatica)
and BI tools (e.g. ThoughtSpot) to Snowflake.
VALIDATION_MODE in the Snowflake COPY INTO statement is super useful when you want to test your
data load before actually inserting anything into the target table.
It tells Snowflake to validate the input files during a COPY INTO operation instead of (or in addition
to) loading the data.
This helps you catch errors before polluting your table with bad rows.
🔹 Syntax
COPY INTO <target_table>
FROM @<stage>
VALIDATION_MODE = RETURN_<mode>;
🔹 Modes of VALIDATION_MODE
1. RETURN_ERRORS
o Returns the errors for rows that failed during the load (e.g., type mismatch, missing fields) across all files in the COPY.
o Example:
o COPY INTO my_table
o FROM @my_stage
o FILE_FORMAT = (FORMAT_NAME = my_csv_format)
o VALIDATION_MODE = RETURN_ERRORS;
2. RETURN_ALL_ERRORS
o Returns all errors across all files, including errors from files that were partially loaded during an earlier load with ON_ERROR = CONTINUE.
3. RETURN_n_ROWS (e.g., RETURN_5_ROWS)
o Validates the specified number of rows and returns them if no errors are found; otherwise it fails at the first error encountered.
o Example:
o COPY INTO my_table
o FROM @my_stage
o FILE_FORMAT = (FORMAT_NAME = my_csv_format)
o VALIDATION_MODE = RETURN_5_ROWS;
🔹 Example
COPY INTO my_table
FROM @my_stage
VALIDATION_MODE = RETURN_ERRORS;
✅ In short: VALIDATION_MODE checks the staged files and returns errors or sample rows; it does not load any data into the target table.
What happens when you run a COPY statement on a file that has already been loaded into the table?
A. It would skip loading the file. This is the right answer; by default files are not re-loaded. This is done to prevent duplicates: the load metadata is stored with the table, so COPY knows that the file has already been loaded. Truncate the table and you can reload the file again.
How can you load duplicate data into a table with the COPY statement?
A. FORCE = TRUE
What is the default behavior when COPY encounters bad rows in a file?
A. Abort the statement and roll back all the rows. This is the right answer; by default Snowflake aborts loading if any of the files it is loading contains bad rows, and the entire load is rolled back.
You are expecting some oversized text data to appear in your data files and want to truncate the column data and load it. How will you do that?
A. TRUNCATECOLUMNS = TRUE. This is the right answer; when TRUNCATECOLUMNS = TRUE, any data longer than the column length will be truncated and loaded into the target table.
You are loading 500 small files using a single COPY statement from Snowsight and want to see only the names of partially loaded files in the output. How can you do that?
A. RETURN_FAILED_ONLY = TRUE. This is the right answer; when RETURN_FAILED_ONLY is set to TRUE, only files that have rejected rows are listed by name in the output. Files with no rejected rows are not listed; instead, the count of successfully loaded files is shown along with the number of rows loaded.
What happens to the target table when you run a COPY statement with VALIDATION_MODE?
A. No rows are inserted into the target table. This is the right answer; VALIDATION_MODE displays rows or errors, it does not insert any data into the target table.
You have 900 files of the exact same structure, each of 1 GB, that need to be loaded into a single target table. You want to limit the number of files processed by a single COPY statement to 1 every time it is run. How can you do that?
A. SIZE_LIMIT. This is the right answer; SIZE_LIMIT can be used to limit the number of files that are picked up by the COPY statement. Once the number of bytes provided in SIZE_LIMIT is exceeded, COPY stops picking up new files for loading and exits. The COPY can then be re-run to load the next set of files. To ensure that just one file is picked, set SIZE_LIMIT = 0.
You create a COPY statement with COPY option SIZE_LIMIT and are trying to determine what would be
the best value to provide.
In Snowflake, once a column is created with a VARCHAR type, you cannot reduce its maximum length (e.g., from VARCHAR(16777216), which is the STRING data type, to VARCHAR(20)). That’s why you’re seeing this error.
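A small sketch of what works and what fails (table and column names are hypothetical):
ALTER TABLE customers ALTER COLUMN name SET DATA TYPE VARCHAR(500);   -- increasing the length is allowed
ALTER TABLE customers ALTER COLUMN name SET DATA TYPE VARCHAR(20);    -- decreasing it fails with this error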
To check the load history of a table, you can query the COPY_HISTORY table function (the table name and time window below are placeholders):
SELECT *
FROM TABLE(
  INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => '<table_name>',
    START_TIME => DATEADD(hours, -24, CURRENT_TIMESTAMP())
  )
);
Good question 👍
In Snowflake:
VARCHAR or STRING (without a number) defaults to the maximum allowed length: 16,777,216
characters.
🔑 Differences:
VARCHAR(20) stores up to 20 characters only. If you try to insert a longer value, Snowflake raises an error (it does not silently truncate; truncation during loading only happens when TRUNCATECOLUMNS = TRUE is used with COPY).
⚡ Storage:
It doesn’t allocate the full VARCHAR(20) width for every row — it only stores what’s needed for each
value, plus metadata.
So storage size is the same whether you use VARCHAR(20) or STRING if your data values are the
same.
o STRING → up to 16 MB.
✅ So to answer you:
Yes — the same values will consume the same storage, whether the column is VARCHAR(20) or STRING.
The only difference is the maximum length allowed.
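If you want to check the maximum length actually used in a column (illustrative table and column names):
SELECT MAX(LENGTH(name)) AS max_len_used
FROM customers;    -- if this is well under 20, VARCHAR(20) is comfortably enough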
NUMBER(p,s)
INT, INTEGER, BIGINT, SMALLINT are all just aliases for NUMBER(38,0) (max 38 digits, no decimals).
✅ Example
INT → actually NUMBER(38,0) → allows up to 38 digits (e.g., way bigger than typical SQL int).
⚡ Storage behavior
It stores numbers in a compressed, variable-length format, based on actual size of the value.
So inserting 123 into NUMBER(10) and into INT → will use the same storage.
🔑 Short answer:
No performance or storage benefit when you use NUMBER(10) instead of INT in Snowflake.
Why?
o Storing 123 takes the same space whether the column is NUMBER(10) or INT.
Even though there’s no performance/storage gain, it’s still useful for data integrity:
If a column should only have up to 10 digits (e.g., phone number, employee_id, account_id),
defining NUMBER(10) enforces that.
Makes your schema self-documenting → other developers know the expected range.
⚡ Example (with hypothetical names)
CREATE TABLE t1 (id NUMBER(10));
If someone tries:
INSERT INTO t1 VALUES (12345678901);   -- 11 digits: this fails because it exceeds the 10 digits allowed by NUMBER(10)
✅ So:
No performance/storage benefit.
Which data format is used for storing and transmitting data between applications? JSON.
FORCE=TRUE
PURGE = TRUE
ON_ERROR = SKIP_FILE
ON_ERROR=ABORT_STATEMENT
SIZE_LIMIT = 2000
RETURN_FAILED_ONLY = TRUE
VALIDATION_MODE = 'RETURN_ERRORS'
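As a rough illustration of how these options appear in practice (table, stage, and values are placeholders; VALIDATION_MODE is shown separately because it validates without loading):
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = CSV)
ON_ERROR = SKIP_FILE          -- skip a file when it contains bad rows
FORCE = TRUE                  -- reload files even if they were loaded before
PURGE = TRUE                  -- delete staged files after a successful load
SIZE_LIMIT = 2000             -- stop picking up new files once this many bytes have been loaded
RETURN_FAILED_ONLY = TRUE;    -- list only files that had rejected rows
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = CSV)
VALIDATION_MODE = RETURN_ERRORS;   -- validate only; nothing is inserted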
You can define a warehouse for a task. If you don't mention one, the task runs on Snowflake-managed (serverless) compute, and the default initial warehouse size for such tasks is MEDIUM.
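A hedged sketch of both forms (names, schedule, and the statement are illustrative):
CREATE TASK nightly_load_task
  WAREHOUSE = etl_wh                                   -- user-managed warehouse
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO orders_summary SELECT * FROM orders_staging;
CREATE TASK serverless_task
  SCHEDULE = '60 MINUTE'                               -- no warehouse: serverless compute
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- override the MEDIUM default
AS
  INSERT INTO orders_summary SELECT * FROM orders_staging;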
A masking policy body is typically a CASE expression like this (the policy name and role here are illustrative):
CREATE MASKING POLICY phone_mask AS (val STRING) RETURNS STRING ->
CASE
  WHEN CURRENT_ROLE() = 'HR_ADMIN' THEN val
  ELSE '#######################'
END;
The data type of the input value (val), the data type of the masked column (the phone number), and the RETURNS data type should all match.
Once you set a masking policy on a column, if you want to drop the masking policy or change its conditions (like adding more roles to the masking policy), you must first unset the masking policy from the column and only then make the change.
For example (illustrative table and column names, continuing the policy above):
ALTER TABLE employees MODIFY COLUMN phone UNSET MASKING POLICY;    -- unset first
CREATE OR REPLACE MASKING POLICY phone_mask AS (val STRING) RETURNS STRING ->
CASE
  WHEN CURRENT_ROLE() IN ('HR_ADMIN', 'PAYROLL') THEN val          -- an extra role added
  ELSE '#######################'
END;
Note: you cannot apply two different masking policies to a single column.
If a masking policy is set on a table and you try to copy or clone the table, the data in the new table will contain the masked values.
The copy fetches the masked data only, not the original data.
Q. What if we create a new table from existing data in another table that has a masking policy? Will the policy also be set on the new table?
A. No, that masking policy won't be set on the new table, but the masked values will be directly inserted into the new table.
Dropping a table doesn't automatically drop the masking policies. Even if you drop the table, the masking policy associated with it will still exist.
The Masking policy existence is completely different from the existence of a table that has a masking policy
on it. A given policy can be used on multiple tables or multiple columns in the same table and hence
dropping the table does not cause the masking policy to be dropped. If you need to drop the masking
policy, all use of the policy must be stopped/removed and then the policy can be dropped.
When data returned from a SELECT is masked, any table created with either CLONE or Create Table As
SELECT will have the data as masked. This data in the newly created table is permanently masked like static
masking and cannot be reversed. Please use a Role that has access to the unmasked data if you want to
create a table that has no masked data
FALSE
Masking policies can be used on multiple columns in the same or different tables. What is important is to make sure that the RETURNS data type and the data type of the column being masked are exactly the same.
Which view will you use to locate the tables and columns that are being masked by a Policy?
INFORMATION_SCHEMA.POLICY_REFERENCES
INFORMATION_SCHEMA.POLICY_REFERENCES can be used to locate the tables and columns that are actively using the masking policy. You can run SHOW MASKING POLICIES to see all the masking policies available and then use INFORMATION_SCHEMA.POLICY_REFERENCES to get the table names and column names that are masked.
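For instance (the policy name is a placeholder), the table function form looks like this:
SELECT *
FROM TABLE(
  INFORMATION_SCHEMA.POLICY_REFERENCES(POLICY_NAME => 'phone_mask')
);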
You cannot add multiple databases to a single share object, but you can add multiple schemas, tables, and views from a single database.
Example situation: suppose I have created a share for database HRMS with schema HR and all its tables, and after creating the share I assigned it to a consumer account. If I then create any other tables in the same schema or database, the newly created tables will not show up in the share, for security reasons. If we want to add the newly created table, we must grant the privilege on that table to the share.
If any changes or updates are made to a table that has been added to the share object, the changes will also be reflected in the consumer account.
You cannot add or grant privileges on a normal view (one that is not a secure view) to a share. To do so, you have to alter the share object's attribute SECURE_OBJECTS_ONLY = FALSE; it will then allow non-secure views to be added to the share as well.
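A rough end-to-end sketch using the HRMS example above (the consumer account identifier is a placeholder):
CREATE SHARE hrms_share;
GRANT USAGE ON DATABASE hrms TO SHARE hrms_share;
GRANT USAGE ON SCHEMA hrms.hr TO SHARE hrms_share;
GRANT SELECT ON ALL TABLES IN SCHEMA hrms.hr TO SHARE hrms_share;
GRANT SELECT ON TABLE hrms.hr.new_table TO SHARE hrms_share;    -- a table created after the share must be granted explicitly
ALTER SHARE hrms_share SET SECURE_OBJECTS_ONLY = FALSE;         -- allow non-secure views to be added
ALTER SHARE hrms_share ADD ACCOUNTS = <consumer_account>;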
And also the underlying tables are not from the same database.
I will switch back to the producer account and use the account admin role for the next step.
I will follow this up by creating a table called Daily Sales, which has three columns employee ID,
I will join the two tables and select data from them to make sure that the tables are accessible.
Now, based on the same join condition, I am going to create a secure view.
This view is different from a previous view because it uses two tables and they are both from different
databases.
The grant failed because I am referencing the SALES_DB object in the view being added to the share.
The HRMS share does not have usage privileges on the SALES_DB database and the sales schema.
Now, if I try to grant usage on SALES_DB and the sales schema to the share, that fails too, because objects from only one database can be given to a share, and SALES_DB is not that database.
The way to do that is by granting REFERENCE_USAGE on database SALES_DB to the HRMS share. REFERENCE_USAGE gives the share SELECT access on the tables inside SALES_DB that are referenced by the secure view.
So if you have tables from multiple databases that need to be referenced inside a view, and that view needs to be added to a share, you can grant REFERENCE_USAGE on that database to the share, allowing the share to get SELECT access on those tables without giving it full access to the database.
Now that the grant is done, I will attempt to add the view to the share again.
This is how you can share data from multiple databases with a view over multiple tables and a single share.
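A condensed sketch of the grants from this walkthrough (database, schema, view, and share names mirror the lecture but are illustrative):
CREATE SECURE VIEW hrms.hr.emp_daily_sales_v AS
  SELECT e.employee_id, s.sale_amount
  FROM hrms.hr.employees e
  JOIN sales_db.public.daily_sales s ON e.employee_id = s.employee_id;
GRANT REFERENCE_USAGE ON DATABASE sales_db TO SHARE hrms_share;     -- lets the shared secure view reference SALES_DB tables
GRANT SELECT ON VIEW hrms.hr.emp_daily_sales_v TO SHARE hrms_share;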