Data engineering & Warehousing on
External tables
What is snowflake ?
● Snowflake is a SaaS cloud-based data warehousing platform
available on Azure , AWS and GCP
● Snowflake can't be hosted on self hosted or on premise systems.
● It offers scalable, high-performance data storage and analysis. It
separates compute and storage, allowing independent scaling
and cost management.
● It’s easy to set up and manage, handles multiple users and
workloads simultaneously, and integrates well with other tools.
Snowflake properties
● Snowflake isn’t a relational database ,so it doesn’t enforce
Primary and Foreign key constraints.
● It allows a concept of Snowflake SQL that allows DDL and DML
commands ,stored procedures and User defined functions.
● It allows the creation of Views and Materialised views.
● It allows the implementation of ACID transactions.
● It allows aggregations , window functions and hierarchical
querying CTE and Recursive CTE’s.
Snowflake properties
● Snowflake can easily integrate with ETL tools like Talend ,
Informatica , Pentaho. Etc
● We can leverage data from Snowflake using Bi tools like Power BI
, Tableau etc
● Can even integrate with big data tools like Databricks.
● Even can connect data via JDBC & ODBC drivers
We can leverage Snowflake by using its Snowflake Web , Snowflake
CLI and DBeaver
Snowflake UI
Snowflake offerings
● The storage and compute is easily scalable.
● For tuning purposes you need not to Index , tune the performance,
create partitions and no physical storage design.
● Pay per use can be tuned to reduce the cost.
Snowflake architecture :
The architecture of Snowflake brings together the best features of
shared-disk and shared-nothing database architectures,
respectively-the simplicity of shared-disk with the performance benefits
of shared-nothing.
Layers of Architecture:
Database Storage:
● Data is automatically reorganised into Snowflake's optimised,
compressed, columnar format when loaded.
● Data is stored in cloud storage that is managed wholly by
Snowflake: organization, compression, structure, etc.
● Data objects are not directly accessible by customers; they are
only accessible through SQL queries.
● In our case data is stored in ADLS gen 2 under the hood
Query Processing:
● Performed using "virtual warehouses"-independent MPP compute
clusters.
● Each virtual warehouse is isolated, and no compute resources are
shared among warehouses, so there's no performance impact
between warehouses.
Cloud Services:
● Collection of services that command and coordinate activities
across Snowflake.
● Handles authentication, infrastructure management, metadata
management, query optimization, and access control.
● Runs on compute instances provisioned by Snowflake from the
cloud provider.
● You can choose between cloud services like Azure , AWS & GCP
- In this tutorial we’ve used Snowflake on Azure cloud.
What is a Warehouse :
In Snowflake, the term "warehouse" refers to a virtual compute resource
that processes queries and performs data loading, unloading, and other
operations within the Snowflake environment. A Snowflake warehouse
is not a physical storage facility, rather, it provides the computing power
required to execute queries on the data stored in Snowflake's
cloud-based storage (Azure ADLS G2).
● Warehouses are responsible for executing SQL queries and other
operations.
● Multiple warehouses can operate concurrently, allowing for
parallel processing of tasks.
After going to the Admin option if we have the sys admin or account
admin rights then we can create a warehouse from the above option
To Create & Configure a Warehouse in Snowflake ;
1. Click
2. Then enter the Name and configure the clusters
3. Choose any one type among these two
5. Choose the Size of the warehouse (cluster)
Lesser the size of cluster less the cost i.e; 1 credit/hour for the XS
cluster , but this cluster is suitable for the less intensive or can say light
tasks.
Similarly , the Larger the size more the cost
Usually cost of clusters depend on the Cloud u are choosing and the
Region as well , Refer this link for the pricing Click here
These are the cluster options for Azure Central India location. That we
are using for the tutorial.
6. Then configure the advance option if you want , exactly what we can
configure in the advanced options are :
● Auto resume allows Snowflake warehouses to run the cluster
automatically whenever the query is executed.
● Auto Suspend allows the warehouse to be disabled when not in
use after a certain period of time of inactivity.
● Multi-cluster Warehouse allows increase or decrease in number
of clusters as per load detecting automatically.
This enables us to define the numbers of minimum and maximum
numbers of clusters and the scaling option as standard & Economy
Scaling Policy Start time Endtime
Standard Starts immediately whenever the
system detects there might be
more query that can be executed
by currently available clusters.
Economy
7. The Query Acceleration Service (QAS) helps maintain high
performance in cloud data warehouses by offloading resource-heavy
portions of queries to additional compute resources. This ensures that
large, unpredictable, or complex queries don’t degrade the overall
performance of the warehouse. It’s particularly useful in environments
where workloads vary greatly in size and complexity, such as ad hoc
analytics, large data scans, or queries with highly selective filters.
After configuring the options and setting simply create the Warehouse
OR
We can use the SQL Notebook to create the Warehouse :
Here we can see our newly created warehouse
What if we want to alter the properties of our existing warehouses
Similarly, we can modify other properties and even can drop the
warehouses as per requirement and use case.