Python Data Engineering with Snowpark
Python Data Engineering with Snowpark
Snowpark is Snowflake’s developer framework that enables all Snowpark fits into the larger Snowflake data
data users to bring their work to the Snowflake Data Cloud with engineering ecosystem
native support for Python, SQL, Java, and Scala. With Snowpark,
We’ll also share resources designed to help data engineers get started
data engineers can execute pipelines that feed ML models and
with Snowflake and Snowpark.
applications faster and more securely in a single platform using their
language of choice.
3
CHAMPION GUIDES
SNOWPARK FOR
DATA ENGINEERING
Snowpark, a modern developer framework A SINGLE PLATFORM CUSTOMER SPOTLIGHT
for Snowflake, allows data engineers to build Architecture complexity increases significantly
simple, governed, and fast pipelines in their when different teams use different languages across
Snowflake customer, HyperFinity, a no-code decision
preferred programming languages. Snowpark multiple processing engines. Snowpark streamlines
intelligence platform for retailers and CPGs, uses
offers data engineers many benefits: architectures by natively supporting programming
SQL and Python for their ML and AI initiatives.
languages of choice, without the need for separate
• A single platform that supports multiple With Snowpark, HyperFinity has a single platform
processing engines, as shown in Figure 1. Instead,
languages, including SQL, Python, Java, that supports both languages, thereby eliminating
Snowpark brings all teams together to collaborate on
and Scala the same data in a single platform—Snowflake.
cumbersome data movement and code developed
to keep data movement across different services.
• Consistent security across all workloads with As a result, HyperFinity works more seamlessly—
no governance trade-offs
developing, testing, and deploying Python and SQL in
• Faster, cheaper, and more resilient pipelines. one environment for more agile overall operations.
Figure 1: Snowpark allows developers working in many languages to leverage the power of Snowflake.
4
CHAMPION GUIDES
NO GOVERNANCE TRADE-OFFS FASTER, CHEAPER PIPELINES SNOWFLAKE (AND SNOWPARK)
FOR DATA ENGINEERING
Enterprise-grade governance controls and security Snowpark enables pipelines with better price
are built into Snowflake. For example, Snowpark is performance, transparent costs, and less operational Snowpark is a powerful developer framework for
secure with a design that isolates data to protect overhead thanks to Snowflake’s unique multi-cluster data engineering. Some of the critical use cases for
the network and host from malicious workloads, shared data architecture. Snowflake is a single, data engineers working in Snowpark include:
and gives administrators control over the libraries integrated platform that delivers the performance, • ETL/ELT: Data teams can use Snowpark to
developers execute. Developers can build scale, elasticity, and concurrency today’s transform raw data into modeled formats
confidently, knowing data security and compliance organizations require. regardless of type, including JSON, Parquet,
measures are consistent and built-in. and XML. All data transformations can then be
CUSTOMER SPOTLIGHT packaged as Snowpark stored procedures to
CUSTOMER SPOTLIGHT operate and schedule jobs with Snowflake Tasks
or other orchestration tools.
These benefits can be seen at IQVIA, a leading
• Custom logic: Users can leverage Snowpark’s
provider of analytics, technology solutions, and
EDF, a supplier of gas and zero-carbon electricity User Defined Functions (UDFs) to streamline
clinical research services in the life sciences
to homes and businesses in the United Kingdom, architecture with complex data processing and
industry, and a Snowflake customer. As IQVIA
tapped Snowpark to help deploy data applications. custom business logic written in Python or Java
processed increasingly large volumes of structured,
By working within Snowflake, the project did not in the same platform running SQL queries and
semistructured, and unstructured data, the company
require additional sign-offs and meetings to approve transformations. There are no separate clusters to
had to manage mounting complexity as its
data accessibility. Instead, the EDF team could scale manage, scale, or operate.
business scaled.
seamlessly by working within the security rules • Data science and ML pipelines: Data teams can
Snowflake enables and that applied to the project. Since implementing Snowpark in Snowflake,
use the integrated Anaconda repository and
IQVIA has developed data engineering pipelines
Since integrating Snowpark into its data engineering package manager to collaborate in bringing ML
and intelligent apps more quickly and easily, with
operations, EDF sped up production of customer- data pipelines to production. Trained ML models
consistent enterprise-level governance features
facing, ML-driven programs, from several months to can also be packaged as a UDF to run the model
such as row-level access, data masking, and closer
just three to four weeks, increasing output up by 4x. inference close to data, enabling faster paths from
proximity of data to processing. By leveraging
model development to production.
Snowpark to build their pipelines that process large
volumes of data, IQVIA has realized a cost savings of
3x compared to previous pipeline processes.
5
CHAMPION GUIDES
SNOWFLAKE FOR PYTHON
Using Snowpark for Python, data engineers SNOWPARK FOR PYTHON ARCHITECTURE
can take advantage of familiar tools and
programming languages while benefiting from
the scale, security, and performance of the
Snowflake engine. All processing is run in a
secure Python sandbox right next to your data,
resulting in faster, more scalable pipelines with
built-in governance regardless of the language
used. Figure 2 gives an overview of both the
Snowpark Client API and the Snowflake server
side runtime.
6
CHAMPION GUIDES
For custom Python or Java code, there is no SNOWPARK DATAFRAME SNOWFLAKE SERVER-SIDE RUNTIME
translation to SQL. Rather the code is serialized and Snowpark brings deeply integrated, DataFrame-style Snowflake is cloud-built as a data platform that
sent to Snowflake to be processed inside the programming to the languages that data engineers architecturally separates but logically integrates
Java or Python Secure Sandbox. In the case of prefer to use. Data engineers can build queries in storage and compute, and optimized to enable near-
Python, if the custom code includes any third-party Snowpark using DataFrame-style programming limitless amounts of these resources. Elastic scaling,
open source libraries available in the integrated in Python, using their IDE or development tool of multi-language processing, and unified governance
Anaconda package repository, the package manager choice. Behind the scenes, all DataFrame operations also underpin Snowflake’s architecture.
can help ensure code runs without complex are transparently converted into SQL queries that are
environment management. The intelligent infrastructure is what makes
pushed down to the Snowflake scalable processing
everything just work. Compute clusters can be
engine. Because DataFrames use first-class language
SNOWPARK CLIENT API started, stopped, or resized—automatically or on
constructs, engineers also benefit from support for
the fly—accommodating the need for more or less
The Snowpark Client API is open source and type checking, IntelliSense, and error reporting in
compute resources at any time. Along with flexibility,
works with any Python environment. It allows data their development environment.
Snowflake prioritizes speed, granting near-instant
engineers to build queries using DataFrames right in
access to dedicated compute clusters for each
their Python code, without having to create and pass
workload, so users can take advantage of near-
along SQL strings.
limitless concurrency without degrading performance.
The three architectural layers that integrate within
Snowflake’s single platform are shown in Figure 3.
THE SNOWFLAKE PLATFORM ARCHITECTURE The Snowpark Python server-side runtime makes
it possible to write Python UDFs and Stored
Procedures that are deployed into Snowflake’s
secured Python sandbox. UDFs and stored
procedures are two other key components of
Snowpark that allow data engineers to bring custom
Python logic to Snowflake’s compute engine, while
taking advantage of open source packages pre-
installed in Snowpark.
7
CHAMPION GUIDES
SNOWPARK USER DEFINED FUNCTIONS (UDFS) #Given geo-coordinates, UDF to calculate distance between
distribution center and shipping locations
Custom logic written in Python runs directly in
Snowflake using UDFs. Functions can stand alone or
be called as part of a DataFrame operation to process from snowflake.snowpark.functions import udf
the data. Snowpark takes care of serializing the
import geopandas as gpd
custom code into Python byte code and pushes all
of the logic to Snowflake, so it runs next to the data. from shapely.geometry import Point
8
CHAMPION GUIDES
STORED PROCEDURES -- Create python stored procedure to host and run the snowpark pipeline
to calculate and apply bonuses
Snowpark stored procedures help data engineers
operationalize their Python code and run, create or replace procedure apply_bonuses(sales_table string, bonus_table
orchestrate, and schedule their pipelines. A stored string)
procedure is created once and can be executed returns string
many times with a simple CALL statement in your
language python
orchestration or automation tools. Snowflake
supports stored procedures in SQL, Python, Java, runtime_version = ‘3.8’
the Python code and dependencies into bytecode from snowflake.snowpark.types import *
and store them in a Snowflake stage automatically.
They can be created either as a temporary (session-
def apply_bonuses(session, sales_table, bonus_table):
level) or permanent object in Snowflake.
session.table(sales_table).select(col(“rep_id”), col(“sales_amount”)*0.1).
Stored procedures are single-node, which means
write.save_as_table(bonus_table)
transformations or analysis of data at scale inside a
stored procedure should leverage the Client API or return “SUCCESS”
other deployed UDFs to scale compute across all $$;
nodes of a compute cluster.
To the right is a simple example of how to --Call stored procedure to apply bonuses
operationalize a Snowpark for Python pipeline that
call apply_bonuses(‘wholesale_sales’,‘bonuses’);
calculates and applies a company’s sales bonuses on
a daily basis.
– Query bonuses table to see newly applied bonuses
warehouse = ‘xs’
as
call apply_bonuses(‘wholesale_sales’,‘bonuses’);
9
CHAMPION GUIDES
SNOWFLAKE, ANACONDA, AND pre-installed from the Anaconda repository, including write transformations in the language they find most
THE OPEN SOURCE ECOSYSTEM fuzzy wuzzy for string matching, h3 for geospatial familiar and fit for purpose. And dbt on Snowpark
One of the benefits of Python is its rich ecosystem of analysis, and scikit-learn for machine learning and allows analyses using tools available in the open
open-source packages and libraries. In recent years, predictive data analysis. Additionally, Snowpark is source Python ecosystem, including state-of-the-
open-source packages have been one of the biggest integrated with the Conda package manager so users art packages for data engineering and data science,
enablers for faster and easier data engineering. To can avoid dealing with broken Python environments all within the dbt framework familiar to many SQL
leverage open-source innovation, Snowpark has because of missing dependencies. users. It supports a SQL-first workflow, and in 2022,
partnered with Anaconda for a product integration Using open-source packages in Snowflake is as dbt introduced Python as a second language running
without any additional cost or licensing to the user simple as the code below, which demonstrates how Snowpark under the hood to perform analyses using
beyond warehouse usage. users can call packages such as NumPy, XGBoost, tools available in the open-source Python ecosystem.
Data engineers in Snowflake are now able to speed and Pandas, directly from Snowpark.
up their Python-based pipelines by taking advantage Snowpark also fully supports dbt, one of the most
of the seamless dependency management and popular solutions for data transformation today. It
comprehensive set of curated open-source packages supports a SQL-first transformation workflow, and
provided by Anaconda—all without moving or copying in 2022, dbt introduced support for Python. With
the data. All Snowpark users can benefit from dbt’s support for both SQL and Python, users can
thousands of the most popular packages that are
returns array
language python
runtime_version = 3.8
packages = (‘numpy’,’pandas==1.4.*’,’xgboost==1.5.0’)
handler = ‘udf’
as $$
import numpy as np
import pandas as pd
def udf():
$$;
10
CHAMPION GUIDES
BEST PRACTICES:
DATA ENGINEERING IN
SNOWPARK WITH PYTHON
As the Snowpark for Python developer 1. Maximize use of the Snowpark client for please submit feedback through the Snowflake
community grows rapidly, data engineers are development and the Snowflake engine for Community so Snowflake teams can facilitate
looking for “best practices” to guide their work. secure execution. its integration. If the package is a pure Python
Snowpark can be used with your preferred IDE package, you can unblock yourself and bring in
Understanding how Snowpark DataFrames,
and development and debugging tools, and the the package via Stages.
UDFs, and stored procedures work together
execution can be transparently pushed down 3. Use vectorized UDFs for feature transformations
can make data engineers’ work in Snowflake
to Snowflake. Maximize this utility while being and ML scoring.
more efficient and secure. We’ve compiled a
mindful of the use of to_pandas() from the
short list of best practices for data engineers Vectorized UDFs using the Batch API can
Snowpark Client, which brings full data into
execute scalar UDFs in batches. Use the
working with Python in Snowpark. memory. Also, Cachetools is a Python library that
Python UDF Batch API if leveraging third-party
provides a collection of caching algorithms to
Python packages where transformations are
store a limited number of items for a specified
independently done row by row and the process
duration. They can be used to speed up UDFs
could be efficiently scaled out by processing
and stored procedures by ensuring the logic is
rows in batches. This is a common scenario when
cached in memory in cases of repeated reads.
using third-party Python packages to do machine
2. Accelerate development to production flow with learning-specific transformations on data as part
Anaconda integration. of feature engineering or when executing ML
We recommend using the Snowflake Anaconda batch inference.
channel for local development to ensure 4. Use Snowpark-optimized warehouses for
compatibility between client- and server-side memory-intensive workloads.
operations. Building your code using the latest
Snowpark-optimized warehouses are important
stable versions of third-party packages doesn’t
for data engineers working on large data sets.
require users to specify dependencies because
Consider using a Snowpark-optimized warehouse
the Conda Package Manager takes care of this,
when you run into a 100357 (P0000): UDF
offering tremendous peace of mind. If a desired available memory exhausted error during
package is not available inside Snowflake, development. Avoid mixing other workloads
with workloads that require Snowpark-optimized
warehouses. If you must mix them, consider
calling the session.use_warehouse() method
to switch back to standard warehouses.
11
CHAMPION GUIDES
BEYOND SNOWPARK:
OTHER CAPABILITIES IN THE SNOWFLAKE
DATA ENGINEERING ECOSYSTEM
In addition to Snowpark, Snowflake has Snowflake is constantly enhancing functionality. Snowflake was designed to give data engineers
many other data engineering capabilities Dynamic tables, which provide a way to build access to all data at speed with performance and
that make it a fast and flexible platform that declarative pipelines, are currently in private review reliability at scale to build radically simple data
and offer a different approach to building pipelines pipelines. With innovative pipeline automation and
comprehensively supports simple, reliable data
from Snowpark for Python UDFs and Stored data programmability, data engineers can simplify their
pipelines in any language of choice. Procedures. These tools are designed to automatically workflows and eliminate what’s unnecessary, so they
Figure 4 offers an overview of Snowflake’s advanced process data incrementally as it changes to simplify can focus their effort on their most impactful work.
functionality for ingestion, transformation, and data engineering workloads. Snowflake automates all
delivery that simplify data engineering. the database objects and data manipulation language
management, enabling data engineers to easily build
Snowflake allows data engineering teams to ingest scalable, performant, and cost-effective data pipelines.
all types of data using a single platform, including
streaming or batch and structured, semi-structured, The resulting data pipelines have intelligent
or unstructured. Supported data formats include infrastructure, pipeline automation, and data
JSON, XML, Avro, Parquet, ORC, and Iceberg. programmability. Snowflake’s simplified pipelines
Streaming data, including streams from Apache Kafka then power analytics, applications, and ML models
topics, can also be ingested directly to a Snowflake with only one copy of data to manage and near-zero
table with Snowpipe Streaming, currently in public maintenance. Data can also be accessed and shared
preview. Thanks to the Data Cloud, all this data can directly using secure data sharing capabilities with
be accessed and shared across providers and between internal teams, customers, partners, and even more
internal teams, customers, partners, and other data data providers and consumers through the Snowflake
consumers via the Snowflake Marketplace. Marketplace. Data doesn’t move with Snowflake’s
modern data sharing technology. Instead a data
Data can be transformed using the data engineer’s provider grants a data consumer near-instant access
language of choice using Snowpark. Tasks can be to live, read-only copies of the data. This approach
combined with table streams for continuous ELT reduces latency, removes the need to copy and move
workflows to process recently changed table rows. stale data, and dramatically reduces the governance
Tasks are easily chained together for successive challenges of managing multiple copies of
execution to support more complex periodic the same data.
processing. All of this can be done fast, and scaled to
meet the evolving number of users, data, and jobs of
complex projects.
12
CHAMPION GUIDES
DATA ENGINEERING WITH SNOWFLAKE
Figure 4: Snowflake supports ingestion of unstructured, semi-structured, and structured data while automated workflows facilitate transformation and delivery.
13
CHAMPION GUIDES
GETTING STARTED
WITH SNOWPARK
To develop and deploy code with Snowpark, • Snowsight worksheets: Snowsight is Snowflake’s • Partner integrated solutions: Many Snowpark
developers have always had the flexibility web interface that provides SQL and Python Accelerated partners offer either hosted open
to work from their favorite integrated (currently in public preview) support in a unified, source notebooks or their own integrated
easy-to-use experience. These worksheets experiences. Their solutions include out-of-the-
development environment (IDE) or notebook.
provide autocomplete for the Snowpark session box Snowpark APIs preinstalled and offer secure
Data engineers can easily get started in Snowpark, and can run directly from the browser as a stored data connections. These deeply integrated
beginning development anywhere that can run procedure. Snowsight is a good option for teams experiences speed up the building and deploying
a Python kernel. Minimizing learning curves by looking for a zero-install editor for writing and of pipelines, models, and apps. More information
eliminating the need for a new tool, data engineers running Snowpark and quickly turning that code on partner integrations can be found on the
simply install the Snowpark Client API and establish a into Stored Procedures that can be orchestrated Snowpark Accelerated page.
connection to their Snowflake account. as part of an automated pipeline.
Snowpark aims to give developers flexibility. It • Open-source notebook solutions: One popular RESOURCES
supports many development interfaces, including: option for building pipelines in Snowpark is to
Start harnessing the power of Snowflake with
leverage notebooks. Notebooks enable rapid
• Code editors and IDEs: Many data engineers Snowpark for data engineering, and get started with
experimentation using cells. With Snowpark, you
prefer to build using code editors and IDEs. the resources below:
can run a variety of notebook solutions such as
These offer capabilities such as local debugging,
Jupyter Notebooks, which can be run locally while
autocomplete, and integration with source
connected securely to Snowflake to execute data FREE TRIAL
control. Snowpark works well in VS Code,
operations. Any machine running containers or
IntelliJ, PyCharm, and other tools. VS Code
Python can build and execute Snowpark pipelines.
QUICKSTART
works with a Jupyter extension that provides
A similar approach can be used for working with
a notebook experience within the editor,
Snowpark in other notebook solutions, including DEVELOPER DOCUMENTATION
bringing in breakpoints and debugging to the
Apache Zeppelin. Open source notebook
notebook experience, without requiring separate
solutions are a great choice for data exploration. MEDIUM BLOG
management of the Jupyter container or runtime.
Code editors and IDEs are a great choice for rich
SNOWFLAKE FORUMS
development and testing experience for
building pipelines.
14
ABOUT SNOWFLAKE
Snowflake enables every organization to mobilize their data with Snowflake’s Data Cloud. Customers use the Data Cloud to unite siloed data,
discover and securely share data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single data experience
that spans multiple clouds and geographies. Thousands of customers across many industries, including 573 of the 2022 Forbes Global 2000
(G2K) as of January 31, 2023, use Snowflake Data Cloud to power their businesses.
© 2023 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature and service names mentioned herein
are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used
herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be
sponsored or endorsed by, any such holder(s).
CITATIONS
1
https://insights.stackoverflow.com/survey