Snowflake - Data Ingestion/Loading
(Diagram: several sources - a few million transactions every hour, a few thousand transactions every minute, and a few thousand rows fetched from a REST API every 4-6 hours - all landing in Snowflake tables Table-1 through Table-4.)
Snowflake - Data loading options
Batch/Bulk Data Ingestion:
● Write/load the data into your staging location (S3, GCS buckets)
● Ingest the data into Snowflake in batches at frequent time intervals using:
○ Snowflake COPY commands scheduled using Snowflake tasks (see the sketch after this list)
○ Trigger COPY commands using python/Glue/Airflow jobs running at specified time intervals
Real-Time Data Ingestion:
● Write/load the data into your staging location (S3, GCS buckets) and ingest the data in near-real time using:
○ Snowpipe (continuous data ingestion)
○ Airflow S3 sensors/triggers
● Snowflake Kafka connector for real-time data ingestion
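A minimal sketch of the scheduled-COPY option, assuming a target table TABLE_1, a stage MY_S3_STAGE, a file format MY_CSV_FORMAT, and a warehouse LOAD_WH already exist (all hypothetical names):

    -- Run a COPY command at the top of every hour via a Snowflake task
    CREATE OR REPLACE TASK load_table_1_hourly
      WAREHOUSE = load_wh
      SCHEDULE  = 'USING CRON 0 * * * * UTC'
    AS
      COPY INTO table_1
        FROM @my_s3_stage/table_1/
        FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

    -- Tasks are created suspended; resume the task to start the schedule
    ALTER TASK load_table_1_hourly RESUME;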
Snowflake - Batch Data Ingestion
(Diagram: data extraction pipelines pull RDBMS/structured data and load/write it to a staging location (AWS S3 bucket); data ingestion pipelines in python/Airflow/Glue run at scheduled time intervals and load it into Snowflake tables Table-1 through Table-4.)
Snowflake - Real-Time Data Ingestion
(Diagram: semi-structured streaming data is loaded/written to a staging location (AWS S3 bucket); Airflow S3 sensors/Snowpipe objects are triggered on each write event and load it into Snowflake tables Table-1 through Table-4.)
Snowflake - Real-Time Data Ingestion
(Diagram: semi-structured streaming data flows through the Snowflake Kafka connector, which loads/writes it to the staging location and ingests it into Snowflake tables Table-1 through Table-4.)
Snowflake - Batch Data Ingestion
(Diagram: a few thousand rows are fetched from a REST API every 4-6 hours by python/Airflow/Glue jobs and loaded into Snowflake tables Table-1 through Table-4.)
Snowflake - AWS Connection
● Step-1: Create an IAM role for Snowflake to access data in S3 buckets
● Step-2: Create an S3 bucket in AWS and upload sample files into the bucket
● Step-3: Create an integration object in Snowflake for authentication (using the ACCOUNTADMIN role; see the SQL sketch after this list)
● Step-4: Create a file format object (using the SYSADMIN/custom role)
● Step-5: Create a stage object referencing the location from which the data needs to be ingested (using the SYSADMIN/custom role)
● Step-6: Load the data into Snowflake tables (using the SYSADMIN/custom role)
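A minimal SQL sketch of Steps 3-5, assuming a hypothetical bucket my-staging-bucket and IAM role ARN (replace with your own values):

    -- Step-3 (ACCOUNTADMIN): integration object for authentication against S3
    CREATE OR REPLACE STORAGE INTEGRATION s3_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access_role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-staging-bucket/');
    -- DESC INTEGRATION s3_int;  -- copy the returned IAM user/external ID into the AWS role's trust policy

    -- Step-4 (SYSADMIN/custom role): file format object for the sample files
    CREATE OR REPLACE FILE FORMAT my_csv_format
      TYPE = CSV
      FIELD_DELIMITER = ','
      SKIP_HEADER = 1;

    -- Step-5: stage object referencing the bucket through the integration
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://my-staging-bucket/'
      STORAGE_INTEGRATION = s3_int
      FILE_FORMAT = my_csv_format;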
Snowflake - What have we done so far?
● Created a Storage Integration Object and authenticated Snowflake to read data from S3
● Created a Stage Object which refers to the Integration Object (one stage object per table)
● Executed COPY commands manually to ingest data into the respective tables (example below)
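A hedged example of one such manual COPY (Step-6), reusing the hypothetical stage and file format names from the sketch above:

    -- Load the COUNTRY table from files under the country/ prefix of the stage
    COPY INTO country
      FROM @my_s3_stage/country/
      FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
      ON_ERROR = 'ABORT_STATEMENT';  -- default behaviour: stop the load on the first bad record

Repeat the same pattern for ORDERS, LINEITEM, and PARTSUPP, pointing each COPY at its own prefix.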
Snowflake - Batch Data Ingestion
(Diagram: COPY commands ingest data from the AWS S3 staging bucket into the COUNTRY, ORDERS, LINEITEM, and PARTSUPP tables.)
Snowflake - Batch Data Ingestion
(Diagram: a new batch of data arrives in the AWS S3 staging bucket every X hours; Snowpipe automatically executes COPY commands to ingest it into the COUNTRY, ORDERS, LINEITEM, and PARTSUPP tables.)
Snowflake - Continuous Data Ingestion
● Snowpipe is Snowflake’s continuous data ingestion service
● Snowpipe loads data within minutes after files are added to a stage and submitted for ingestion.
● Snowpipe loads data from staged files in micro-batches, rather than relying on manually executed COPY statements run on a schedule to load larger batches (see the sketch below).
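A minimal Snowpipe sketch, assuming the hypothetical stage, file format, and ORDERS table used earlier; AUTO_INGEST relies on S3 event notifications pointing at the pipe's SQS queue:

    -- Auto-ingest new files that land under the orders/ prefix of the stage
    CREATE OR REPLACE PIPE orders_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO orders
        FROM @my_s3_stage/orders/
        FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

    -- SHOW PIPES;  -- the notification_channel column holds the SQS ARN to configure on the S3 bucket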
Data Ingestion - Key Considerations
● Recommended approach to ensure data is ingested concurrently (see the COPY sketch after this list):
○ Avoid loading data in a single large file
○ Split large files into smaller files (~100-250 MB Compressed)
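As a hedged illustration, a single COPY over many small split files (hypothetical names); Snowflake spreads the files across the warehouse's parallel load threads:

    -- Load all ~100-250 MB compressed chunks of LINEITEM in one COPY
    COPY INTO lineitem
      FROM @my_s3_stage/lineitem/
      PATTERN = '.*lineitem_part_[0-9]+\.csv\.gz'
      FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');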