The document discusses massive data processing at Adobe using Delta Lake, highlighting various aspects such as data representation, schema evolution, and challenges in data ingestion. It emphasizes the performance benefits of utilizing Delta Lake for handling large-scale data efficiently, while considering issues like schema management and replication lag. Key features like ACID transactions and lazy schema on-read approaches are also outlined to address the complexities of multi-tenant data architecture.
Massive Data Processing in Adobe using Delta Lake
Yeshwanth Vijayakumar
Sr. Engineering Manager/Architect @ Adobe
Agenda
§ Introduction
§ What are we storing?
§ Data Representation and Nested Schema Evolution
§ Writer Worries and How to Wipe them Away
§ Staging Tables FTW
§ Datalake Replication Lag Tracking
§ Performance Time!
Unified Profile Data Ingestion
[Diagram labels: Unified Profile; Experience Data Model; Adobe Campaign; AEM; Adobe Analytics; Adobe AdCloud; Change Feed Streaming; Stats Generation; Single Tenant; Multi Tenant]
Data Layout At a Glance
An idea of how the graph linkages are stored
primaryId  relatedIds  field1  field2  field1000
123        123         a       b       c
456        456         d       e       f
123        123         d       e       l
789        789,101     x       y       z
101        789,101     x       u       p
Conditions
• primaryId does not change
• relatedIds can change
New Record comes in
Indicating a new linkage, causing a change in graph membership
primaryId  relatedId    field1  field2  field1000
103        103,789,101  q       w       r
789        103,789,101  x       y       z
101        103,789,101  x       y       z

primaryId  relatedId    field1  field2  field1000
103        103,789,101  q       w       r
New Record comes in linking 103 with 789 and 101
Causes a cascading change in rows of 789 and 101
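The cascade described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the `link_record` function and the in-memory `table` dict are hypothetical, and it assumes every stored row already carries the full membership of its graph in `relatedIds`.

```python
def link_record(table, primary_id, linked_ids, fields):
    """Insert or update a record and propagate the new graph membership.

    table maps primaryId -> {"relatedIds": set, "fields": dict}.
    Returns the sorted primaryIds whose relatedIds had to be rewritten.
    """
    # The new membership is the new record, the ids it links to,
    # and everything those ids were already linked with.
    members = {primary_id} | set(linked_ids)
    for pid in list(members):
        if pid in table:
            members |= table[pid]["relatedIds"]

    # primaryId never changes; only fields and relatedIds are updated.
    row = table.setdefault(primary_id, {"relatedIds": set(), "fields": {}})
    row["fields"] = fields

    changed = []
    for pid in members:
        if pid in table and table[pid]["relatedIds"] != members:
            table[pid]["relatedIds"] = set(members)
            changed.append(pid)
    return sorted(changed)


table = {
    "789": {"relatedIds": {"789", "101"}, "fields": {"field1": "x"}},
    "101": {"relatedIds": {"789", "101"}, "fields": {"field1": "x"}},
}
changed = link_record(table, "103", ["789", "101"], {"field1": "q"})
```

A single incoming record for 103 therefore rewrites three rows, which is why linkage changes are more expensive than plain field updates.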
Complexities?
• Nested Fields
• a.b.c.d[*].e nested hairiness!
• Arrays!
• MapType
• Every Tenant has a different Schema!
• Schema evolves constantly
• Fields can get deleted, updated.
• Multiple Sources
• Streaming
• Batch
Scale?
• Tenants have 10+ billion rows
• PBs of data
• Million RPS peak across the system
• Triggers multiple downstream applications
• Segmentation
• Activation
What is Delta Lake?
From delta.io: Delta Lake is an open-source project that enables building a Lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.
Key Features
• ACID Transactions
• Time Travel (data versioning)
• Uses Parquet Underneath
• Schema Enforcement and Schema Evolution
• Audit History
• Updates and Deletes Support
Writer Worries and How to Wipe them Away
• Concurrency Conflicts
• Column size
• When individual column data exceeds 2 GB, we see degradation in writes or OOM
• Update frequency
• Too frequent updates cause underlying filestore metadata issues.
• This is because every transaction on an individual Parquet file causes copy-on-write (CoW)
• More updates => more rewrites on HDFS
• Too many small files!!!
CDC (existing)
[Diagram labels: Batch Ingestion / Streaming Ingestion / API based Ingest; Mutation Apps; CosmosDB; CDC]
1. Send Request to Cosmos
2. Ack
3. Emit CDC
Consumed by
• Stats
• Edge
• etc
Dataflow with Delta Lake
primaryId  relatedId    field1  field2  field1000
103        103,789,101  q       w       r
789        103,789,101  x       y       z
101        103,789,101  x       y       z
Cosmos DB
primaryId  relatedId    field1  field1000
103        103,789,101  q       r
primaryId  relatedId    jsonString
103        103,789,101  <jsonStr>
789        103,789,101  <jsonStr>
101        103,789,101  <jsonStr>
Diagram labels:
• Staging Table
• Change Feed CDC
• Raw Table (per tenant)
• Check for Work every X minutes
• UPSERT/DELETE into Raw Table
• Fetch Records to process
• APPEND only!
• CDC Dumper
• Backfill
• Long Running Streaming Application
• Processor
• Partitioned by tenant and 15 min time intervals
• TenantLock in Redis
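The per-tenant lock named in the diagram could work roughly as follows. This is a sketch under assumptions, not the actual Processor: `InMemoryLockStore` is a hypothetical stand-in for Redis whose `acquire` imitates the semantics of the Redis command `SET key value NX EX ttl`, and the key format, TTL, and `process_tenant` helper are invented for illustration.

```python
import time


class InMemoryLockStore:
    """Hypothetical stand-in for Redis, used so the sketch runs anywhere."""

    def __init__(self, clock=time.monotonic):
        self._locks = {}  # key -> (owner, expiry time)
        self._clock = clock

    def acquire(self, key, owner, ttl_seconds):
        # Mimics SET key owner NX EX ttl: succeed only if the key is
        # absent or its previous holder has expired.
        now = self._clock()
        holder = self._locks.get(key)
        if holder is not None and holder[1] > now:
            return False
        self._locks[key] = (owner, now + ttl_seconds)
        return True

    def release(self, key, owner):
        # Only the current owner may release the lock.
        holder = self._locks.get(key)
        if holder is not None and holder[0] == owner:
            del self._locks[key]


def process_tenant(store, tenant, worker_id, work):
    """Run work(tenant) only if this worker wins the tenant lock."""
    key = f"tenant-lock:{tenant}"  # assumed key format
    if not store.acquire(key, worker_id, ttl_seconds=600):
        return False  # another worker is already upserting this tenant
    try:
        work(tenant)
    finally:
        store.release(key, worker_id)
    return True
```

Holding one lock per tenant means at most one writer performs UPSERT/DELETE on a given tenant's Raw Table at a time, which sidesteps the concurrency conflicts listed under writer worries.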
Staging Tables FTW
Fan-In pattern vs Fan-out
• Multiple Source Writers Issue Solved
• By centralizing all reads from CDC, since ALL writes generate a CDC
• Staging Table in APPEND ONLY mode
• No conflicts while writing to it
• Filter out bad data above thresholds before it makes it to the Raw Table
• Batch writes by reading larger blocks of data from the Staging Table
• Since it acts as a time-aware message buffer
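A time-aware buffer of this kind can be illustrated by grouping CDC events into per-tenant, 15-minute buckets, matching the partitioning shown in the dataflow. The sketch below is an assumption-laden illustration: the event shape (`tenant`, `timestamp` in epoch seconds) and the function names are hypothetical.

```python
from collections import defaultdict

BUCKET_SECONDS = 15 * 60  # 15-minute intervals


def bucket_start(epoch_seconds):
    """Round a timestamp down to the start of its 15-minute bucket."""
    return epoch_seconds - (epoch_seconds % BUCKET_SECONDS)


def group_by_partition(events):
    """Group events by (tenant, bucket start) so each group can be
    read and processed as one larger block."""
    partitions = defaultdict(list)
    for event in events:
        key = (event["tenant"], bucket_start(event["timestamp"]))
        partitions[key].append(event)
    return dict(partitions)


events = [
    {"tenant": "t1", "timestamp": 0},
    {"tenant": "t1", "timestamp": 899},
    {"tenant": "t1", "timestamp": 900},
    {"tenant": "t2", "timestamp": 30},
]
groups = group_by_partition(events)
```

Because writers only append into the bucket for the current interval, a downstream job can pick up whole buckets that are already closed and upsert them in bulk.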
Why choose JSONString format?
§ We are doing a lazy Schema on-read approach.
▪ Yes, this is an anti-pattern.
§ Nested Schema Evolution was not supported on update in Delta in 2020
▪ Supported with latest version
§ We want to apply conflict resolution before upserting
▪ E.g. resolveAndMerge(newData, oldData)
▪ UDFs are strict on types; with the plethora of different schemas, it is crazy to manage a UDF per org in a multi-tenant fashion
▪ Now we just have simple JSON merge UDFs
▪ We use json-iter, which is very efficient at loading partial bits of JSON and manipulating them.
§ Don’t you lose predicate pushdown?
▪ We have pulled out all main push-down filters to individual columns
▪ E.g. timestamp, recordType, id, etc.
▪ Profile workloads are mainly scan based, since we can run 1000s of queries at a time.
▪ Reading the whole JSON string from the datalake is much faster and cheaper than reading 20% of all fields from Cosmos.
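A schema-agnostic merge over JSON strings, in the spirit of resolveAndMerge(newData, oldData), might look like the sketch below. It uses Python's standard `json` module rather than json-iter, and the conflict policy (nested objects are merged recursively, anything else is overwritten by the new value) is an assumption; the talk does not state the actual resolution rules.

```python
import json


def _merge(new, old):
    """Recursively merge two parsed JSON values, preferring new."""
    if isinstance(new, dict) and isinstance(old, dict):
        merged = dict(old)
        for key, value in new.items():
            merged[key] = _merge(value, old[key]) if key in old else value
        return merged
    # Scalars, arrays, and type mismatches: the new value wins (assumed policy).
    return new


def resolve_and_merge(new_json, old_json):
    """Merge two JSON strings without knowing the tenant's schema."""
    merged = _merge(json.loads(new_json), json.loads(old_json))
    return json.dumps(merged, sort_keys=True)


result = resolve_and_merge(
    '{"a": {"b": 1}, "c": [1]}',
    '{"a": {"d": 2}, "c": [2, 3]}',
)
```

Since the function only ever sees strings, the same UDF can be registered once and reused for every tenant, regardless of how each tenant's schema evolves.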
Schema On Read is a more future-safe approach for raw data
§ Wrangling Spark Structs is not user friendly
§ JSON schema is messy
▪ Crazy nesting
▪ Add maps to the equation, and the schema alone will be in MBs
§ Schema on Read using json-iter means we can read what we need on a row-by-row basis
§ Materialized Views WILL have structs!
Partition Scheme of Raw records
• RawRecords Delta Table
• recordType
• sourceId
• timestamp (key-value records will use DEFAULT value)
z-order on primaryId
z-order - Colocate column information in the same set of files using locality-preserving space-filling curves
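To give a feel for what a locality-preserving space-filling curve does, here is a generic Morton (Z-order) encoding that interleaves the bits of two integer keys. It illustrates the general idea only and is not a description of Delta Lake's internal implementation; the function name and the 16-bit width are arbitrary choices. With a single column such as primaryId, a curve of this kind reduces to ordering by that column's values.

```python
def interleave_bits(x, y, bits=16):
    """Return the Morton code of (x, y): bit i of x goes to position 2*i,
    bit i of y goes to position 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting points by their Morton code keeps nearby (x, y) pairs close
# together in the output order, so they tend to land in the same files.
points = [(3, 3), (0, 1), (2, 0), (1, 1), (0, 0), (1, 0)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Files written in this order have narrow min/max ranges per column, which is what lets a reader skip files that cannot contain the requested key.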
Replication Lag - 2 types
• CDC Lag from Kafka
• Tells us how much more work we need to do to catch up to write to the Staging Table
• How we track Lag on a per tenant basis
• We track Max(TimeStamp) in CDC per org
• We track Max(TSKEY) processed in Processor
• Difference gives us rough lag of replication
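The per-tenant lag calculation above amounts to a subtraction of two high-water marks. A minimal sketch, assuming both marks are available as dicts of tenant to epoch seconds (the function name and data shapes are hypothetical):

```python
def replication_lag(max_cdc_ts, max_processed_ts):
    """Return tenant -> lag in seconds.

    max_cdc_ts:       tenant -> Max(TimeStamp) seen in CDC
    max_processed_ts: tenant -> Max(TSKEY) handled by the Processor
    A tenant with no processed records yet gets None (lag unknown).
    """
    lags = {}
    for tenant, cdc_ts in max_cdc_ts.items():
        processed = max_processed_ts.get(tenant)
        if processed is None:
            lags[tenant] = None
        else:
            # Clamp at zero in case the two clocks disagree slightly.
            lags[tenant] = max(0, cdc_ts - processed)
    return lags


lags = replication_lag(
    {"org1": 1_000_600, "org2": 1_000_000, "org3": 1_000_000},
    {"org1": 1_000_000, "org2": 1_000_030},
)
```

The result is only a rough figure, as the slide says, because a tenant with no recent writes shows zero lag whether or not the Processor is healthy.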
Merge/UPSERT Performance
Action: UPSERT CDC stage into fragment                           Time Taken
170K CDC records (maps to 100K rows in Raw Table)                15 seconds
1.7 million CDC records (maps to 1 million rows in Raw Table)    61 seconds
Live Traffic Usecase: How long does it take X CDC messages to get upserted into Raw Table
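The implied throughput can be worked out directly from the two measurements above. This is plain arithmetic on the reported figures, with no additional data assumed:

```python
def records_per_second(records, seconds):
    """Average upsert throughput for one measured batch."""
    return records / seconds


small_batch = records_per_second(170_000, 15)     # about 11,333 records/s
large_batch = records_per_second(1_700_000, 61)   # about 27,869 records/s

# Ten times the records took roughly four times as long.
time_ratio = 61 / 15
```

On these two data points, larger batches amortize the fixed cost of a merge, which fits the earlier advice to batch writes by reading larger blocks from the Staging Table.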
Job Performance Time!
                      HotStore (NoSQL Store)   Delta Lake
Size of Data          1 TB                     64 GB
Number of Partitions  80                       189
Job Cores Used        112                      112
Job Runtime           3 hours                  25 mins
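Expressed as ratios, the comparison above works out as follows. The only assumption beyond the table is that 1 TB is taken as 1024 GB:

```python
hot_store_minutes = 3 * 60   # 3 hours
delta_lake_minutes = 25

# Same 112 cores in both runs, so this is a like-for-like runtime ratio.
speedup = hot_store_minutes / delta_lake_minutes   # 7.2x faster

# Assumes 1 TB = 1024 GB.
size_ratio = 1024 / 64                             # 16x smaller on disk

partition_ratio = 189 / 80                         # about 2.4x more partitions
```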
Takeaways
• Scan IO speed from datalake >>> Read from Hot Store
• Reasonably fast, eventually consistent replication within minutes
• More partitions means better Spark executor core utilization
• Potential to aggressively TTL data in hot store
• More downstream materialization !!!
• Incremental Computation Framework thanks to Staging tables!