Processing data
DATA ENGINEERING FOR EVERYONE
Hadrien Lacroix
Content Developer at DataCamp
DATA ENGINEERING FOR EVERYONE
A general definition
Data processing: converting raw data into meaningful information
Data processing value
Conceptually | At Spotflix
Remove unwanted data | No long-term need for testing feature data
Optimize memory, process and network costs | Can't afford to store and stream files this big
Convert data from one type to another
Data processing value
Conceptually | At Spotflix
Remove unwanted data | No need for lossless format
To save memory | Can't afford to store files this big
Convert data from one type to another | Convert songs from .flac to .ogg
Organize data | Reorganize data from the data lake to data warehouses
To fit into a schema/structure | Employee table example
Increase productivity | Enable data scientists
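To make the rows above concrete, here is a minimal sketch of three of them (removing unwanted data, converting types, fitting a schema). The record layout, the field names, and the `debug_flag` field are hypothetical and are not taken from the course.

```python
from dataclasses import dataclass

# Hypothetical raw records, as they might land in a data lake.
RAW_SONGS = [
    {"title": "Song A", "duration_ms": "215000", "debug_flag": "x"},
    {"title": "Song B", "duration_ms": "187500", "debug_flag": "y"},
]


@dataclass
class Song:
    """Target schema (an assumption made for this sketch)."""
    title: str
    duration_s: float


def process(raw: dict) -> Song:
    # Remove unwanted data: only the fields named here are carried over,
    # so debug_flag is dropped.
    # Convert from one type to another: the duration arrives as text in
    # milliseconds and leaves as a float in seconds.
    return Song(title=raw["title"], duration_s=int(raw["duration_ms"]) / 1000)


songs = [process(record) for record in RAW_SONGS]
```

Each processed record now has exactly the fields of the schema, which is what lets downstream users query it without cleaning it again.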
How data engineers process data
Data manipulation, cleaning, and tidying tasks that can be automated and that will always need to be done | Rejecting corrupt song files; deciding what happens with missing metadata
Store data in a sanely structured database | Separate artists and albums tables...
Create views on top of the database tables | ...but provide a view combining them
Optimize the performance of the database | Indexing
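The last three rows can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names: the table, column, view, and index names below are hypothetical, and the slides do not say which database Spotflix uses.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create separate artists and albums tables, a view, and an index."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Store data in a structured way: one table per entity.
        CREATE TABLE artists (
            artist_id INTEGER PRIMARY KEY,
            name      TEXT NOT NULL
        );
        CREATE TABLE albums (
            album_id  INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            artist_id INTEGER NOT NULL REFERENCES artists(artist_id)
        );
        -- A view on top of the tables that combines them.
        CREATE VIEW album_details AS
            SELECT albums.title AS album, artists.name AS artist
            FROM albums
            JOIN artists ON albums.artist_id = artists.artist_id;
        -- Indexing, so lookups of albums by artist can avoid a full scan.
        CREATE INDEX idx_albums_artist ON albums(artist_id);
        """
    )
    return conn


conn = build_catalog()
conn.execute("INSERT INTO artists VALUES (1, 'Example Artist')")
conn.execute("INSERT INTO albums VALUES (10, 'Example Album', 1)")
rows = conn.execute("SELECT album, artist FROM album_details").fetchall()
```

Users of the view see one combined result, while the underlying data stays split into two tables.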
1 The difference between batch and stream will be explained in the next lesson!
Summary
What data processing is
Why it's necessary
What it consists of
How we process data at Spotflix
Let's practice!
Scheduling data
Hadrien Lacroix
Content Developer at DataCamp
Scheduling
Can apply to any task listed in data processing
Scheduling is the glue of your system
Holds each piece together and organizes how they work together
Runs tasks in a specific order and resolves all dependencies
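Running tasks in a specific order while resolving dependencies can be sketched with a topological sort. The example below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the task names are hypothetical, and real schedulers do considerably more than compute an order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks that must finish
# before it can start.
DEPENDENCIES = {
    "clean_songs": {"load_raw_songs"},
    "update_employee_table": set(),
    "update_department_tables": {"update_employee_table"},
    "build_report": {"clean_songs", "update_department_tables"},
}


def run_order(dependencies: dict) -> list:
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
```

Several orders can be valid for the same graph; the only guarantee is that a task never appears before something it depends on.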
Manual, time, and sensor scheduling
Manually | Manually update the employee table
Automatically run at a specific time | Update the employee table at 6 AM
Automatically run if a specific condition is met (sensor scheduling) | Update the department tables if a new employee was added
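The difference between time and sensor scheduling comes down to what triggers a run. A minimal sketch, assuming hypothetical function names and that employees are tracked as a set of identifiers:

```python
from datetime import datetime


def time_trigger(now: datetime, hour: int = 6) -> bool:
    """Time scheduling: fire when the clock shows the configured hour.

    A real scheduler would also remember the last run so the task is not
    repeated on every check during that hour.
    """
    return now.hour == hour


def sensor_trigger(known_employees: set, current_employees: set) -> bool:
    """Sensor scheduling: fire only if at least one new employee appeared."""
    return bool(current_employees - known_employees)
```

For example, `time_trigger(datetime(2024, 1, 1, 6, 0))` is true regardless of what changed, while `sensor_trigger({"ana"}, {"ana", "bo"})` is true only because `"bo"` is new.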
Batches and streams
Batches:
Group records at intervals
Often cheaper
Examples: songs uploaded by artists, employee table, revenue table
Streams:
Send individual records right away
Example: new users signing in
Another example: online vs. offline listening
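The two modes can be contrasted in a few lines. This sketch groups records by count rather than by time interval, which is a simplification made for illustration; the function names are hypothetical.

```python
from typing import Iterable, Iterator, List


def in_batches(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Batch: hold records back and hand them over in groups."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # hand over the final, possibly smaller, group
        yield batch


def as_stream(records: Iterable[str]) -> Iterator[str]:
    """Stream: pass each record on as soon as it arrives."""
    for record in records:
        yield record


uploads = ["a", "b", "c", "d", "e"]
batched = list(in_batches(uploads, 2))
streamed = list(as_stream(uploads))
```

Both variants deliver the same records; what differs is how long a record waits before it is passed on and how many hand-offs are needed.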
Scheduling tools
Summary
What scheduling is
Different ways to set it up
Difference between batches and streams
How scheduling is implemented at Spotflix
Airflow, Luigi
Let's practice!
Parallel computing
Hadrien Lacroix
Content Developer at DataCamp
Parallel computing
Basis of modern data processing tools
Necessary:
Mainly because of memory
Also for processing power
How it works:
Split tasks up into several smaller subtasks
Distribute these subtasks over several computers
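The split-and-distribute idea can be sketched on a single machine. Here worker threads from `concurrent.futures` stand in for the "several computers", which is an assumption made to keep the example self-contained; the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items: list, n_parts: int) -> list:
    """Split one task's input into at most n_parts smaller chunks."""
    if not items:
        return []
    size = -(-len(items) // n_parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_total(values: list, n_workers: int = 4) -> int:
    """Sum each chunk in its own worker, then combine the partial sums."""
    chunks = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


total = parallel_total(list(range(1, 101)))
```

Each worker only needs its own chunk, which is why splitting helps with memory as well as with processing power.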
1 Emojis by Mohamed Hassan
Benefits and risks of parallel computing
Employees = processing units
Advantages
Extra processing power
Reduced memory footprint
Disadvantages
Moving data incurs a cost
Communication time
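Why the disadvantages matter can be seen with a toy cost model. The formula below is an assumption made purely for illustration and is not from the course: the work divides evenly across workers, while every extra worker adds a fixed amount of communication time.

```python
def estimated_time(work_s: float, n_workers: int, comm_s_per_worker: float) -> float:
    """Toy model (hypothetical): divided work plus per-worker communication."""
    return work_s / n_workers + comm_s_per_worker * n_workers


# 100 seconds of work with 1 second of communication per worker.
one_worker = estimated_time(100, 1, 1)
ten_workers = estimated_time(100, 10, 1)
hundred_workers = estimated_time(100, 100, 1)
```

Under these assumed numbers, ten workers finish much sooner than one, but a hundred workers are no faster than one because communication dominates.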
Summary
Benefits and risks
How it's implemented at Spotflix
Let's practice!
Cloud computing
Hadrien Lacroix
Content Developer
Cloud computing for data processing
Servers on premises:
Bought
Need space
Electrical and maintenance cost
Enough power for peak moments
Processing power unused at quieter times
Servers on the cloud:
Rented
Don't need space
Use just the resources we need, when we need them
The closer to the user the better
Cloud computing for data storage
Database reliability: data replication
Risk with sensitive data
Multicloud
Pros:
Reducing reliance on a single vendor
Cost efficiencies
Local laws requiring certain data to be physically present within the country
Mitigating disasters
Cons:
Cloud providers try to lock in consumers
Incompatibility
Security and governance
Summary
Benefits and risks of cloud computing
How it is implemented at Spotflix
Can cite the main cloud providers and their services
Let's practice!
We are the champions
Hadrien Lacroix
Content Developer at DataCamp
Actually, YOU are the champion!
What you learned - chapter 1
What Data Engineering is
How important it is
How data engineers differ from data scientists
What a data pipeline is and how it works
What you learned - chapter 2
The different structures data can take
How fundamental SQL is
The differences between data lakes, data warehouses and databases
What you learned - chapter 3
How data is processed
How scheduling holds it all together
Parallel computing
Cloud computing
And some more
What SQL code actually looks like
Main tools and technologies used in data engineering
And some more
Lexicon
A promise is a promise, DataChamps!
All the exercises are song titles
Search for "DataChamps" on Spotify
Congratulations!