Data structures
DATA ENGINEERING FOR EVERYONE
Hadrien Lacroix
Content Developer at DataCamp
Structured data
Easy to search and organize
Consistent model, rows and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of the data is structured
Created and queried using SQL
Employee table
index  last_name   first_name  role                team              full_time  office
0      Thien       Vivian      Data Engineer       Data Science      1          Belgium
1      Huong       Julian      Data Scientist      Data Science      1          Belgium
2      Duplantier  Norbert     Software Developer  Infrastructure    1          United Kingdom
3      McColgan    Jeff        Business Developer  Sales             1          United States
4      Sanchez     Rick        Support Agent       Customer Service  0          United States
Relational database
office   address         number  city      zipcode
Belgium  Martelarenlaan  38      Leuven    3010
UK       Old Street      207     London    EC1V 9NR
USA      5th Ave         350     New York  10118
Relational database
index  last_name   first_name  office   address         number  city      zipcode
0      Thien       Vivian      Belgium  Martelarenlaan  38      Leuven    3010
1      Huong       Julian      Belgium  Martelarenlaan  38      Leuven    3010
2      Duplantier  Norbert     UK       Old Street      207     London    EC1V 9NR
3      McColgan    Jeff        USA      5th Ave         350     New York  10118
4      Sanchez     Rick        USA      5th Ave         350     New York  10118
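The combined table above can be produced by joining the employee rows to the office rows on the shared office column. The following is a minimal sketch using Python's built-in sqlite3 module, not necessarily how the course builds it. The table names `employees` and `offices`, the column name `employee_index` (used because `index` is a reserved word in SQL), and the choice to store the short office labels (UK, USA) in both tables are assumptions made for illustration.

```python
import sqlite3

# In-memory database; table and column names are assumptions that mirror the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices (
        office  TEXT PRIMARY KEY,
        address TEXT,
        number  INTEGER,
        city    TEXT,
        zipcode TEXT
    );
    CREATE TABLE employees (
        employee_index INTEGER PRIMARY KEY,
        last_name      TEXT,
        first_name     TEXT,
        office         TEXT
    );
""")

conn.executemany(
    "INSERT INTO offices VALUES (?, ?, ?, ?, ?)",
    [
        ("Belgium", "Martelarenlaan", 38, "Leuven", "3010"),
        ("UK", "Old Street", 207, "London", "EC1V 9NR"),
        ("USA", "5th Ave", 350, "New York", "10118"),
    ],
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (0, "Thien", "Vivian", "Belgium"),
        (1, "Huong", "Julian", "Belgium"),
        (2, "Duplantier", "Norbert", "UK"),
        (3, "McColgan", "Jeff", "USA"),
        (4, "Sanchez", "Rick", "USA"),
    ],
)

# Each employee row is matched to the office row with the same office value.
rows = conn.execute("""
    SELECT e.employee_index, e.last_name, e.first_name,
           o.office, o.address, o.number, o.city, o.zipcode
    FROM employees AS e
    JOIN offices AS o ON e.office = o.office
    ORDER BY e.employee_index
""").fetchall()

for row in rows:
    print(row)
```

Storing each address once in `offices` means a change of address is made in a single row, while the join rebuilds the wide view on demand.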
Semi-structured data
Relatively easy to search and organize
Consistent model, less-rigid implementation: different observations have different sizes
Different types
Can be grouped, but needs more work
Stored in NoSQL databases or in formats such as JSON, XML, YAML
Favorite artists JSON file
{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  },
  ...
}
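To illustrate "different observations have different sizes", here is a small sketch that loads the three users shown above with Python's built-in json module and counts each user's favorite artists. It assumes the file is valid JSON (each user ID is a key of one outer object) and leaves out the elided entries; the variable names are illustrative only.

```python
import json

# The three users from the slide, written as one valid JSON object.
raw = """
{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  }
}
"""

users = json.loads(raw)

# Every user follows the same model, but the list length varies per user.
artist_counts = {
    user_id: len(user["favorite_artists"]) for user_id, user in users.items()
}
print(artist_counts)
# {'user_1645156': 4, 'user_5913764': 2, 'user_8436791': 3}
```

A relational table would need either a fixed number of artist columns or a separate artists table; the JSON model simply lets each list grow or shrink.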
Unstructured data
Does not follow a model, can't be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures or videos
Usually stored in data lakes, can appear in data warehouses or databases
Most of the data is unstructured
Can be extremely valuable
Adding some structure
Use AI to search and organize unstructured data
Add information to make it semi-structured
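As a rough sketch of "adding information", the snippet below wraps a piece of free text in a record with a word count and a few tags, which makes it searchable like semi-structured data. The keyword lookup is only a stand-in for the AI step mentioned above; `KEYWORD_TAGS`, `add_structure`, and the sample sentence are hypothetical and not from the course.

```python
import json

# Hypothetical keyword -> tag map; a real pipeline might use a trained model instead.
KEYWORD_TAGS = {
    "guitar": "music",
    "concert": "music",
    "invoice": "finance",
    "refund": "finance",
}


def add_structure(text):
    """Wrap free text in a record with simple, searchable metadata."""
    # Normalize words so "concert." matches the keyword "concert".
    words = {word.strip(".,!?").lower() for word in text.split()}
    tags = sorted({tag for keyword, tag in KEYWORD_TAGS.items() if keyword in words})
    return {"text": text, "word_count": len(text.split()), "tags": tags}


record = add_structure("Please send the invoice after the concert.")
print(json.dumps(record, indent=2))
```

The original text is kept untouched inside the record; only the surrounding fields are new, so nothing is lost by adding structure.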
Summary
Structured data
Semi-structured data
Unstructured data
Differences between the three
Give examples
Let's practice!
SQL databases
Hadrien Lacroix
Content Developer at DataCamp
SQL
Structured Query Language
Industry standard for relational database management systems (RDBMS)
Allows you to access many records at once, and group, filter or aggregate them
Close to written English, easy to write and understand
Data engineers use SQL to create and maintain databases
Data scientists use SQL to query (request information from) databases
Remember the employees table
index  last_name   first_name  role                team              full_time  office
0      Thien       Vivian      Data Engineer       Data Science      1          Belgium
1      Huong       Julian      Data Scientist      Data Science      1          Belgium
2      Duplantier  Norbert     Software Developer  Infrastructure    1          United Kingdom
3      McColgan    Jeff        Business Developer  Sales             1          United States
4      Sanchez     Rick        Support Agent       Customer Service  0          United States
SQL for data engineers
Data engineers use SQL to create, maintain and update tables.
CREATE TABLE employees (
    employee_id INT,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    role VARCHAR(255),
    team VARCHAR(255),
    full_time BOOLEAN,
    office VARCHAR(255)
);
SQL for data scientists
Data scientists use SQL to query, filter, group and aggregate data in tables.
SELECT first_name, last_name
FROM employees
WHERE role LIKE '%Data%';
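To see both roles side by side, the sketch below runs the table definition and the query against an in-memory SQLite database using Python's built-in sqlite3 module. SQLite is just one convenient implementation to try this with; the `ORDER BY` clauses and the extra `GROUP BY` query are additions for illustration and deterministic output, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data engineer side: create the table and load the rows from the employees slide.
conn.execute("""
    CREATE TABLE employees (
        employee_id INT,
        first_name VARCHAR(255),
        last_name VARCHAR(255),
        role VARCHAR(255),
        team VARCHAR(255),
        full_time BOOLEAN,
        office VARCHAR(255)
    )
""")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "Vivian", "Thien", "Data Engineer", "Data Science", 1, "Belgium"),
        (1, "Julian", "Huong", "Data Scientist", "Data Science", 1, "Belgium"),
        (2, "Norbert", "Duplantier", "Software Developer", "Infrastructure", 1, "United Kingdom"),
        (3, "Jeff", "McColgan", "Business Developer", "Sales", 1, "United States"),
        (4, "Rick", "Sanchez", "Support Agent", "Customer Service", 0, "United States"),
    ],
)

# Data scientist side: filter on the role column.
data_people = conn.execute("""
    SELECT first_name, last_name
    FROM employees
    WHERE role LIKE '%Data%'
    ORDER BY employee_id
""").fetchall()
print(data_people)
# [('Vivian', 'Thien'), ('Julian', 'Huong')]

# Grouping and aggregating: head count per team.
team_sizes = conn.execute("""
    SELECT team, COUNT(*)
    FROM employees
    GROUP BY team
    ORDER BY team
""").fetchall()
print(team_sizes)
# [('Customer Service', 1), ('Data Science', 2), ('Infrastructure', 1), ('Sales', 1)]
```

Note that the filter looks at `role`, not `team`, so it matches "Data Engineer" and "Data Scientist" because of their job titles.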
Database schema
Databases are made of tables
The database schema governs how tables are related
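One common way a schema relates tables is a foreign key: a column in one table that must match a key in another. The sketch below is an assumption-laden illustration using Python's sqlite3 module, with simplified `offices` and `employees` tables and a made-up office name ("Atlantis") to trigger the check. Other database systems express the same idea with slightly different syntax and enforce it by default, whereas SQLite needs the `PRAGMA` shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE offices (
        office TEXT PRIMARY KEY,
        city   TEXT
    );
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        last_name   TEXT,
        office      TEXT REFERENCES offices(office)
    );
    INSERT INTO offices VALUES ('Belgium', 'Leuven');
    INSERT INTO employees VALUES (0, 'Thien', 'Belgium');
""")

# An employee pointing at an office that does not exist violates the schema.
try:
    conn.execute("INSERT INTO employees VALUES (1, 'Huong', 'Atlantis')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

The relationship lives in the schema rather than in application code, so every program writing to the database is held to the same rule.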
Several implementations
SQLite
MySQL
PostgreSQL
Oracle SQL
SQL Server
Summary
SQL = industry standard
Explain how data engineers and data scientists use it differently
Database schema
SQL implementations
Let's practice!
Data warehouses and data lakes
Hadrien Lacroix
Content Developer
Warehouses with a stunning view of the lake
Data lakes and data warehouses
Data lake
Stores all the raw data
Can be petabytes (1 petabyte = 1 million GB)
Stores all data structures
Cost-effective
Difficult to analyze
Requires an up-to-date data catalog
Used by data scientists
Big data, real-time analytics
Data warehouse
Specific data for specific use
Relatively small
Stores mainly structured data
More costly to update
Optimized for data analysis
Also used by data analysts and business analysts
Ad-hoc, read-only queries
Data catalog for data lakes
What is the source of this data?
Where is this data used?
Who is the owner of the data?
How often is this data updated?
Good practice in terms of data governance
Ensures reproducibility
No catalog --> data swamp
Good practice for any data storage solution
Reliability
Autonomy
Scalability
Speed
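The four catalog questions above map naturally onto the fields of a record. The following is a minimal in-memory sketch, not a real catalog tool: `CatalogEntry`, `register`, `datasets_owned_by`, and all dataset names and values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    dataset: str           # name of the dataset in the lake
    source: str            # What is the source of this data?
    used_in: list          # Where is this data used?
    owner: str             # Who is the owner of the data?
    update_frequency: str  # How often is this data updated?


# The catalog itself: one entry per dataset name.
catalog = {}


def register(entry):
    """Add or replace the catalog entry for a dataset."""
    catalog[entry.dataset] = entry


def datasets_owned_by(owner):
    """Return the names of all datasets registered to one owner, sorted."""
    return sorted(name for name, entry in catalog.items() if entry.owner == owner)


# Hypothetical entries, for illustration only.
register(CatalogEntry(
    dataset="raw_listening_events",
    source="mobile app logs",
    used_in=["recommendations"],
    owner="data-engineering",
    update_frequency="hourly",
))
register(CatalogEntry(
    dataset="support_call_audio",
    source="call center recordings",
    used_in=[],
    owner="customer-service",
    update_frequency="daily",
))

print(datasets_owned_by("data-engineering"))
```

Even a structure this small answers "who owns this?" and "how fresh is it?" without opening the data itself, which is what keeps a lake from turning into a swamp.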
Database vs. data warehouse
Database:
General term
Loosely defined as organized data stored and accessed on a computer
Data warehouse is a type of database
Summary
Data lakes
Data warehouses
Databases
Data catalog
Let's practice!