11/30/2024
Column-Stores vs. Row-Stores
Contents
Column-store introduction
Column-store data model
Emulation of a column store in a row store
Column-store optimizations
Experiment and Results
Conclusion
Row Store and Column Store
Figure taken from [2]
In a row store, data are stored on disk tuple by tuple, whereas in a
column store, data are stored on disk column by column.
Row Store and Column Store
Most queries do not process all the attributes of a particular relation.
For example, the query

SELECT c.name, c.address
FROM CUSTOMER AS c
WHERE c.region = 'abc';

processes only three attributes of the CUSTOMER relation, even though the
relation can have many more attributes.
Column stores are more I/O efficient for read-only queries, as they read
only those attributes that are accessed by a query.
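A minimal sketch (Python, with made-up data) of this point: with a column-wise layout, a query touching only name, address, and region never reads the other attributes, whereas a row-wise layout pulls entire tuples.

# Toy illustration of row-wise vs. column-wise layouts (hypothetical data).
rows = [
    {"id": 1, "name": "Ann",  "address": "1 Elm St", "region": "abc", "phone": "555-0101", "balance": 10.0},
    {"id": 2, "name": "Bob",  "address": "9 Oak Av", "region": "xyz", "phone": "555-0102", "balance": 25.5},
    {"id": 3, "name": "Cara", "address": "5 Ash Rd", "region": "abc", "phone": "555-0103", "balance": 7.25},
]

# Column-wise layout: one list ("file") per attribute.
columns = {attr: [r[attr] for r in rows] for attr in rows[0]}

# Row store: every tuple is read in full, even though only 3 of 6 attributes are needed.
row_result = [(r["name"], r["address"]) for r in rows if r["region"] == "abc"]

# Column store: only the name, address, and region columns are touched.
col_result = [
    (n, a)
    for n, a, reg in zip(columns["name"], columns["address"], columns["region"])
    if reg == "abc"
]

assert row_result == col_result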
Row Store and Column Store
Row Store:
(+) Easy to add/modify a record
(-) Might read in unnecessary data

Column Store:
(+) Only need to read in relevant data
(-) Tuple writes require multiple accesses
So column stores are suitable for read-mostly, read-
intensive, large data repositories
Why Column Stores?
Can be significantly faster than row stores for some
applications
Fetch only required columns for a query
Better cache effects
Better compression (similar attribute values within a column)
But can be slower for other applications
OLTP with many row inserts, ..
Column Stores - Data Model
Standard relational logical data model
EMP(name, age, salary, dept)
DEPT(dname, floor)
Table – collection of projections
Projection – set of columns
Horizontally partitioned into segments, each with a segment identifier
Column Stores - Data Model
To answer queries, projections are joined using storage keys and join indexes.
Storage Keys:
Within a segment, every data value of every column is associated with a
unique storage key (SKey)
Values from different columns with a matching SKey belong to the same
logical row
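A minimal sketch of the storage-key idea, assuming a segment is just a set of parallel column arrays; the array position plays the role of the SKey, so values at the same position in different columns form one logical row. Names and data are illustrative.

# One segment of an EMP projection: parallel column arrays.
# The array index plays the role of the storage key (SKey).
segment = {
    "name":   ["Ann", "Bob", "Cara"],
    "salary": [50000, 62000, 58000],
    "dept":   ["HR", "ENG", "ENG"],
}

def logical_row(segment, skey):
    """Reassemble the logical row whose values all share the same SKey."""
    return {col: values[skey] for col, values in segment.items()}

print(logical_row(segment, 1))   # {'name': 'Bob', 'salary': 62000, 'dept': 'ENG'}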
Column Stores – Data Model
Join Indexes
T1 and T2 are projections on T
M segments in T1 and N segments in T2
Join Index from T1 to T2 is a table of the form:
(s: Segment ID in T2, k: Storage key in Segment s)
Each row in the join index corresponds to a row in T1
Join indexes are built such that T can be efficiently reconstructed from
T1 and T2
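A minimal sketch of a join index, assuming each projection is a list of segments and each segment is a dict of column arrays indexed by SKey; the join index stores, for every row of T1 in storage order, the (segment id, SKey) of the matching row in T2. All names and data are illustrative.

# T1: projection (name) sorted one way; T2: projection (salary) sorted another way.
t1_segments = [{"name": ["Ann", "Bob"]}, {"name": ["Cara"]}]
t2_segments = [{"salary": [62000, 58000]}, {"salary": [50000]}]

# Join index from T1 to T2: one (segment_id, skey) entry per row of T1,
# listed in T1's storage order (segment by segment, SKey by SKey).
join_index = [(1, 0),   # T1 row "Ann"  -> T2 segment 1, SKey 0 (salary 50000)
              (0, 0),   # T1 row "Bob"  -> T2 segment 0, SKey 0 (salary 62000)
              (0, 1)]   # T1 row "Cara" -> T2 segment 0, SKey 1 (salary 58000)

# Reconstruct T (name, salary) by walking T1 and following the join index.
reconstructed = []
i = 0
for seg in t1_segments:
    for name in seg["name"]:
        s, k = join_index[i]
        reconstructed.append((name, t2_segments[s]["salary"][k]))
        i += 1

print(reconstructed)  # [('Ann', 50000), ('Bob', 62000), ('Cara', 58000)]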
Compression
Trades I/O for CPU
Increased column-store opportunities:
Higher data value locality in column stores
Techniques such as run-length encoding are far more useful
Schemes (two of these are sketched below):
Null Suppression
Dictionary encoding
Run Length encoding
Bit-Vector encoding
Heavyweight schemes
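A minimal sketch of two of the lightweight schemes above, run-length encoding and dictionary encoding, applied to a sorted column; the data is made up.

from itertools import groupby

region = ["abc", "abc", "abc", "xyz", "xyz", "pqr"]   # sorted column

# Run-length encoding: (value, run length) pairs.
rle = [(v, len(list(g))) for v, g in groupby(region)]
print(rle)                    # [('abc', 3), ('xyz', 2), ('pqr', 1)]

# Dictionary encoding: replace each string by a small integer code.
dictionary = {v: i for i, v in enumerate(sorted(set(region)))}
encoded = [dictionary[v] for v in region]
print(dictionary, encoded)    # {'abc': 0, 'pqr': 1, 'xyz': 2} [0, 0, 0, 2, 2, 1]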
Query Execution - Operators
Select: Same as relational algebra, but produces a bit
string
Project: Same as relational algebra
Join: Joins projections according to predicates
Aggregation: SQL like aggregates
Sort: Sort all columns of a projection
Query Execution - Operators
Decompress: Converts compressed column to
uncompressed representation
Mask(Bitstring B, Projection Cs) => emit only those
values whose corresponding bits are 1
Concat: Combines one or more projections sorted in
the same order into a single projection
Permute: Permutes a projection according to the
ordering defined by a join index
Bitstring operators: Band – Bitwise AND, Bor – Bitwise
OR, Bnot – complement
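A minimal sketch of the Mask and bitstring operators above, using Python lists of 0/1 as bit strings; in a real engine these would be packed bit vectors, and all data here is illustrative.

def band(b1, b2):
    """Bitwise AND of two bit strings."""
    return [x & y for x, y in zip(b1, b2)]

def bor(b1, b2):
    """Bitwise OR of two bit strings."""
    return [x | y for x, y in zip(b1, b2)]

def bnot(b):
    """Complement of a bit string."""
    return [1 - x for x in b]

def mask(bitstring, column):
    """Emit only those column values whose corresponding bit is 1."""
    return [v for v, bit in zip(column, bitstring) if bit]

ages = [25, 40, 31, 52]
b1 = [1, 0, 1, 1]            # e.g. output of one Select
b2 = [1, 1, 0, 1]            # e.g. output of another Select
print(mask(band(b1, b2), ages))   # [25, 52]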
Row Store vs. Column Store
The simplistic view of the difference in storage layout suggests that one
can obtain the performance benefits of a column store from a row store by
making some changes to the row store's physical structure. These changes
can be:
Vertical partitioning
Using index-only plans
Using materialized views
Vertical Partitioning
Process:
Full vertical partitioning of each relation
Each column = 1 physical table
This is achieved by adding an integer position column to every table
Adding an integer position is better than adding the primary key
Join on position for multi-column fetches (see the sketch below)
Problems:
The "position" column costs space and disk bandwidth
A header for every tuple causes further space wastage
e.g. 24-byte overhead in PostgreSQL
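A minimal sketch of this emulation, assuming each column becomes its own two-column table (position, value) and multi-column fetches join on position; table and value names are illustrative.

# Each column of CUSTOMER becomes its own "table" of (position, value) pairs.
name_tbl   = [(1, "Ann"), (2, "Bob"), (3, "Cara")]
region_tbl = [(1, "abc"), (2, "xyz"), (3, "abc")]

# Multi-column fetch = join on position.
names_by_pos = dict(name_tbl)
result = [names_by_pos[pos] for pos, region in region_tbl if region == "abc"]
print(result)   # ['Ann', 'Cara']

# Cost: every per-column table repeats the position value (and, in a real
# row store, carries a per-tuple header, e.g. ~24 bytes in PostgreSQL).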
Index-only plans: Example
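A minimal sketch of the index-only idea, assuming an unclustered index on every column maps values to tuple positions, so a query can be answered from the indexes alone without fetching base tuples; data and names are illustrative.

from collections import defaultdict

# Base table (never touched at query time in an index-only plan).
facts = [("c1", 15), ("c2", 30), ("c3", 42), ("c4", 8)]   # (custID, price)

# Build an unclustered index on each column: value -> set of positions.
def build_index(column):
    idx = defaultdict(set)
    for pos, value in enumerate(column):
        idx[value].add(pos)
    return idx

custid_idx = build_index([c for c, _ in facts])
price_idx  = build_index([p for _, p in facts])

# SELECT custID FROM Facts WHERE price > 20, answered from the indexes alone:
positions = set().union(*(ps for price, ps in price_idx.items() if price > 20))
result = [cid for cid, ps in custid_idx.items() if ps & positions]
print(sorted(result))   # ['c2', 'c3']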
Materialized Views
Process:
Create an "optimal" set of MVs for a given query workload
Objective:
Provide just the required data
Avoid overheads
Expected to perform better than the other two approaches
Problems:
Practical only in limited situations
Requires knowledge of query workloads in advance
Materialized Views: Example
SELECT F.custID
FROM Facts AS F
WHERE F.price > 20
Optimizing Column oriented Execution
Different optimizations for column-oriented databases:
Compression
Late Materialization
Block Iteration
Invisible Join
Compression
Low information entropy (high data value locality) leads to a high
compression ratio
Advantages:
Disk space is saved
Less I/O
CPU cost decreases if we can perform operations without decompressing
(see the sketch below)
Lightweight compression schemes do better
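A minimal sketch of that last point, operating directly on a run-length-encoded column without decompressing it; the column and counts are hypothetical.

# Run-length-encoded, sorted region column: (value, run_length) pairs.
rle_region = [("abc", 1000), ("pqr", 250), ("xyz", 4000)]

# "How many customers are in region 'abc'?" -- answered on the compressed
# representation, one pair at a time, without expanding 5250 values.
count = sum(run for value, run in rle_region if value == "abc")
print(count)   # 1000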
Compression
If data is sorted on one column, that column will be super-compressible
in a row store
e.g. run-length encoding
Figure taken from [2]
Late Materialization
Most query results are entity-at-a-time, not column-at-a-time
So at some point multiple columns must be combined
One simple approach is to join the columns relevant for a particular query
But performance can be further improved using late materialization
Late Materialization
Delay tuple construction
Might avoid constructing tuples altogether
Intermediate position lists might need to be constructed
E.g.: SELECT R.a FROM R WHERE R.c = 5 AND R.b = 10
The output of each predicate is a bit string
Perform a bitwise AND
Use the final position list to extract R.a (see the sketch below)
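A minimal sketch of this example, assuming R is stored as three column arrays: each predicate produces a bit string, the bit strings are ANDed, the result becomes a position list, and only then is R.a touched. The data is illustrative.

# Columns of R (illustrative data).
a = [10, 11, 12, 13, 14]
b = [10, 99, 10, 10, 10]
c = [ 5,  5,  7,  5,  9]

# SELECT R.a FROM R WHERE R.c = 5 AND R.b = 10, with late materialization:
bits_c = [1 if v == 5  else 0 for v in c]        # predicate on R.c -> bit string
bits_b = [1 if v == 10 else 0 for v in b]        # predicate on R.b -> bit string
bits   = [x & y for x, y in zip(bits_c, bits_b)] # bitwise AND

positions = [i for i, bit in enumerate(bits) if bit]   # final position list
result = [a[i] for i in positions]                     # only now is R.a read
print(result)   # [10, 13]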
Late Materialization
Advantages
Unnecessary construction of tuples is avoided
Direct operation on compressed data
Cache performance is improved (PAX)
Thank You!