DATS310
Databases for Big Data
DR. RICHA SHARMA
COMMONWEALTH UNIVERSITY
Introduction
Architecture for databases:
Focuses on the storage and organization of information so that
data can be easily accessed and modified (insert, update, and
delete operations).
Database design and application development depend
heavily on database architecture!
The architectural design of a database varies, just as network
topology varies.
Helps identify which database design is best suited to
the problem at hand, i.e. the application to be
developed!
Tools/Technologies for Big Data
A few examples:
Apache Hadoop, Spark, Kafka, Hive, Storm
MongoDB and CouchDB
Redis, Cassandra and Neo4j
Druid and Google BigQuery
AWS DynamoDB
Tableau
Questions to explore
Type of database – does the problem at hand require a
relational, key-value, columnar, document-oriented, or
graph database?
Nature of problem and usage of database – does the
problem require flexibility or does it require parallel
processing?
Communication interface of database – are we going to
interact with the database through an interactive command-line
interface, or through an application that requires database
connectivity and a programming-language interface?
Questions to explore
Unique characteristic of database – Any database will support
writing data and reading it back again, but what makes it
unique? Some allow querying on arbitrary fields; some
provide indexing for rapid lookup; some support ad hoc
queries, while queries must be planned for others.
Performance – How does this database function and at what
cost? How about replication? Is this database tuned for
reading, writing, or some other operation?
Scalability – Scalability is closely related to performance. The
point to explore is whether the database is geared more toward
horizontal scaling (MongoDB, HBase, DynamoDB),
traditional vertical scaling (Postgres, Neo4j, Redis), or
something in between.
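Horizontal scaling means spreading data across several machines instead of buying one bigger machine. One common way to do this (not the only one, and not necessarily what any of the databases named above does internally) is hash partitioning: each key is hashed and the hash decides which machine, or shard, stores it. The sketch below assumes a fixed number of shards; the function name `shard_for` and the sample keys are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number in the range 0 .. num_shards - 1."""
    # Use a stable hash so every client computes the same shard for a key.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical user keys spread over four shards.
placement = {key: shard_for(key, 4) for key in ["user:1", "user:2", "user:3"]}
```

With a plain modulo scheme like this, changing the number of shards moves most keys to a different shard, which is why production systems tend to use more elaborate partitioning schemes.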
RDBMS vs Big Databases
Key-Value Pair Database
The simplest database model, storing data as key-value (KV) pairs,
much like a hash table.
Some KV implementations provide a means of iterating
through the keys, but not all!
A file system can be considered a key-value store if we treat
the file path as the key and the file contents as the value.
Since this database model doesn’t require complex data
structures for storage, it can be incredibly performant in a
number of scenarios but generally won’t be helpful when we
have complex query and aggregation requirements.
Examples: Redis, DynamoDB, Voldemort, Riak, etc.
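The key-value model can be illustrated with a minimal in-memory sketch built on a Python dict, which is itself a hash table. The class name `KeyValueStore` and its methods are hypothetical and do not mirror the API of Redis, DynamoDB, or any other product. The keys are file paths, echoing the file-system analogy.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Writing an existing key simply overwrites its value.
        self._data[key] = value

    def get(self, key):
        # Return None when the key is absent.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def keys(self):
        # Key iteration is offered here, but not every KV store provides it.
        return iter(sorted(self._data))


store = KeyValueStore()
store.put("/home/alice/notes.txt", b"buy milk")
store.put("/home/alice/todo.txt", b"review slides")
notes = store.get("/home/alice/notes.txt")
```

The only ways to reach a value are by its exact key or by walking all keys, which is why complex queries and aggregations are a poor fit for this model.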
Columnar Database
Columnar, or column-oriented, databases are so named
because they store the data from a given column
(in the two-dimensional table sense) together, as opposed to
row-oriented databases (RDBMS), which keep each row together.
Adding columns to a table is quite inexpensive in these
databases, and it is done on a row-by-row basis.
Each row can have a different set of columns, or none at all,
allowing tables to remain sparse without incurring a storage
cost for null values.
With respect to structure, columnar is about midway between
relational and key-value. Examples: HBase, Cassandra, etc.
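A rough way to picture the columnar idea is a table that groups values by column and lets each row carry only the columns it actually has. The sketch below is a simplification under that assumption; the class `SparseColumnTable` and the sample data are hypothetical, and real systems such as HBase or Cassandra use far more involved on-disk layouts.

```python
from collections import defaultdict


class SparseColumnTable:
    """Toy table that keeps the values of each column together."""

    def __init__(self):
        # column name -> {row key -> value}
        self._columns = defaultdict(dict)

    def put(self, row_key, column, value):
        # Adding a new column for one row costs nothing for other rows.
        self._columns[column][row_key] = value

    def get_row(self, row_key):
        # Reassemble a row from the columns that hold a value for it.
        return {
            column: cells[row_key]
            for column, cells in self._columns.items()
            if row_key in cells
        }

    def scan_column(self, column):
        # Reading one column touches only that column's data.
        return dict(self._columns.get(column, {}))


table = SparseColumnTable()
table.put("user1", "name", "Asha")
table.put("user1", "email", "asha@example.com")
table.put("user2", "name", "Ben")
table.put("user2", "phone", "555-0100")
```

Here `user1` has no `phone` and `user2` has no `email`, yet nothing is stored for the missing cells, so the table stays sparse without paying for null values.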
Document Database
Document databases store documents, where a document is like a
hash with a unique ID field and values that may be any of a
variety of types, including more hashes.
Documents can contain nested structures, and so they exhibit
a high degree of flexibility, allowing for variable domains.
The system imposes few restrictions on incoming data, as
long as it meets the basic requirement of being expressible as
a document.
Different document databases take different approaches with
respect to indexing, ad hoc querying, replication, consistency,
and other design decisions.
Examples: MongoDB, CouchDB, etc.
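The document model can be sketched with nested Python dicts standing in for documents. Everything here is an illustrative assumption: the class `DocumentCollection`, the helper `lookup`, the `_id` field name, and the dotted-path query style are hypothetical and are not the API of MongoDB or CouchDB.

```python
import uuid


def lookup(doc, path):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


class DocumentCollection:
    """Toy document store: each document is a dict with a unique ID."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # Generate an ID when the caller does not supply one.
        doc_id = doc.get("_id") or str(uuid.uuid4())
        self._docs[doc_id] = {**doc, "_id": doc_id}
        return doc_id

    def find(self, path, expected):
        # Full scan; real systems add indexes to avoid this.
        return [d for d in self._docs.values() if lookup(d, path) == expected]


people = DocumentCollection()
people.insert({"_id": "p1", "name": "Asha", "address": {"city": "Pune"}})
people.insert({"_id": "p2", "name": "Ben", "tags": ["admin", "staff"]})
in_pune = people.find("address.city", "Pune")
```

The two documents have different shapes (one has a nested `address`, the other a list of `tags`), and the collection accepts both without any schema change.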
Graph Database
One of the less commonly used database styles, graph databases
excel at working with highly interconnected data.
A graph database consists of nodes and relationships
between nodes.
Both nodes and relationships can have properties (key-value
pairs) that store data.
Real strength of graph databases is traversing through the
nodes by following relationships.
Example: Neo4j etc.
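The node, relationship, and traversal ideas can be sketched in a few lines of Python. This is a toy under stated assumptions: the class `Graph`, the relationship type `FRIEND`, and the sample people are hypothetical, and a real graph database such as Neo4j offers its own query language rather than this interface.

```python
from collections import deque


class Graph:
    """Toy property graph: nodes and relationships both carry properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> [(relationship type, target id, properties)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_relationship(self, source, rel_type, target, **properties):
        # Both endpoints are expected to have been added as nodes first.
        self.edges[source].append((rel_type, target, properties))

    def traverse(self, start, rel_type, max_depth):
        """Breadth-first walk: ids reachable via rel_type within max_depth hops."""
        seen = {start}
        queue = deque([(start, 0)])
        reached = []
        while queue:
            node_id, depth = queue.popleft()
            if depth == max_depth:
                continue
            for kind, target, _props in self.edges[node_id]:
                if kind == rel_type and target not in seen:
                    seen.add(target)
                    reached.append(target)
                    queue.append((target, depth + 1))
        return reached


g = Graph()
for person in ["alice", "bob", "carol", "dave"]:
    g.add_node(person, name=person.title())
g.add_relationship("alice", "FRIEND", "bob", since=2019)
g.add_relationship("bob", "FRIEND", "carol", since=2021)
g.add_relationship("carol", "FRIEND", "dave", since=2022)
friends_of_friends = g.traverse("alice", "FRIEND", max_depth=2)
```

Answering "who is within two hops of alice" is a short walk over stored relationships here, whereas a relational schema would typically need one self-join per hop.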