BI, Reporting and Analytics on Apache Cassandra

27/10/2015
Victor Coustenoble Solutions Engineer
victor.coustenoble@datastax.com
@vizanalytics

Agenda
• DataStax & Apache Cassandra
• Data Modeling and CQL
• Data Access
• Reporting and Analytics
• DataStax Enterprise Analytics
• Architectures
• Hadoop + Cassandra use cases
©2014 DataStax Confidential. Do not distribute without consent.

Elevator Pitch
“DataStax delivers Apache Cassandra in a database platform
purpose-built for the performance and availability demands
of Web, Mobile, and IoT applications, giving enterprises a
secure always-on database that remains operationally simple
when scaled in a single datacenter or across multiple
datacenters and clouds.”

No Vertical Market Concentration

Functional use cases
• Messaging
• Collections / Playlists
• Fraud detection
• Recommendation / Personalization
• Internet of Things / Sensor data

Apache Cassandra™
• Massively scalable, Open Source, NoSQL, distributed database built for modern,
mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed and data center aware
• 100% uptime
• Predictable scaling
• High Performance
• Multi Data Center
• Time Series
• Tunable Consistency
• Simple to Operate
• CQL language
• OpsCenter / DevCenter
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Data Modeling
Cassandra is unlike well-known RDBMS systems:
• Not a relational model
• No foreign keys, no joins, no aggregations
• Modeling is guided by the queries to support: data access paths and
operations (filtering, grouping and ordering needs)
Denormalisation
• Combine columns from different tables into a single table (“materialized
view”), no joins!
• Better performance, less data traffic
• Don’t be afraid to duplicate data, to write data
• Avoid joins at client level
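
The denormalisation idea above can be sketched in plain Python. This is only an in-memory illustration, not Cassandra code: the `teams` and `players` data, the `city` column and the `build_query_table` helper are all hypothetical. Team columns are copied into every player row at write time, so a read is a single lookup by key.

```python
from collections import defaultdict

# Hypothetical source data, as it might sit in two normalised tables.
teams = {"PSG": {"city": "Paris"}}
players = [
    {"team_name": "PSG", "player_name": "Zlatan", "jersey": 10},
    {"team_name": "PSG", "player_name": "Player B", "jersey": 9},
]


def build_query_table(teams, players):
    """Copy team columns into each player row at write time."""
    table = defaultdict(list)
    for player in players:
        row = dict(player)
        # Deliberate duplication: the team's city is stored with every player.
        row["city"] = teams[player["team_name"]]["city"]
        table[player["team_name"]].append(row)
    return table


players_by_team = build_query_table(teams, players)

# A read is one lookup by key; no join is needed at query time.
psg_rows = players_by_team["PSG"]
```

The cost is paid on write (more data stored, more rows to update when a team changes), which is the trade-off the slide recommends accepting.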

Cassandra Data Model
• Based on Google Bigtable
• Row-oriented column family
• De-normalised
CREATE TABLE sporty_league (
team_name varchar,
player_name varchar,
jersey int,
PRIMARY KEY (team_name, player_name)
);
SELECT * FROM sporty_league;
The primary key uniquely identifies a row.
A composite primary key consists of:
• A partition key
• One or more clustering columns
e.g. PRIMARY KEY (partition key, clustering columns, ...)
• The partition key determines on which node the partition resides
• Data is ordered by clustering column within the partition
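
A minimal sketch of the two roles of the primary key, in plain Python. Assumptions: the three node names are hypothetical, and a SHA-256 hash taken modulo the node count stands in for Cassandra's token ring (the default partitioner uses Murmur3 tokens). Rows inside a partition are kept sorted by the clustering column.

```python
import hashlib
from bisect import insort

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Map a partition key to a node by hashing it (stand-in for the token ring)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


class Partition:
    """Rows of one partition, kept sorted by the clustering column."""

    def __init__(self):
        self.rows = []  # (player_name, jersey) tuples, ordered by player_name

    def insert(self, player_name, jersey):
        insort(self.rows, (player_name, jersey))


partitions = {}
sample_rows = [("PSG", "Zlatan", 10), ("PSG", "Adrien", 25), ("OM", "Steve", 30)]
for team_name, player_name, jersey in sample_rows:
    partitions.setdefault(team_name, Partition()).insert(player_name, jersey)

# All PSG rows live in one partition and come back in clustering order.
psg_players = [name for name, _ in partitions["PSG"].rows]
psg_node = node_for("PSG")
```

With `PRIMARY KEY (team_name, player_name)`, `team_name` plays the role of the hashed key and `player_name` the sort key within each partition.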

CQL – Cassandra Query Language
• Data types: BLOB, UUID, TIMEUUID, User Defined Types, …
• User Defined Functions, User Defined Aggregates
• Collections : Map, List, Set
• TTL (Time-To-Live) at column level
• Counters
• Lightweight Transactions (LWT): resolve race conditions
with IF NOT EXISTS
• Batch statements
• Secondary Index
• Very similar to RDBMS SQL syntax
• Core DML and DDL commands supported: INSERT, UPDATE, DELETE, SELECT, CREATE, GRANT …
INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('PSG', 'Zlatan', 10);
SELECT player_name AS nom_joueur FROM sporty_league WHERE team_name = 'PSG';
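
The race condition that lightweight transactions address can be illustrated with a small in-memory sketch. Everything here is hypothetical: `TinyTable` is not a driver API, and a local lock stands in for the Paxos-based consensus Cassandra uses. A plain insert is an upsert where the last write wins, while the conditional insert reports whether it was applied, much like the `[applied]` column returned by `INSERT ... IF NOT EXISTS`.

```python
import threading


class TinyTable:
    """In-memory stand-in for a table addressed by primary key."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert(self, key, value):
        """Plain INSERT behaves as an upsert: the last write silently wins."""
        with self._lock:
            self._rows[key] = value

    def insert_if_not_exists(self, key, value):
        """Conditional insert: returns (applied, row currently stored)."""
        with self._lock:
            if key in self._rows:
                return False, self._rows[key]
            self._rows[key] = value
            return True, value

    def get(self, key):
        with self._lock:
            return self._rows.get(key)


accounts = TinyTable()
first = accounts.insert_if_not_exists("zlatan", {"email": "a@example.com"})
second = accounts.insert_if_not_exists("zlatan", {"email": "b@example.com"})
```

Here `first` is applied and `second` is rejected, so two clients racing to create the same username cannot both succeed.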
DevCenter

Cassandra Data Access
CQL language via cqlsh (command line), DevCenter
(development environment) or drivers
• Drivers on Cassandra native protocol
• cqlsh COPY command
• Import/export tools for bulk loading
• Connectors in ETL solutions (Talend, Informatica)
• Via analytics layers Spark and Hadoop
• Via ODBC/JDBC drivers

Cassandra Clients - Native Driver
DataStax drivers available and supported: Java, Python, C#, C++, Ruby, Node.js,
PHP (more to come, such as Scala, Go…)
This includes:
• Load Balancing
• Data Centre Aware
• Latency Aware
• Token Aware
• Reconnection policies
• Retry policies
• Downgrading Consistency
• Plus others…
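
To make the retry and downgrading-consistency items concrete, here is a simulation in plain Python. It is a sketch under stated assumptions, not driver code: the `execute` function, the `Unavailable` exception and the QUORUM-then-ONE fallback order are hypothetical stand-ins for what a driver policy and a coordinator would do.

```python
CONSISTENCY_FALLBACK = ["QUORUM", "ONE"]  # hypothetical downgrade order


class Unavailable(Exception):
    """Raised when too few replicas are alive for the requested level."""


def required_replicas(level, replication_factor):
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ONE":
        return 1
    raise ValueError(f"unsupported consistency level: {level}")


def execute(query, alive_replicas, replication_factor, level):
    """Hypothetical coordinator call that fails if replicas are missing."""
    if alive_replicas < required_replicas(level, replication_factor):
        raise Unavailable(level)
    return f"{query} @ {level}"


def execute_with_downgrade(query, alive_replicas, replication_factor=3):
    """Try each consistency level in turn, weakest last."""
    last_error = None
    for level in CONSISTENCY_FALLBACK:
        try:
            return execute(query, alive_replicas, replication_factor, level)
        except Unavailable as exc:
            last_error = exc
    raise last_error


healthy = execute_with_downgrade("SELECT ...", alive_replicas=2)
degraded = execute_with_downgrade("SELECT ...", alive_replicas=1)
```

Downgrading trades consistency for availability, so it suits reads that can tolerate stale data better than writes that must be durable on a majority.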

ODBC / JDBC Connections
ODBC drivers
• For SparkSQL (SQL engine on Spark), via JDBC/ODBC SparkSQL thrift server
• For Hive (Hadoop SQL engine)
• For Cassandra directly (ANSI SQL or CQL requests)
JDBC drivers
• For SparkSQL (SQL engine on Spark), via JDBC/ODBC SparkSQL thrift server
• For Cassandra directly (in progress)
• JDBC drivers from the community but not officially supported

Real-Time / Operational Analytics Use Cases
Recommendation Engine
Internet of Things
Fraud Detection
Risk Analysis
Buyer Behaviour Analytics
Telematics, Logistics
Business Intelligence
Infrastructure Monitoring
…

How to do analytics on Cassandra data?
Remember …
Cassandra = NO JOIN, NO GROUP BY, filter on primary key only
2 solutions:
• CQL with predictable queries
• Joins and Aggregations on the fly:
Server level => Need a distributed processing framework : Hadoop or Spark
Client level => Possible but risky!
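
Why client-level aggregation is risky can be seen in a small sketch. The rows below are hypothetical and stand in for the result of a full table scan: to emulate a GROUP BY, every row has to travel to the client before anything is counted.

```python
from collections import Counter

# Hypothetical result of scanning the whole table from the client.
rows = [
    {"team_name": "PSG", "player_name": "Zlatan", "jersey": 10},
    {"team_name": "PSG", "player_name": "Player B", "jersey": 9},
    {"team_name": "OM", "player_name": "Player C", "jersey": 30},
]


def client_side_group_by(rows, key):
    """Emulate SELECT key, COUNT(*) ... GROUP BY key in client memory."""
    return Counter(row[key] for row in rows)


players_per_team = client_side_group_by(rows, "team_name")

# The result has one entry per team, but every row was transferred to get it.
rows_transferred = len(rows)
groups_returned = len(players_per_team)
```

With three rows this is harmless. With billions of rows the transfer and the client memory become the bottleneck, which is the case for a distributed framework such as Spark or Hadoop at server level.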

Reporting and Dashboard
• Static and operational dashboards and reports created for a
specific Cassandra application.
• CQL, Solr queries and DataStax drivers
• KPI and aggregations pre-calculated with scheduled batch or on
the fly during insert.
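
Pre-calculating a KPI on the fly during insert can be sketched as follows. This is an in-memory illustration with hypothetical names (`SalesStore`, `daily_totals`, amounts in cents); in Cassandra the aggregate would typically be kept in a separate table or counter updated alongside the raw event.

```python
from collections import defaultdict


class SalesStore:
    """Keeps raw events plus a per-shop, per-day total updated on each insert."""

    def __init__(self):
        self.events = []
        self.daily_totals = defaultdict(int)  # (shop, day) -> amount in cents

    def insert(self, shop, day, amount_cents):
        # Write the raw event and update the pre-computed KPI in the same step.
        self.events.append((shop, day, amount_cents))
        self.daily_totals[(shop, day)] += amount_cents

    def daily_total(self, shop, day):
        # The dashboard reads one pre-aggregated value; no scan, no GROUP BY.
        return self.daily_totals.get((shop, day), 0)


store = SalesStore()
store.insert("shop-1", "2015-10-27", 1200)
store.insert("shop-1", "2015-10-27", 800)
store.insert("shop-2", "2015-10-27", 500)
total_shop_1 = store.daily_total("shop-1", "2015-10-27")
```

The alternative named on the slide, a scheduled batch, would rebuild `daily_totals` from `events` periodically instead of updating it on every write.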

BI & Data Visualization tools
For BI and Data Visualization tools like Tableau Software,
Power BI, QlikView, Excel, …
• DataStax ODBC driver
SQL joins and aggregations executed at client level!
• Spark ODBC driver (from Databricks or Microsoft)
SQL translated into Spark jobs and executed at server level

Tableau Software
Databricks Spark ODBC Driver for SparkSQL
Live SQL queries to Spark or Extract data on local client

Power BI Desktop
Support for On-Prem Spark distributions
“The new data source in this month’s release is support for On-Prem Spark distributions. Last
month, we added support for Microsoft Azure HDInsight Spark, and this month we’re expanding
to other Spark distributions.
This new connector can be found under the “Other” category in the “Get Data” dialog.”
http://blogs.msdn.com/b/powerbi/archive/2015/09/23/44-new-features-in-the-power-bi-desktop-september-update.aspx
Microsoft Spark ODBC Driver

Notebook
Run code (Spark or CQL) from a Web browser
Notebooks like Zeppelin, Spark Notebook, Jupyter
For example Zeppelin:
• Examples available for Cassandra
• CQL language interpreter
• https://github.com/doanduyhai/incubator-zeppelin

Analytics with DataStax Enterprise
There are 4 ways to do Analytics on Cassandra data:
• Reporting with CQL queries
• Integrated Search (Solr)
• Integrated Batch Analytics (Hadoop integrated) on Cassandra
• Integrated Near Real-Time Analytics (Spark)
• Virtual multi data centers optimised as required – different workloads, hardware, availability, etc.
• Cassandra will replicate the data for you – no ETL is necessary
• Cassandra node started with Solr, Hadoop or Spark
[Diagram: Cassandra replication between transactional and analytics nodes]

Enterprise Search & Powerful Secondary Index
• Built-in enterprise search on Cassandra data via a strong Apache Solr and Lucene
integration
• Facets, Filtering, Geospatial search, Text Analysis, Joins, etc.
• Real-time indexing process and search operations
• Search queries from CQL and REST/Solr
• Solr shortcomings addressed:
• No bottleneck. Client can read/write to any Solr node.
• Search index partitioning and replication for scalability and availability.
• Multi-DC support
• Data durability (Solr alone lacks a write-ahead log, so data can be lost)
[Diagram: Cassandra replication between customer-facing nodes and search nodes]

Batch Analytics - Hadoop
• Integrated Hadoop 1.0.4
• CFS (Cassandra File System), no HDFS
• No Single Point of failure
• No Hadoop complexity – every node is built the same
• Hive / Pig / Sqoop / Mahout
[Diagram: Cassandra replication between customer-facing nodes and Hadoop nodes]

Real-Time Analytics - Spark
• Tight integration between Apache Spark and Cassandra
• Distributed processing: “In-memory Map/Reduce”, multi-threaded, best for iterations
• GraphX, MLlib (Machine learning), SparkSQL, Spark Streaming (Real-time processing)
• Thrift JDBC/ODBC Spark server – Spark Job server
• Apache Solr integration
• DataStax / Databricks partnership
• Up to 10x to 100x faster than MapReduce
[Diagram: Cassandra replication between customer-facing nodes and Spark nodes]
« Big Data » SDK

Real-time or Batch Analytics
Data Enrichment
Batch Processing
Machine Learning
Pre-computed
aggregates
Data
NO ETL

Spark Use Cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion

Workload Isolation
No ETL

Hot / Cold Data in a DataStax architecture
© 2014 DataStax, All Rights Reserved. Company Confidential
Hot Data
Online Operational Application
Cold Data
Offline Application
DataStax Cassandra Enterprise

DataStax Enterprise + Datawarehouse / Hadoop
Write Intensive
• Internet of Things, activity logs for fraud and recommendation, messages
Read Intensive
• Catalogue, Playlist, Recommendation, Fraud Alert, Personalization
• Operational Search, Dashboard and Reporting
Offline Applications
• Historical Analysis, OLAP, Complex Analytics, Self-Service BI
• Operational Search, Dashboard and Reporting
Data Warehouse, Hadoop cluster, Computation Engine, Multidimensional Cube

Ooyala Use Case : Hadoop + Cassandra
By leveraging data stored in Apache Cassandra, Ooyala is helping their customers take a more strategic
approach when delivering a digital video experience, so they can get ahead in this fast-evolving space.
http://www.datastax.com/resources/casestudies/ooyala
San Francisco-based video services company Ooyala provides a suite of technologies and services that support content
owners in managing, analyzing and monetizing the digital video they publish online, on mobile devices, and through the
over-the-top distribution platform for delivering Internet video to television.

Spotify Use Case : Hadoop + Cassandra
https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/
Personalization at Spotify using Cassandra

Thanks
We power the big data apps
that transform business.
©2013 DataStax Confidential. Do not distribute without consent.
