DataStax Sandbox Tutorial
A Guided Tutorial
June 2016
Table of Contents
WELCOME
WHAT IS APACHE CASSANDRA?
WHAT IS DATASTAX ENTERPRISE?
ABOUT THIS TUTORIAL
SESSION 1: GETTING STARTED WITH THE DATASTAX SANDBOX
To Learn More
SESSION 2: CREATING AND QUERYING DATABASE OBJECTS WITH DATASTAX DEVCENTER
To Learn More
SESSION 3: QUERYING CASSANDRA OBJECTS FROM THE COMMAND LINE
To Learn More
SESSION 4: MONITORING CASSANDRA AND DATASTAX ENTERPRISE WITH DATASTAX OPSCENTER
To Learn More
SESSION 5: RUNNING ANALYTICS ON CASSANDRA DATA
To Learn More
SESSION 6: RUNNING SEARCH OPERATIONS ON CASSANDRA DATA
To Learn More
SESSION 7: GETTING STARTED WITH DSE GRAPH
To Learn More
WRAP UP
CONCLUSION
ABOUT DATASTAX
Welcome
The DataStax Sandbox contains a single DSE node that runs Cassandra as its default node type. You can
easily switch the node type to analytics or search to explore how those workloads behave.
The DataStax Sandbox runs on either Oracle VM VirtualBox or VMware Fusion and requires a 64-bit
operating system and at least 20GB of disk space. All non-DSE Graph functionality requires 8GB of RAM;
running the Sandbox with DSE Graph enabled requires 16GB of RAM.
NOTE: The DataStax Sandbox is neither intended nor configured for production deployments or
performance testing.
What Is Apache Cassandra?
Cassandra automatically distributes data across all nodes that participate in a “ring,” or database
cluster. A developer or administrator does not need to do any additional work, programmatic or
operational, to distribute data across a cluster.
Cassandra also provides built-in, customizable replication, which stores redundant copies of data
across the nodes that participate in a Cassandra ring, whether that cluster is on premises, in the cloud,
or spans multiple data centers and cloud providers. If any node in a cluster goes down, one or more
copies of that node’s data are still available on other machines in the cluster, so the database stays
online and operational.
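To make the distribution and replication described above more concrete, here is a minimal Python sketch
of a token ring. It is a toy model under stated assumptions, not Cassandra's implementation: the node
names, the use of MD5 as the hash, the single token per node, and the ToyRing and replicas_for names are
all invented for illustration (Cassandra's default partitioner is Murmur3-based). Each key hashes to a
position on the ring, the first node at or after that position holds the first copy, and the next nodes
clockwise hold the additional copies.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (toy hash, not Murmur3)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """A simplified token ring: one token per node, no virtual nodes."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds node count")
        self.replication_factor = replication_factor
        # Sort nodes by their token so the ring can be walked clockwise.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas_for(self, partition_key: str):
        """Return the nodes that hold copies of the given partition key."""
        # First node whose token is at or after the key's token; the
        # modulo below wraps around to the start of the ring if needed.
        start = bisect.bisect_left(self._tokens, token(partition_key))
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]


# Hypothetical four-node cluster that stores three copies of every row.
ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas_for("user:42")
print(owners)

# If the first owner goes down, the remaining replicas still hold the data.
survivors = [n for n in owners if n != owners[0]]
print(survivors)
```

Because placement depends only on the hash of the key, any client can compute where a row lives without
consulting a central coordinator, which is the property the paragraphs above rely on.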
What Is DataStax Enterprise?
Like Cassandra, DSE scales out across multiple nodes and provides full workload isolation so that nodes
designated for online operations do not compete with nodes specified as analytic or search nodes where
resources or data are concerned.
Session 1: Getting Started with the DataStax Sandbox
Open VirtualBox or VMware Fusion and import the Sandbox image (the VM image is preconfigured with
appropriate RAM and CPU settings). The image should take just a couple of minutes to boot, after which
a login screen appears.
The VM image will present a Firefox browser with a couple of tabs open. The second tab contains an
introductory welcome message with links at the bottom for DataStax OpsCenter (a visual management
and monitoring solution for Cassandra and DataStax Enterprise) and a copy of this tutorial.
Minimizing the browser shows the VM image’s desktop. The VM’s desktop contains a number of folders
and icons that enable you to easily try out various parts of the sandbox. For example, to check that DSE
is running and is ready for database operations, perform the following:
1. Locate the utilities folder on the desktop and double-click to open it.
2. Double-click the Check Node Status icon.
Cassandra’s nodetool utility runs, and a window opens with output that resembles Fig 1.
Note: The line starting with UN (Up/Normal) confirms that the Cassandra node is running.
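The same check can be scripted. The Python sketch below scans text shaped like nodetool status output
and reports whether each node line begins with UN. The SAMPLE_OUTPUT text is invented for illustration
and only mimics the general layout of the real output (actual columns and values vary by version), and
node_states is a hypothetical helper name rather than part of any DataStax tool.

```python
def node_states(status_output: str):
    """Return (address, is_up_and_normal) for each node line in the output.

    Node lines begin with a two-letter code: the first letter is the status
    (U = Up, D = Down) and the second is the state (N = Normal, L = Leaving,
    J = Joining, M = Moving).
    """
    results = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            results.append((fields[1], code == "UN"))
    return results


# Invented sample text that imitates the layout of `nodetool status`.
SAMPLE_OUTPUT = """\
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns    Host ID              Rack
UN  127.0.0.1  120.5 KB  1       100.0%  host-id-placeholder  rack1
"""

for address, healthy in node_states(SAMPLE_OUTPUT):
    print(address, "is up and normal" if healthy else "needs attention")
```

On a real node you would pass in the captured output of the nodetool status command instead of the
sample string.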
To Learn More
For more introductory information on Cassandra and DataStax Enterprise, please reference the following
resources:
Session 2: Creating and Querying Database Objects with DataStax DevCenter
The basic database objects that you will routinely interact with are:
• Keyspace – Serves as a container for database objects such as tables and indexes, and is
where the level of replication is set. It is analogous to a Microsoft SQL Server or MySQL
database.
• Table – Sometimes referred to in Cassandra literature as a column family, it is the primary object
used to store data. A Cassandra table looks a lot like an RDBMS table on the surface, but
actually it is a sparse data object that provides much more flexibility.
• Index – Akin to an index in an RDBMS, it is a mechanism used to improve the performance of
some queries.
There are other objects in Cassandra, but these three are the ones you will work with most often.
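As a rough mental model of how these objects nest, and of what “sparse” means for a table, consider the
following Python sketch. It is an analogy only and says nothing about how Cassandra stores data on disk.
The videodb keyspace and users table come from this tutorial; the column names, the sample rows, and the
insert and select helpers are assumptions made for illustration.

```python
from collections import defaultdict

# keyspace -> table -> primary key -> {column: value}
# Only the columns that were actually written are stored for a row.
keyspaces = defaultdict(lambda: defaultdict(dict))


def insert(keyspace, table, primary_key, **columns):
    """Write (or overwrite) the given columns for one row."""
    row = keyspaces[keyspace][table].setdefault(primary_key, {})
    row.update(columns)


def select(keyspace, table, primary_key, column):
    """Read one column; a column that was never written reads as None."""
    return keyspaces[keyspace][table].get(primary_key, {}).get(column)


insert("videodb", "users", "alice", email="alice@example.com", city="Austin")
insert("videodb", "users", "bob", email="bob@example.com")

print(select("videodb", "users", "alice", "city"))   # prints Austin
print(select("videodb", "users", "bob", "city"))     # prints None
print(sorted(keyspaces["videodb"]["users"]["bob"]))  # prints ['email']
```

The row for bob simply has no city entry, rather than an empty placeholder, which is the sense in which
a table can look like an RDBMS table on the surface while being more flexible underneath.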
DataStax DevCenter works much like GUI tools for RDBMSs (e.g., TOAD for Oracle, SQL Server Query
Analyzer, MySQL Workbench). DevCenter automatically connects you to the running DSE instance in the
VM. For this exercise, you will create a new keyspace, insert data into a number of tables, and run a
query against a table.
1. Locate the first tab in DevCenter’s query interface (labeled ‘Sample Data Modeling’) and click it to
give it focus. Alternatively, you can double-click the “1-Sample Data Modelling.cql” script displayed
in the CQL Scripts panel. This script creates a new keyspace and a number of tables and indexes.
2. Notice how closely the CQL in the interface resembles DDL in SQL.
3. Click the green arrow icon to execute the script.
4. The status bar at the bottom of the Results pane shows the message “10 statement(s) successfully
executed.”
5. Notice the new keyspace labeled “videodb” in the Schema Navigator pane (on the right-hand side).
6. Click the arrow in the Schema Navigator to view the new tables you have just created.
1. Click the second tab in DevCenter’s query interface (labeled ‘Sample Inserts’).
2. Click the green arrow icon to execute the script.
3. The bottom of the Results pane shows the message “51 statement(s) successfully executed.”
1. Click the third tab in DevCenter’s query interface (labeled ‘Sample Queries’).
2. This tab contains a variety of sample queries you can run against your new tables.
3. From the File menu, choose New CQL Script. This opens a new query tab.
4. Type the following into the interface: select * from videodb.users; Pressing Ctrl+Space while you
type opens the code completion pop-up, which can help you write queries faster.
5. Click the green arrow icon to execute your query.
6. Observe the rows returned in the Results portion of the interface.
To Learn More
For more information on Cassandra’s data model, designing NoSQL applications, the Cassandra Query
Language (CQL) and DataStax DevCenter, please visit:
Session 3: Querying Cassandra Objects from the Command Line
In addition to the graphical DataStax DevCenter tool, you can create, manage, and query Cassandra
objects from a command line tool – the CQL shell or cqlsh. To open the cqlsh tool in your VM:
You will see some informational messages at the top of the utility regarding the version of Cassandra and
CQL to which you are connected.
Now type help; and hit the enter key. You will see a list of CQL commands that you can use inside the
utility. To get more information about any one of them, type help followed by the command name (for
example, help select;) and hit the enter key.
Now, let’s use the cqlsh tool to get some information about a certain table and then query that table:
1. Type use videodb; inside the utility and hit enter. You have now switched the context of the tool to
use the videodb keyspace.
2. Type desc table users; and hit enter. This command will show you the DDL used to create the
table.
3. Type select * from users; and hit enter. This query will pull back all rows for the users table.
Now type exit; and hit the enter key. This will disconnect you from Cassandra and the cqlsh tool and
return you to a terminal prompt. You can type exit again at the prompt to close the terminal window.
To Learn More
There is much more you can do with CQL and the cqlsh tool. For more information on CQL and the cqlsh
tool, please refer to the following:
Session 4: Monitoring Cassandra and DataStax Enterprise with DataStax OpsCenter
DataStax OpsCenter is a visual management and monitoring solution for Cassandra and DataStax
Enterprise. It can be installed on any server, on premises or in the cloud, that has connectivity to
clusters running Cassandra or DataStax Enterprise.
Each node in a Cassandra or DataStax Enterprise cluster contains a DataStax agent, which
communicates with the central OpsCenter service. The DataStax agent and OpsCenter service work
together to monitor and handle tasks on every managed cluster.
OpsCenter provides a Web-based console from which everything can be centrally managed. The
OpsCenter interface provides a visual point-and-click environment for quickly carrying out many
administration and performance monitoring activities.
1. Go to your VM’s desktop and locate the Launch DataStax OpsCenter icon.
2. Double-click the icon. Firefox opens and presents the OpsCenter dashboard:
1. Click the Nodes icon (which looks like a four-leaf clover) in the left-hand management navigation
pane.
This shows an alternative graphical dashboard of your cluster: a ring graphic with one green circle,
which represents the database node. If you hover your mouse pointer over the green circle, OpsCenter
presents summary information about that node.
You can explore the core OpsCenter features by using the functions listed in the left-hand
management navigation pane:
• Nodes (Ring or List view) – lets you navigate a cluster’s nodes and perform various actions on
them (e.g. start, stop, etc.).
• Activities – lets you check out activities being carried out on the cluster as well as the event log
that lists all actions that have occurred.
• Data – allows you to run backup/restore operations and view/create data objects in the cluster.
• Services – lets you graphically manage the various DataStax server services running on the
cluster as well as utilize the Best Practice service that helps those new to DataStax automatically
tune and optimize their database clusters.
There are also functions listed across the top of OpsCenter that are used to visually create new database
clusters and perform other actions.
Session 5: Running Analytics on Cassandra Data
DataStax Enterprise provides built-in integration with Spark for running near real-time analytics on
Cassandra data, as well as a number of Hadoop components (MapReduce, Hive, Pig, Mahout, Sqoop) that
allow you to run batch analytics on Cassandra data. DSE provides complete workload isolation for
analytics operations, so nodes designated as analytics nodes do not conflict or compete with
online/Cassandra nodes (or enterprise search/Solr nodes) for compute resources or data.
To run analytics on Cassandra data in your VM, you can use the weather sensor demo that is bundled
with DataStax Enterprise. The demo simulates a weather sensor collection and analytics application. To
use the demo, perform the following:
1. Locate the folder on the VM desktop labeled “Weather Sensor Demo” and open it.
2. Double-click the “Start with Spark Analytic Node” icon, which stops your existing Cassandra
instance and restarts the node as an analytics (Spark-enabled) node. Minimize the window that
remains open.
3. Double-click the “Load Weather Sensor Demo Data” icon, which loads sample data into your new
analytics node. This takes a few minutes to complete. Once the data has loaded, you can type ‘exit’ to
close the command shell.
4. Double-click the “Start Spark Service” icon. Minimize the window afterwards.
To Learn More
For more information on running analytics on Cassandra data in DSE using Spark, please refer to the
following:
• DSE documentation (see section of DSE docs entitled “Analyzing Data Using Spark” under the
“DSE Analytics” link).
Session 6: Running Search Operations on Cassandra Data
DataStax Enterprise makes it easy to run enterprise search operations on Cassandra data through its
built-in Solr integration. DSE provides complete workload isolation for search tasks, so nodes
designated as search nodes do not conflict or compete with online/Cassandra nodes (or analytics nodes)
for compute resources or data.
Your VM comes with a demo of enterprise search functionality. To run through the demo, perform the
following:
1. Locate the folder on the VM desktop labeled “Wikipedia Demo Showing Solr” and open it.
2. Double-click the “Start Solr Node” icon, which restarts your VM’s node as a search node.
Minimize the window after it opens.
3. Double-click the “Create Schema and Index” icon, which creates a sample schema with data
that can be searched. You can close the window once it finishes loading its 3,000 sample records
from Wikipedia.
4. Double-click the “View Sample Search Screen” icon, which brings up a simple browser window
designed to act as a front-end search application.
5. Type “north” into the Search widget and hit enter. On the right-hand side, DSE/Solr lists the
Wikipedia articles that contain the word “north”. If you are connected to the Web, you can click the
“wikipedia article” link to see the article in Wikipedia.
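For intuition about what happens behind a term search like the one in step 5, the following Python
sketch builds a tiny inverted index, the general data structure that search engines such as Solr use to
look up documents by word. It is a toy under stated assumptions and not DSE Search itself: the three
article titles and texts are invented, and build_index and search are hypothetical helper names.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document titles containing it."""
    index = defaultdict(set)
    for title, body in documents.items():
        for word in re.findall(r"[a-z0-9]+", body.lower()):
            index[word].add(title)
    return index


def search(index, term):
    """Return the titles containing the term, sorted for stable output."""
    return sorted(index.get(term.lower(), set()))


# Invented sample documents standing in for the demo's Wikipedia records.
articles = {
    "North Sea": "The North Sea lies between Great Britain and Norway.",
    "Compass": "A compass needle points toward magnetic north.",
    "Sahara": "The Sahara is the largest hot desert on Earth.",
}

index = build_index(articles)
print(search(index, "north"))  # prints ['Compass', 'North Sea']
```

The index is built once when documents are loaded, so each search is a dictionary lookup rather than a
scan of every article.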
To Learn More
There is much more to DSE’s built-in enterprise search capabilities than the simple demo above has
shown. For more information on running enterprise search on Cassandra data in DSE using Solr, please
refer to the following:
• DSE documentation (see section of DSE docs entitled “DSE Search/Solr” under the “Integrated
Solutions” link).
Session 7: Getting Started with DSE Graph
DSE Graph is a graph database built for cloud applications that need to manage complex data and its
many relationships. DSE Graph delivers continuous uptime along with predictable performance and
scale, while remaining operationally simple to manage.
For this tutorial, you will use DataStax Studio, which is a web-based tool for visually interacting with DSE
Graph. With DataStax Studio, you will create a sample graph schema with data, query data from the
graph, and visually transform the data into a variety of charts. To get started:
1. Locate the Launch Graph Node icon on the Sandbox desktop and double-click it. This will start a
DSE node that is graph enabled on your VM.
2. Locate the DataStax Studio icon on the Sandbox desktop and double-click it. This will invoke a
tab in the VM’s browser, which will run DataStax Studio.
DataStax Studio will present an interface in your browser like the following:
DataStax Studio allows you to create and save multiple ‘notebooks’ that contain code and queries that
you run against DSE Graph. DataStax Studio automatically connects you to the graph-enabled DSE node
running on your VM.
Notebooks in DataStax Studio are broken up into “cells” that typically contain code and queries executed
against DSE Graph, along with any result sets they return. You interact with DSE Graph using the Gremlin
language, which is the open source standard for graph databases.
The third cell contains a set of Gremlin statements used to create a sample graph. To create the graph,
move your mouse pointer into the cell and notice that a set of options appears on the right-hand side
of the cell. Click the arrow (“execute”) icon, which runs the statements needed to create your
sample graph:
Once DataStax Studio has finished creating your sample graph, you can view a visual representation of
the small graph by clicking the “Schema” icon in the upper right-hand corner of DataStax Studio, which
displays the following:
If you move your mouse pointer over the objects displayed in the schema graphic, DataStax Studio will
provide you with a pop-up box that contains metadata about the object. You can click on the Schema icon
again to remove the schema graphic from view.
Now, scroll down to the next set of cells. The next step in the tutorial selects all the data from the
graph using a simple Gremlin query and presents the results in a grid format:
Fig 11 – Querying all the data from the newly created graph
To Learn More
For more introductory information on DSE Graph, please reference the following resources:
Wrap Up
Once you have completed all the exercises, you can shut down your VM by choosing
System->Shut down… from the main menu.
To return the VM to its original state, you can open the Utilities folder on the desktop and double-click on
the “Clear All Data” icon.
Conclusion
The DataStax Sandbox provides a basic hands-on overview of DataStax software. The recommended
next steps are to (1) enroll in DataStax Academy, the free DataStax online training that provides
self-paced instruction and exercises designed to help ground you in creating applications for DSE and
Cassandra, and (2) follow up with the recommended resources in each of the above sections and visit
the DataStax website for additional materials.
About DataStax
DataStax, the leading provider of database software for cloud applications, accelerates the ability of
enterprises, government agencies, and systems integrators to power the exploding number of cloud
applications that require data distribution across datacenters and clouds, by using our secure,
operationally simple platform built on Apache Cassandra™.
With more than 500 customers in over 50 countries, DataStax is the database technology of choice for
the world’s most innovative companies, such as Netflix, Safeway, ING, Adobe, Intuit, Target and eBay.
Based in Santa Clara, Calif., DataStax is backed by industry-leading investors including Comcast
Ventures, Crosslink Capital, Lightspeed Venture Partners, Kleiner Perkins Caufield & Byers, Meritech
Capital, Premji Invest and Scale Venture Partners. For more information, visit DataStax.com or follow us
@DataStax. 06.27.16