Cloudera Msazure Hadoop Deployment Guide

This document provides a tutorial on getting started with Hadoop deployment. It discusses ingesting and querying relational transaction data from a MySQL database using Apache Sqoop and Apache Impala. Sqoop is used to import the data from MySQL into HDFS in Parquet format while preserving the schema. Impala is then used to issue SQL queries on the imported data, such as finding the most popular product categories or top revenue generating products. The tutorial demonstrates how to perform the same types of analyses on big data that are typically done relationally, at larger scale and lower cost using Hadoop.

CLOUDERA DEPLOYMENT GUIDE

Getting Started with Hadoop Tutorial
Table of contents
Setup
Showing big data value
Showing data hub value
Advanced analytics on the same platform
Data governance and compliance
The End Game

Setup
For the remainder of this tutorial, we will present examples in the context of a fictional corporation called DataCo. Our mission is to help this organization get
better insight by asking bigger questions.

SCENARIO:
Your Management: is talking euphorically about Big Data.

You: are carefully skeptical, as it will most likely all land on your desk anyway.
Or, it has already landed on you, with the nice project description of: Go figure
this Hadoop thing out.

PREPARATION:
Verify your environment. Go to Cloudera Manager in your demo environment and
make sure the following services are up and running (have a green status dot next
to them in the Cloudera Manager HOME Status view):
• Apache Impala - You will use this for interactive query
• Apache Hive - You will use this for structure storage (i.e. tables in the Hive metastore)
• Hue - You will use this for end user query access
• HDFS - You will use this for distributed data storage
• YARN - This is the processing framework used by Hive (includes MR2)

If any of the services show yellow or red, restart the service or reach out to
the discussion forum for further assistance.

STARTING / RESTARTING A SERVICE:


1. Click on the dropdown menu to the right of the service name.
2. Click on Start or Restart.
3. Wait for the service status to turn green.

Now that you have verified that your services are healthy and showing green,
you can continue.
Exercise 1: Ingest and query relational data
In this scenario, DataCo’s business question is: What products do our
customers like to buy? To answer this question, the first thought might be to
look at the transaction data, which should indicate what customers do buy
and like to buy, right?

This is probably something you can do in your regular RDBMS environment, but
a benefit of Apache Hadoop is that you can do it at greater scale at lower cost,
on the same system that you may also use for many other types of analysis.

What this exercise demonstrates is how to do the same thing you already
know how to do, but in CDH. Seamless integration is important when
evaluating any new infrastructure. Hence, it’s important to be able to do what
you normally do, and not break any regular BI reports or workloads over the
dataset you plan to migrate.

To analyze the transaction data in the new platform, we need to ingest it into
the Hadoop Distributed File System (HDFS). We need to find a tool that easily
transfers structured data from an RDBMS to HDFS, while preserving structure.
That enables us to query the data, but not interfere with or break any regular
workload on it.
Apache Sqoop, which is part of CDH, is that tool. The nice thing about
Sqoop is that we can automatically load our relational data from MySQL into
HDFS, while preserving the structure. With a few additional configuration
parameters, we can take this one step further and load this relational data
directly into a form ready to be queried by Apache Impala, the MPP analytic
database included with CDH, and other workloads.

You should first log in to the Master Node of your cluster via a terminal. Then,
launch the Sqoop job:

> sqoop import-all-tables \
-m {{cluster_data.worker_node_hostname.length}} \
--connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--compression-codec=snappy \
--as-parquetfile \
--warehouse-dir=/user/hive/warehouse \
--hive-import

This command may take a while to complete, but it is doing a lot. It is
launching MapReduce jobs to pull the data from our MySQL database and
write the data to HDFS in parallel, distributed across the cluster in Apache
Parquet format. It is also creating tables to represent the HDFS files in Impala/
Apache Hive with matching schema.

Parquet is a format designed for analytical applications on Hadoop. Instead of
grouping your data into rows like typical data formats, it groups your data into
columns. This is ideal for many analytical queries where instead of retrieving
data from specific records, you’re analyzing relationships between specific
variables across many records. Parquet is designed to optimize data storage
and retrieval in these scenarios.
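
To make the row-versus-column idea behind Parquet concrete, here is a minimal Python sketch. It does not read or write Parquet; the three sample records and their field names are made up for illustration. It only shows why grouping values by column lets an analytical query read one field without walking whole records.

```python
# Hypothetical sample records (not taken from retail_db).
rows = [
    {"order_id": 1, "product": "bike", "subtotal": 250.0},
    {"order_id": 2, "product": "helmet", "subtotal": 40.0},
    {"order_id": 3, "product": "bike", "subtotal": 260.0},
]


def to_columns(records):
    """Regroup a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing one field still visits every full record.
total_from_rows = sum(record["subtotal"] for record in rows)

# Column layout: the same sum reads a single list of values.
total_from_columns = sum(columns["subtotal"])

print(columns["product"])
print(total_from_rows, total_from_columns)
```

Both totals are the same; the difference is how much unrelated data has to be touched to compute them, which is the property a columnar format is built around.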
VERIFICATION
When this command is complete, confirm that your data files exist in HDFS.

> hadoop fs -ls /user/hive/warehouse/


> hadoop fs -ls /user/hive/warehouse/categories/

These commands show the directories and the files inside
them that make up your tables.

Note: The number of .parquet files shown will be equal to what was passed
to Sqoop with the -m parameter. This is the number of ‘mappers’ that Sqoop
will use in its MapReduce jobs. It could also be thought of as the number of
simultaneous connections to your database, or the number of disks / Data
Nodes you want to spread the data across. So, on a single-node cluster you will
just see one, but larger clusters will have a greater number of files.
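
As a rough illustration of why the file count follows the -m value, the sketch below divides a range of integer keys into one contiguous chunk per mapper. This is a simplified model written for this tutorial, assuming an integer key with no gaps; the function name is hypothetical and Sqoop's real splitting logic is more involved.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous chunks,
    one per mapper, with sizes differing by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than rows: this mapper gets nothing
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# One mapper reads everything, so one output file is expected.
print(split_ranges(1, 10, 1))
# Three mappers each read their own slice and write their own file.
print(split_ranges(1, 10, 3))
```

Each chunk corresponds to one mapper, one database connection, and one output file in this simplified picture.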
Hive and Impala also allow you to create tables by defining a schema
over existing files with ‘CREATE EXTERNAL TABLE’ statements, similar to
traditional relational databases. But Sqoop already created these tables for
us, so we can go ahead and query them.

We’re going to use Hue’s Impala app to query our tables. Hue provides a web-based
interface for many of the tools in CDH and can be found on port 8888
of your Manager Node. In the QuickStart VM, the administrator username for
Hue is ‘cloudera’ and the password is ‘cloudera’.

Once you are inside of Hue, click on Query Editors, and open the Impala
Query Editor.

To save time during queries, Impala does not poll constantly for metadata
changes. So, the first thing we must do is tell Impala that its metadata is out of
date. Then we should see our tables show up, ready to be queried:

invalidate metadata;
show tables;

You can also click on the “Refresh Table List” icon on the left to see your new
tables in the side menu.
Now that your transaction data is readily available for structured queries in
CDH, it’s time to address DataCo’s business question. Copy and paste or type
in the following standard SQL example queries for finding the most popular
product categories and the top 10 revenue generating products:

-- Most popular product categories
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;

You should see up to ten category names, each with its count, ordered from
most to least popular.
Clear out the previous query, and replace it with the following:

-- top 10 revenue generating products
select p.product_id, p.product_name, r.revenue
from products p inner join
(select oi.order_item_product_id, sum(cast(oi.order_item_subtotal as float)) as revenue
from order_items oi inner join orders o
on oi.order_item_order_id = o.order_id
where o.order_status <> 'CANCELED'
and o.order_status <> 'SUSPECTED_FRAUD'
group by order_item_product_id) r
on p.product_id = r.order_item_product_id
order by r.revenue desc
limit 10;

You should see the ten products with the highest revenue, listed with their
IDs and names.

You may notice that we told Sqoop to import the data into Hive but used
Impala to query the data. This is because Hive and Impala can share both
data files and the table metadata. Hive works by compiling SQL queries into
MapReduce jobs, which makes it very flexible, whereas Impala executes
queries itself and is built from the ground up to be as fast as possible, which
makes it better for interactive analysis. We’ll use Hive later for an ETL (extract-
transform-load) workload.
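
Because the revenue query is plain SQL, its logic can be checked on a toy data set before running it on the cluster. The sketch below uses Python's built-in sqlite3 module rather than Impala; the table contents are invented, and storing order_item_subtotal as text is an assumption made only so that the cast has something to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER, product_name TEXT);
CREATE TABLE orders (order_id INTEGER, order_status TEXT);
CREATE TABLE order_items (order_item_order_id INTEGER,
                          order_item_product_id INTEGER,
                          order_item_subtotal TEXT);
-- Hypothetical rows, not the retail_db contents.
INSERT INTO products VALUES (1, 'Bike'), (2, 'Helmet');
INSERT INTO orders VALUES (10, 'COMPLETE'), (11, 'CANCELED'), (12, 'CLOSED');
INSERT INTO order_items VALUES (10, 1, '250.0'), (11, 1, '250.0'),
                               (12, 2, '40.0'), (10, 2, '40.0');
""")

# Same shape as the tutorial query: revenue per product, ignoring
# canceled and suspected-fraud orders, highest revenue first.
top_products = conn.execute("""
SELECT p.product_id, p.product_name, r.revenue
FROM products p INNER JOIN
  (SELECT oi.order_item_product_id,
          SUM(CAST(oi.order_item_subtotal AS FLOAT)) AS revenue
   FROM order_items oi INNER JOIN orders o
     ON oi.order_item_order_id = o.order_id
   WHERE o.order_status <> 'CANCELED'
     AND o.order_status <> 'SUSPECTED_FRAUD'
   GROUP BY oi.order_item_product_id) r
  ON p.product_id = r.order_item_product_id
ORDER BY r.revenue DESC
LIMIT 10
""").fetchall()

for row in top_products:
    print(row)
```

The canceled order contributes nothing to the Bike total, which is the behavior the two status filters are there to guarantee.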
CONCLUSION
You have now gone through the first basic steps: using Sqoop to ingest structured data into HDFS, transforming it
into Parquet file format, and creating Hive tables for use when you query this data.

You have also learned how to query tables using Impala and that you can use regular interfaces and tools
(such as SQL) within a Hadoop environment as well. The idea here is that you can do the same reports
you usually do, while the architecture of Hadoop provides much larger scale and flexibility than
traditional systems.
Showing big data value

SCENARIO:
Your Management: is indifferent; you produced what you always produce - a
report on structured data - but you didn’t really prove any additional value.

You: are either also indifferent and just go back to what you have always
done... or you have an ace up your sleeve.

PREPARATION:
Go to Cloudera Manager’s home page and verify the following services are up:
• Impala
• Hive
• HDFS
• Hue
Exercise 2: Correlate structured data with unstructured data
Since you are a pretty smart data person, you realize another interesting
business question would be: are the most viewed products also the most
sold? Since Hadoop can store unstructured and semi-structured data
alongside structured data without remodeling an entire database, you can
just as well ingest, store, and process web log events. Let’s find out what site
visitors have viewed the most.

For this, you need the web clickstream data. The most common way to
ingest web clickstream is to use Apache Flume. Flume is a scalable real-time
ingest framework that allows you to route, filter, aggregate, and do
“mini-operations” on data on its way in to the scalable processing platform.

In Exercise 3, later in this tutorial, you can explore a Flume configuration
example, to use for real-time ingest and transformation of our sample web
clickstream data. However, for the sake of tutorial-time, in this step, we will
not have the patience to wait for three days of data to be ingested. Instead,
we prepared a web clickstream data set (just pretend you fast forwarded
three days) that you can bulk upload into HDFS directly.

BULK UPLOAD DATA
For your convenience, we have pre-loaded some sample access log data into
/opt/examples/log_files/access.log.2. Let’s move this data from the local
filesystem into HDFS.

> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
> sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs

The copy command may take several minutes to complete. Verify that your
data is in HDFS by executing the following command:

> hadoop fs -ls /user/hive/warehouse/original_access_logs

Now you can build a table in Hive and query the data via Apache Impala and
Hue. You’ll build this table in 2 steps. First, you’ll take advantage of Hive’s flexible
SerDes (serializers / deserializers) to parse the logs into individual fields using a
regular expression. Second, you’ll transfer the data from this intermediate table
to one that does not require any special SerDe. Once the data is in this table, you
can query it much faster and more interactively using Impala.

We’ll use the Hive Query Editor app in Hue to execute the following queries:

CREATE EXTERNAL TABLE intermediate_access_logs (
ip STRING,
date STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
ip STRING,
date STRING,
method STRING,
url STRING,
http_version STRING,
code1 STRING,
code2 STRING,
dash STRING,
user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR {{lib_dir}}/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;
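
To see what the regular expression in the intermediate table does to a single log line, the following Python sketch applies a single-escaped equivalent of the Hive pattern and names the nine capture groups after the table columns. The sample log line is invented for illustration, and the tokenize helper is a hypothetical name, not part of Hive.

```python
import re

# Same pattern as 'input.regex', written with Python's single escaping.
LOG_PATTERN = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" (\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# Column names of intermediate_access_logs, in capture-group order.
FIELDS = ["ip", "date", "method", "url", "http_version",
          "code1", "code2", "dash", "user_agent"]


def tokenize(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


# Hypothetical access log line in the usual combined log layout.
sample = ('10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] '
          '"GET /department/apparel/products HTTP/1.1" 200 1100 '
          '"-" "Mozilla/5.0 (X11; Linux x86_64)"')

record = tokenize(sample)
for name in FIELDS:
    print(name, "=", record[name])
```

Each capture group becomes one column, which is what lets the second table store the same data as plain delimited fields.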
The final query will take a minute to run. It is using a MapReduce job, just
like our Sqoop import did, to transfer the data from one table to the other in
parallel. You can follow the progress in the log below, and you should see the
message ‘The operation has no results.’ when it’s done.

Again, we need to tell Impala that some tables have been created through
a different tool. Switch back to the Impala Query Editor app, and enter the
following command:

invalidate metadata;

Now, if you enter the ‘show tables;’ query or refresh the table list in the
left-hand column, you should see the two new external tables in the default
database. Paste the following query into the Query Editor:

select count(*),url from tokenized_access_logs
where url like '%\/product\/%'
group by url order by count(*) desc;

By inspecting the results you quickly realize that this list contains many of
the products on the most sold list from previous tutorial steps, but there is one
product that did not show up in the previous result. There is one product that
seems to be viewed a lot, but never purchased. Why?

Well, in our example with DataCo, once these odd findings are presented to
your manager, it is immediately escalated. Eventually, someone figures out
that on that view page, where most visitors stopped, the sales path of the
product had a typo in the price for the item. Once the typo was fixed, and a
correct price was displayed, the sales for that SKU started to rapidly increase.
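
The page-view query boils down to four steps: keep URLs containing /product/, group by URL, count, and sort by count in descending order. A minimal Python sketch of the same logic, using a made-up list of URLs and a hypothetical helper name, looks like this:

```python
from collections import Counter

# Hypothetical clickstream URLs standing in for the url column.
urls = [
    "/product/1024", "/department/fitness", "/product/1024",
    "/product/77", "/checkout", "/product/1024", "/product/77",
]


def product_view_counts(all_urls):
    """Count views per product URL, most viewed first."""
    counts = Counter(url for url in all_urls if "/product/" in url)
    return counts.most_common()


views = product_view_counts(urls)
for url, count in views:
    print(count, url)
```

Comparing a list like this with the most sold products from Exercise 1 is how a frequently viewed but rarely purchased product would stand out.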
CONCLUSION
If you had lacked an efficient and interactive tool enabling analytics on high-volume semi-structured
data, this loss of revenue would have been missed for a long time. There is risk of loss if an organization
looks for answers within partial data. Correlating two data sets for the same business question showed
value and being able to do so within the same platform made life easier for you and for the organization.
If you’d like to dive deeper into Hive, Impala, and other tools for data analysis in Cloudera’s platform, you
may be interested in Data Analyst Training.

For now, we’ll explore some different techniques.


Showing data hub value

SCENARIO:
Your Management: can’t believe the magic you do with data and is about to
promote you and invest in a new team under your lead... when all hell breaks
loose. You get an emergency call - as you are now the go-to person - and your
manager is screaming about the loss of sales over the last three days.

You: go from slightly excited to under the gun in seconds. Well, lucky for you,
there might be a quick way to find out what is happening.

PREPARATION:
Go to Cloudera Manager and verify these services are running:
• HDFS
• Hue
• Solr
• YARN
Advanced analytics on the same platform
SCENARIO:
Your Management: is of course thrilled with the recent discoveries you
helped them with—you basically saved them a lot of money! They start giving
you bigger questions, and more funding (we really hope the latter!)

You: are excited to dive into more advanced use cases, but you know that you’ll
need even more funding from the organization. You decide to really show off!

PREPARATION:
Go to Cloudera Manager and verify these services are running:
• HDFS
• Spark
• YARN / MR2
Exercise 3: Explore log events interactively
Since sales are dropping and nobody knows why, you want to provide a way
for people to interactively and flexibly explore data from the website. We can
do this by indexing it for use in Apache Solr, where users can do text searches,
drill down through different categories, etc. Data can be indexed by Solr in
batch using MapReduce, or you can index tables in Apache HBase and get
real-time updates. To analyze data from the website, however, we’re going to
stream the log data in using Flume.

The web log data is a standard web server log.

Solr organizes data similarly to the way a SQL database does. Each record
is called a ‘document’ and consists of fields defined by the schema, just
like a row in a database table. Instead of a table, Solr calls it a ‘collection’
of documents. The difference is that data in Solr tends to be more loosely
structured. Fields may be optional, and instead of always matching exact
values, you can also enter text queries that partially match a field, just like
you’re searching for web pages. You’ll also see Hue refer to ‘shards’ - and
that’s just the way Solr breaks collections up to spread them around the
cluster so you can search all your data in parallel.
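
The document, collection, and shard vocabulary can be pictured with a toy model in Python. This is not Solr's API or its actual routing algorithm; the field names, the helper functions, and the CRC-based shard choice are all assumptions made for illustration.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document id (a toy stand-in for Solr's router)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def search(collection, field, text):
    """Return documents whose field contains the text, ignoring case.
    Documents that lack the field simply do not match."""
    needle = text.lower()
    return [doc for doc in collection
            if needle in str(doc.get(field, "")).lower()]


# A 'collection' of 'documents'; the department field is optional.
collection = [
    {"id": "1", "request": "GET /product/1024", "department": "fitness"},
    {"id": "2", "request": "GET /checkout"},
    {"id": "3", "request": "GET /product/77", "department": "Outdoors"},
]

hits = search(collection, "request", "product")
shards = {doc["id"]: shard_for(doc["id"], 2) for doc in collection}

print([doc["id"] for doc in hits])
print(shards)
```

The partial, case-insensitive match and the missing department field are the two ways this differs from a strict SQL row lookup.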
Here is how you can start real-time-indexing via Cloudera Search and Flume over the sample web server log data and use the Search UI in Hue to explore it:

CREATE YOUR SEARCH INDEX


Ordinarily when you are deploying a new search schema, there are four steps:

1. Creating an empty configuration

> solrctl --zk {{zookeeper_connection_string}}/solr instancedir --generate solr_configs

The result of this command is a skeleton configuration that you can customize
to your liking via the conf/schema.xml.

2. Editing your schema

The most likely area in conf/schema.xml that you would be interested in
is the <fields></fields> section. From this area you can define the fields
that are present and searchable in your index.

3. Uploading your configuration

> cd /opt/examples/flume
> solrctl --zk {{zookeeper_connection_string}}/solr instancedir --create live_logs ./solr_configs

4. Creating your collection

> solrctl --zk {{zookeeper_connection_string}}/solr collection --create live_logs -s {{ number of solr servers }}
You may need to replace the IP addresses with those of your three data nodes.

You can verify that you successfully created your collection in Solr by going to
Hue and clicking Search in the top menu. Then click on Indexes from the top right
to see all the indexes/collections.

Now you can see the collection that we just created, live_logs. Click on it.
You are now viewing the fields that we defined in our schema.xml file.
Now that you have verified that your search collection/index was created successfully, we can start
putting data into it using Flume and Morphlines. Flume is a tool for ingesting streams of data into your
cluster from sources such as log files, network streams, and more. Morphlines is a Java library for doing
ETL on-the-fly, and it’s an excellent companion to Flume. It allows you to define a chain of tasks like
reading records, parsing and formatting individual fields, and deciding where to send them, etc. We’ve
defined a morphline that reads records from Flume, breaks them into the fields we want to search on, and
loads them into Solr (You can read more about Morphlines here).

This example Morphline is defined at /opt/examples/flume/conf/morphline.conf, and we’re going to use
it to index our records in real-time as they’re created and ingested by Flume.
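
The "chain of tasks" idea can be sketched in a few lines of Python. This is a conceptual model only: it is neither the morphline configuration syntax nor the Morphlines Java API, and every function name and sample line here is hypothetical. Each step takes a record and either passes it on or drops it.

```python
def read_line(record):
    """Step 1: turn the raw input into a clean message."""
    record["message"] = record["raw"].strip()
    return record


def parse_fields(record):
    """Step 2: break the message into the fields we want to search on."""
    method, url, version = record["message"].split(" ")
    record.update(method=method, url=url, http_version=version)
    return record


def drop_non_get(record):
    """Step 3: decide where the record goes; here, non-GET requests are dropped."""
    return record if record["method"] == "GET" else None


def run_chain(record, commands):
    """Pass a record through each command in order; stop if one drops it."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


chain = [read_line, parse_fields, drop_non_get]
lines = ["GET /product/1 HTTP/1.1\n", "POST /login HTTP/1.1\n"]
processed = (run_chain({"raw": line}, chain) for line in lines)
indexed = [record for record in processed if record is not None]

print(indexed)
```

In the real setup the last step would hand the record to Solr instead of collecting it in a list.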
APACHE FLUME AND THE MORPHLINE
Now that we have an empty Solr index, and live log events coming in to our
fake access.log, we can use Flume and morphlines to load the index with the
real-time log data.

The key player in this tutorial is Flume. Flume is a system for collecting,
aggregating, and moving large amounts of log data from many different
sources to a centralized data store.

With a few simple configuration files, we can use Flume and a morphline (a
simple way to accomplish on-the-fly ETL) to load our data into our Solr index.
(Note: You can use Flume to load many other types of data stores; Solr is just
the example we are using for this tutorial.)

Start the Flume agent by executing the following command:

> flume-ng agent \
--conf /opt/examples/flume/conf \
--conf-file /opt/examples/flume/conf/flume.conf \
--name agent1 \
-Dflume.root.logger=DEBUG,INFO,console

This will start running the Flume agent in the foreground. Once it has started
and is processing records, you should see its log output in the terminal.
Now you can go back to the Hue UI, and click ‘Search’ from the collection’s page. You’ll be able to search, drill down into, and browse events that have been indexed.

If one of these steps fails, please reach out to the Discussion Forum and get help. Otherwise, you can start exploring the log data and understand what is going on.

For our story’s sake, we pretend that you started indexing data the same time as you started ingesting it (via Flume) to the platform, so that when your manager
escalated the issue, you could immediately drill down into data from the last three days and explore what happened. For example, perhaps you noted a lot of
DDoS events and could take the right measures to preempt the attack. Problem solved! Management is fantastically happy with your recent contributions, which
of course leads to a great bonus or something similar. :D
Exercise 4: Building a dashboard

To get started with building a dashboard with Hue, click on the pencil icon.
This will take you into the edit mode where you can choose different widgets
and layouts that you would like to see. You can choose a few options and
configurations here, but for now, just drag a bar chart into the top gray row.

This will bring up the list of fields that are present in our index so that you can
choose which field you would like to group by. Let’s choose request_date.
For the sake of our display, choose +15MINUTES for the INTERVAL.

You aren’t limited to a single column; you can view this as a two-column
display as well. Select the two-column layout from the top left.

While you’re here, let’s drag a pie chart to the newly created row in the left column.
This time, let’s choose department as the field that we want to group by for our
pie chart.

Things are really starting to take shape! Let’s add a Facet filter to the
left-hand side and select product as the facet.

Now that we are satisfied with our changes, let’s click on the pencil icon to
exit edit mode, and save our dashboard.

At the Hue project’s blog you can find a wide selection of video tutorials for
accomplishing other tasks in Hue. For instance, you can watch a video of a
Search dashboard similar to this example being created.

You may also be interested in more advanced training on the technologies
used in this tutorial, and other ways to index data in real-time and query it
with Solr in Cloudera’s Search Training course.
Data governance and compliance

DataCo has moved into bigger business thanks to the Big Data projects you’ve
contributed to. As more and more users start using the Enterprise Data Hub
you built, it starts getting more complicated to manage and trace data and
access to data. In addition, as your previous deliveries created such success,
the company has decided to build out a full EDH strategy, and as a result
lots of sensitive data is headed for your cluster too: credit card transactions,
social security data, and other financial records for DataCo.

Your Management: is worried about security controls and ability to audit the
access for compliance.

You: need to resolve their concerns and you want to make it easier to manage
who does what on your cluster for back-charging purposes too.

Some people are questioning exactly how the decision was made to change
the pricing on the website. You realize this is the perfect chance to prove
yourself again.

You need to demonstrate that you can:
• Easily show who has queried the data
• Show exactly what has been done with it since it was created
• Enforce policies about how it gets managed in the future

Cloudera Navigator provides a solution to all these problems. If using
Cloudera Live, you can find a link in the Govern Your Data section of your
Guidance Page, which your welcome email will direct you to. If using the
QuickStart VM, the username is ‘cloudera’ and the password is ‘cloudera’.
Exercise 5: Cloudera Navigator

DISCOVERY
The first thing you see when you log into Cloudera Navigator is a search tool. It’s
an excellent way to find data on your cluster, even when you don’t know exactly
what you’re looking for. Go ahead and click the link to ‘explore your data’.

You know that the old web server log data you analyzed was a Hive table, so
select ‘Hive’ under ‘Source Type’, and ‘Table’ under ‘Type’. You’re also pretty
sure it had ‘access log’ in the name, so enter this search query at the top and
hit enter:

*access*log*

When the results appear, you immediately recognize the tokenized_access_logs
table. That must be the one you queried!
LINEAGE
Now that you’ve found the data you were looking for, click on the table and
you’ll see a graph of the data’s lineage. You’ll see the tokenized_access_logs
table on the right and the underlying file with the same name in HDFS in blue.
You’ll also see the other Hive table you created from the original file and the
query you ran to transform the data between the two. (The different colors
represent different source types: yellow data comes from Hive, blue data
comes directly from HDFS.)

As you click on the nodes in this graph, more detail will appear. If you click on
the tokenized_access_logs table and the intermediate_access_logs table,
you’ll see arrows for each individual field running through that query. You can
see how quickly you could trace the origin of datasets even in a much busier
and more complicated environment!
AUDITING
Now you’ve shown where the data came from, but we still need to show what’s
been done with it. Go to the ‘Audits’ tab, using the link in the top-right corner.

As you can see, there are hundreds of events that have been recorded, each with
details of what was done, by whom, and when. Let’s narrow down what we’re
looking for again. Open the “Filters” menu from below the “Audit Events” heading.
Click the + icon twice to add two new filters. For the first filter, set the property
to ‘Username’ and fill in ‘admin’ as the value. For the second filter, set the
property to ‘Operation’ and fill in ‘QUERY’ as the value. Then click ‘Apply’.

As you click on the individual results, you can see the exact queries that were
executed and all related details.

You can also view and create reports based on the results of these searches
on the left-hand corner. There’s already a report called “Recent Denied
Accesses”. If you checked that report now, you may see that in the course
of this tutorial, some tools have tried to access a directory called
‘/user/anonymous’ that we haven’t set up, and that the services don’t have
permission to create.
POLICIES
It’s a relief to be able to audit access to your cluster and see there’s no
unexpected or unauthorized activity going on. But wouldn’t it be even better
if you could automatically apply policies to data? Let’s open the policies tab in
the top-right hand corner and create a policy to make the data we just audited
easier to find in the future.

Click the + icon to add a new policy, and name your policy “Tag Insecure Data”.
Check the box to enable the policy, and enter the following as the search query:

(permissions:"rwxrwxrwx") AND (sourceType:hdfs) AND (type:file OR type:directory) AND (deleted:false)
This query will detect any files in HDFS that allow anyone to read, write,
and execute. It’s common for people to set these permissions to make sure
everything works, but your organization may want to refine this practice as you
move into production or implement more strict practices for some data sets.

To apply this tag on existing data, set the schedule to “Immediate”, and check the
box “Assign Metadata”. Under tags, enter “insecure”, and then click “Add Tag”.
Save the policy.
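
The permission string in the policy query is the usual nine-character rendering of the owner, group, and other permission bits. As a small illustration outside Navigator, the Python sketch below converts numeric modes into that string and flags the fully open ones; the paths and modes are invented, and the helper names are hypothetical.

```python
import stat


def permission_string(mode):
    """Render permission bits as an rwx string, e.g. 0o755 -> 'rwxr-xr-x'."""
    # stat.filemode() prefixes a file-type character; drop it.
    return stat.filemode(mode)[1:]


def is_world_open(mode):
    """True when owner, group, and everyone else can read, write, and execute."""
    return permission_string(mode) == "rwxrwxrwx"


# Hypothetical paths and modes for illustration only.
entries = {
    "/user/hive/warehouse/tmp_export": 0o777,
    "/user/hive/warehouse/categories": 0o755,
    "/user/cloudera/notes.txt": 0o644,
}

insecure = sorted(path for path, mode in entries.items() if is_world_open(mode))

for path, mode in entries.items():
    print(permission_string(mode), path)
print("tag as insecure:", insecure)
```

Only the first entry matches, which is the kind of item the “Tag Insecure Data” policy is meant to surface.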

CONCLUSION
You’ve now experienced how to use Cloudera Navigator for discovery of data and metadata. This
powerful tool makes it easy to audit access, trace data lineage, and enforce policies.

With more data, and more data formats available in a multi-tenant environment, data lineage and
governance are getting challenging. Cloudera Navigator provides enterprise-grade governance that’s
built into the foundation of Apache Hadoop.

You can learn more about the various management features provided by Cloudera Manager in the
Cloudera Administrator Training for Apache Hadoop.
The End Game

We hope you have enjoyed this basic tutorial, and that you:
• Have a better understanding of some of the popular tools in CDH
• Know how to set up some basic and familiar BI use cases, as well as web
log analytics and real-time search
• Can explain to your manager why you deserve a raise!

NEXT STEPS
If you’re ready to install Cloudera’s platform on your own cluster (on premises
or in the public cloud), there are a few options:
• Try the AWS Quick Start for easy deployment of Cloudera’s platform on
AWS clusters via your own account (promo credit available)
• Try the Azure Test Drive for Cloudera Director (3-hour sandbox to
provision EDH clusters on Azure)