Data Exploration on Databricks


Parsing weblogs with regular expressions to create a table


Original Format: %s %s %s [%s] \"%s %s HTTP/1.1\" %s %s
Example Web Log Row
10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288
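
To make the mapping concrete, the example row above decomposes into the nine fields that the table definition below extracts (this breakdown is derived from the format string and column list; it is not part of the original notebook):

ipaddress     10.0.0.213
clientidentd  -
userid        2185662
datetime      14/Aug/2015:00:05:15 -0800
method        GET
endpoint      /Hurricane+Ridge/rss.xml
protocol      HTTP/1.1
responseCode  200
contentSize   288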

Create External Table


Create an external table against the weblog data, defining a regular expression as part of the serializer/deserializer (SerDe) definition. Instead of writing separate ETL logic to parse the logs, the table definition handles the parsing.

> DROP TABLE IF EXISTS weblog;


CREATE EXTERNAL TABLE weblog (
  ipaddress STRING,
  clientidentd STRING,
  userid STRING,
  datetime STRING,
  method STRING,
  endpoint STRING,
  protocol STRING,
  responseCode INT,
  contentSize BIGINT
)
ROW FORMAT
  SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \\"(\\S+) (\\S+) (\\S+)\\" (\\d{3}) (\\d+)'
)
LOCATION
  "/mnt/my-data/apache";

OK
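
To confirm that the table was registered with the expected schema, you can describe it (an extra verification step, added here for illustration; it should list the nine columns and types exactly as declared above):

> DESCRIBE weblog;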

Note: You can run a CACHE TABLE statement to speed up queries against tables that you access
regularly.

> CACHE TABLE weblog;

OK
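
If you later want to release that memory, Spark SQL also provides the corresponding UNCACHE TABLE statement (shown here as an additional example, not part of the original notebook):

> UNCACHE TABLE weblog;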

Query your weblogs using Spark SQL


Instead of parsing and extracting the datetime, method, endpoint, and protocol columns yourself, the external table has
already done this for you. Now you can treat your weblog data like any other structured dataset and write Spark SQL
against the table.

> SELECT * FROM weblog LIMIT 10;

ipaddress   clientidentd  userid   datetime                    method  endpoint

10.0.0.127  -             2696232  14/Aug/2015:00:00:26 -0800  GET     /index.html
10.0.0.104  -             2404465  14/Aug/2015:00:01:14 -0800  GET     /Cascades/rss.xml
10.0.0.108  -             2404465  14/Aug/2015:00:04:21 -0800  GET     /Olympics/rss.xml
10.0.0.213  -             2185662  14/Aug/2015:00:05:15 -0800  GET     /Hurricane+Ridge/rss.xml
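
Once the table is in place, any Spark SQL works against it. As a further example (an illustrative addition, not from the original notebook), the following query counts requests per endpoint and returns the most requested pages:

> SELECT endpoint, COUNT(*) AS requests
  FROM weblog
  GROUP BY endpoint
  ORDER BY requests DESC
  LIMIT 5;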
