Data Exploration On Databricks (Setup) - Databricks

The document provides setup instructions for data exploration on Databricks, specifically focusing on parsing weblogs using regular expressions. It includes code snippets for importing data into S3 and accessing Apache access web logs. Additionally, it mentions sample web response codes available for analysis.

Uploaded by

Tuan Minh Pham

5/5/2020 Data Exploration on Databricks (Setup) - Databricks

Data Exploration on Databricks (Setup)


Parsing weblogs with regular expressions to create a table


Original Format: %s %s %s [%s] "%s %s HTTP/1.1" %s %s
Example Web Log Row
10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288
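
The format string above maps naturally onto one regular expression with a capture group per `%s`. The following is a minimal sketch of that idea, not the notebook's own parser: the group names (`ip`, `client`, `user`, and so on) and the helper `parse_log_line` are illustrative assumptions, since the source does not name the fields.

```python
import re

# Mirrors the format string: %s %s %s [%s] "%s %s HTTP/1.1" %s %s
# Group names are assumptions made for illustration only.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<client>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) HTTP/1\.1" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the two numeric columns so they can be aggregated later.
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    return fields


example = ('10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] '
           '"GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288')
parsed = parse_log_line(example)
print(parsed["ip"], parsed["path"], parsed["status"], parsed["size"])
```

Returning `None` for non-matching lines lets a caller filter out malformed rows instead of failing on the first one.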

Setup Instructions
Please refer to the Data Exploration on Databricks How-To Guide for the location of the source files to import for this
notebook.
Please refer to the Databricks Data Import How-To Guide on how to import data into S3 for use with Databricks
notebooks.

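> import urllib  # Python 2 module layout; on Python 3 use urllib.parse.quote instead

# Placeholders: substitute your own AWS credentials before running.
ACCESS_KEY = "[REPLACE_WITH_ACCESS_KEY]"
SECRET_KEY = "[REPLACE_WITH_SECRET_KEY]"
# URL-encode the secret key; the empty second argument (safe="") means "/" is percent-encoded too.
ENCODED_SECRET_KEY = urllib.quote(SECRET_KEY, "")
AWS_BUCKET_NAME = "my-data-for-databricks"
MOUNT_NAME = "my-data"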
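
The reason for passing an empty `safe` argument when encoding the secret key is easiest to see on a concrete value. The sketch below uses a made-up key (`example_secret` is hypothetical, not a real credential) and an import that works on both Python 2 and Python 3. The notebook excerpt does not show how the encoded key is used afterwards; presumably it is embedded in an S3 URI, where an unescaped slash would be read as a path separator.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used in the notebook

# Hypothetical secret key, chosen only because it contains "/" and "+".
example_secret = "abc/def+ghi"

# The default safe="/" leaves the slash alone; safe="" encodes it as well.
default_encoded = quote(example_secret)
fully_encoded = quote(example_secret, "")

print(default_encoded)  # abc/def%2Bghi
print(fully_encoded)    # abc%2Fdef%2Bghi
```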

Sample Apache Access Web Logs


> display(dbutils.fs.ls("/mnt/my-data/apache"))

path                                       name
dbfs:/mnt/my-data/apache/ex20150814.log    ex20150814
dbfs:/mnt/my-data/apache/ex20150815.log    ex20150815

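> # Read every file under the mounted directory as a single RDD of text lines.
myApacheLogs = sc.textFile("/mnt/my-data/apache")

# Fetch the first 10 lines to check the format.
myApacheLogs.take(10)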

Out[11]:
[u'10.0.0.127 - 2696232 [14/Aug/2015:00:00:26 -0800] "GET /index.html HTTP/1.1" 304 428',
u'10.0.0.104 - 2404465 [14/Aug/2015:00:01:14 -0800] "GET /Cascades/rss.xml HTTP/1.1" 304 514',
u'10.0.0.108 - 2404465 [14/Aug/2015:00:04:21 -0800] "GET /Olympics/rss.xml HTTP/1.1" 200 499',
u'10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288',
u'10.0.0.203 - 2185662 [14/Aug/2015:00:05:17 -0800] "GET /index.html HTTP/1.1" 200 212',
u'10.0.0.104 - 2696232 [14/Aug/2015:00:06:09 -0800] "GET /Cascades/rss.xml HTTP/1.1" 304 420',
u'10.0.0.206 - 2576242 [14/Aug/2015:00:08:40 -0800] "GET /index.html HTTP/1.1" 304 343',
u'10.0.0.213 - 2185662 [14/Aug/2015:00:09:07 -0800] "GET /Olympics/rss.xml HTTP/1.1" 304 323',
u'10.0.0.212 - 2404465 [14/Aug/2015:00:10:29 -0800] "GET /index.html HTTP/1.1" 304 530',
u'10.0.0.114 - 2575718 [14/Aug/2015:00:11:22 -0800] "GET /index.html HTTP/1.1" 304 341']
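
As a quick check on the sample output above, the response codes in those ten lines can be tallied in plain Python without a cluster. This is only a sketch: the helper `status_code` is a hypothetical name, and it relies on the assumption that the response code and size are always the last two space-separated tokens, as they are in every line shown.

```python
from collections import Counter

# The ten lines returned by take(10) above, copied verbatim.
sample_lines = [
    '10.0.0.127 - 2696232 [14/Aug/2015:00:00:26 -0800] "GET /index.html HTTP/1.1" 304 428',
    '10.0.0.104 - 2404465 [14/Aug/2015:00:01:14 -0800] "GET /Cascades/rss.xml HTTP/1.1" 304 514',
    '10.0.0.108 - 2404465 [14/Aug/2015:00:04:21 -0800] "GET /Olympics/rss.xml HTTP/1.1" 200 499',
    '10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288',
    '10.0.0.203 - 2185662 [14/Aug/2015:00:05:17 -0800] "GET /index.html HTTP/1.1" 200 212',
    '10.0.0.104 - 2696232 [14/Aug/2015:00:06:09 -0800] "GET /Cascades/rss.xml HTTP/1.1" 304 420',
    '10.0.0.206 - 2576242 [14/Aug/2015:00:08:40 -0800] "GET /index.html HTTP/1.1" 304 343',
    '10.0.0.213 - 2185662 [14/Aug/2015:00:09:07 -0800] "GET /Olympics/rss.xml HTTP/1.1" 304 323',
    '10.0.0.212 - 2404465 [14/Aug/2015:00:10:29 -0800] "GET /index.html HTTP/1.1" 304 530',
    '10.0.0.114 - 2575718 [14/Aug/2015:00:11:22 -0800] "GET /index.html HTTP/1.1" 304 341',
]


def status_code(line):
    # Split off the last two tokens: response code, then content size.
    _, status, _size = line.rsplit(" ", 2)
    return int(status)


status_counts = Counter(status_code(line) for line in sample_lines)
print(sorted(status_counts.items()))  # [(200, 3), (304, 7)]
```

On the full dataset the same function could be applied to the RDD instead, for example with `myApacheLogs.map(status_code).countByValue()`.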

Sample Web Response Codes


> display(dbutils.fs.ls("/mnt/my-data/response"))

path nam
dbfs:/mnt/my-data/response/responsecodes.txt respo

https://cdn2.hubspot.net/hubfs/438089/notebooks/Samples/Data_Exploration/Data_Exploration_on_Databricks_Setup.html
