5/5/2020 Data Exploration on Databricks (Setup) - Databricks
Data Exploration on Databricks (Setup)
Parsing weblogs with regular expressions to create a table
Original Format: %s %s %s [%s] "%s %s HTTP/1.1" %s %s
Example Web Log Row
10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288
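The format string above maps onto a regular expression fairly directly. The following is a minimal sketch, not the notebook's own parser: the pattern, the function name parse_log_line, and the field names (ip, client_id, user_id, and so on) are all assumptions chosen for illustration.

```python
import re

# Assumed pattern mirroring the format: %s %s %s [%s] "%s %s HTTP/1.1" %s %s
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) HTTP/1\.1" (\d{3}) (\d+)$'
)

# Hypothetical column names; the source does not name the fields.
FIELDS = ("ip", "client_id", "user_id", "timestamp",
          "method", "path", "status", "bytes")


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = dict(zip(FIELDS, match.groups()))
    # Store the response code and response size as integers.
    row["status"] = int(row["status"])
    row["bytes"] = int(row["bytes"])
    return row


example = ('10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] '
           '"GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288')
print(parse_log_line(example))
```

Returning None for non-matching lines lets a caller filter out malformed rows before building a table, rather than failing on the first bad line.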
Setup Instructions
See the Data Exploration on Databricks How-To Guide for the location of the source files to import for this notebook.
See the Databricks Data Import How-To Guide for how to import data into S3 for use with Databricks notebooks.
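> import urllib
# AWS credentials for the S3 bucket that holds the source files (replace the placeholders).
ACCESS_KEY = "[REPLACE_WITH_ACCESS_KEY]"
SECRET_KEY = "[REPLACE_WITH_SECRET_KEY]"
# URL-encode the secret key so that characters such as "/" are safe inside a URI.
# This is the Python 2 call; on Python 3 use urllib.parse.quote(SECRET_KEY, safe="").
ENCODED_SECRET_KEY = urllib.quote(SECRET_KEY, "")
# S3 bucket name and the mount name used under /mnt/.
AWS_BUCKET_NAME = "my-data-for-databricks"
MOUNT_NAME = "my-data"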
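The cell above only defines variables; the later cells read from /mnt/my-data, so the bucket presumably gets mounted in a step not shown here. The sketch below shows one way the values might be combined, under stated assumptions: the helper build_s3_source is hypothetical, the s3a:// scheme is an assumption (the original notebook may have used a different one), and the dbutils.fs.mount call is left commented out because dbutils exists only inside a Databricks notebook.

```python
from urllib.parse import quote  # Python 3 equivalent of Python 2's urllib.quote


def build_s3_source(access_key, secret_key, bucket, scheme="s3a"):
    """Build an S3 source URI with the secret key URL-encoded (hypothetical helper)."""
    # safe="" also encodes "/", which would otherwise break the URI.
    encoded_secret = quote(secret_key, safe="")
    return "{}://{}:{}@{}".format(scheme, access_key, encoded_secret, bucket)


ACCESS_KEY = "[REPLACE_WITH_ACCESS_KEY]"
SECRET_KEY = "[REPLACE_WITH_SECRET_KEY]"
AWS_BUCKET_NAME = "my-data-for-databricks"
MOUNT_NAME = "my-data"

source = build_s3_source(ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME)
mount_point = "/mnt/" + MOUNT_NAME

# Inside a Databricks notebook only (assumed usage, not part of the original cell):
# dbutils.fs.mount(source, mount_point)
```

Embedding credentials in a URI is kept here only to match the variables the notebook defines; anything printed or logged from this cell would expose the keys.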
Sample Apache Access Web Logs
> display(dbutils.fs.ls("/mnt/my-data/apache"))
path name
dbfs:/mnt/my-data/apache/ex20150814.log ex20150814
dbfs:/mnt/my-data/apache/ex20150815.log ex20150815
> myApacheLogs = sc.textFile("/mnt/my-data/apache")
myApacheLogs.take(10)
Out[11]:
[u'10.0.0.127 - 2696232 [14/Aug/2015:00:00:26 -0800] "GET /index.html HTTP/1.1" 304 428',
u'10.0.0.104 - 2404465 [14/Aug/2015:00:01:14 -0800] "GET /Cascades/rss.xml HTTP/1.1" 304 514',
u'10.0.0.108 - 2404465 [14/Aug/2015:00:04:21 -0800] "GET /Olympics/rss.xml HTTP/1.1" 200 499',
u'10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288',
u'10.0.0.203 - 2185662 [14/Aug/2015:00:05:17 -0800] "GET /index.html HTTP/1.1" 200 212',
u'10.0.0.104 - 2696232 [14/Aug/2015:00:06:09 -0800] "GET /Cascades/rss.xml HTTP/1.1" 304 420',
u'10.0.0.206 - 2576242 [14/Aug/2015:00:08:40 -0800] "GET /index.html HTTP/1.1" 304 343',
u'10.0.0.213 - 2185662 [14/Aug/2015:00:09:07 -0800] "GET /Olympics/rss.xml HTTP/1.1" 304 323',
u'10.0.0.212 - 2404465 [14/Aug/2015:00:10:29 -0800] "GET /index.html HTTP/1.1" 304 530',
u'10.0.0.114 - 2575718 [14/Aug/2015:00:11:22 -0800] "GET /index.html HTTP/1.1" 304 341']
Sample Web Response Codes
> display(dbutils.fs.ls("/mnt/my-data/response"))
path nam
dbfs:/mnt/my-data/response/responsecodes.txt respo