Data Exploration on Databricks
Parsing weblogs with regular expressions to create a table
Original Format: %s %s %s [%s] \"%s %s HTTP/1.1\" %s %s
Example Web Log Row
10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288
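Before defining the table, it can help to confirm that the regular expression really decomposes a sample row into the nine fields we expect. Here is a minimal sketch in plain Python using the standard-library `re` module (the same pattern written with single backslashes; in the SQL table definition the backslashes are doubled because the pattern sits inside a string literal, and the variable names here are just for illustration):

```python
import re

# Single-backslash form of the SerDe pattern: ip, identd, userid,
# [timestamp], "method endpoint protocol", status code, content size.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'
)

row = ('10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] '
       '"GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288')

match = LOG_PATTERN.match(row)
(ipaddress, clientidentd, userid, datetime, method,
 endpoint, protocol, response_code, content_size) = match.groups()
print(ipaddress, endpoint, response_code)
# 10.0.0.213 /Hurricane+Ridge/rss.xml 200
```

Each of the nine capture groups maps, in order, to one column of the table defined below.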
Create External Table
Create an external table against the weblog data, defining a regular expression as part of the
serializer/deserializer (SerDe) definition. Instead of writing separate ETL logic to parse each row, the table definition handles it.
> DROP TABLE IF EXISTS weblog;
CREATE EXTERNAL TABLE weblog (
  ipaddress STRING,
  clientidentd STRING,
  userid STRING,
  datetime STRING,
  method STRING,
  endpoint STRING,
  protocol STRING,
  responseCode INT,
  contentSize BIGINT
)
ROW FORMAT
  SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \\"(\\S+) (\\S+) (\\S+)\\" (\\d{3}) (\\d+)'
)
LOCATION
  "/mnt/my-data/apache"
OK
Note: You can run a CACHE TABLE statement to speed up queries against tables you access regularly.
> CACHE TABLE weblog;
OK
Query your weblogs using Spark SQL
Instead of parsing and extracting the datetime, method, endpoint, and protocol columns yourself, the external table has already
done this for you. Now you can treat your weblog data like any other structured dataset and write
Spark SQL against the table.
> select * from weblog limit 10;
ipaddress   clientidentd  userid   datetime                    method  endpoint
10.0.0.127  -             2696232  14/Aug/2015:00:00:26 -0800  GET     /index.html
10.0.0.104  -             2404465  14/Aug/2015:00:01:14 -0800  GET     /Cascades/rss.xml
10.0.0.108  -             2404465  14/Aug/2015:00:04:21 -0800  GET     /Olympics/rss.xml
10.0.0.213  -             2185662  14/Aug/2015:00:05:15 -0800  GET     /Hurricane+Ridge/rss.xml
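From here, typical exploration is plain SQL aggregation over the table. As a minimal offline sketch, the example below uses Python's standard-library sqlite3 module to stand in for Spark SQL; the inserted rows are hypothetical sample values invented for illustration (only the Hurricane+Ridge row's status and size come from the example log line above), and on Databricks you would run the same SELECT directly against the weblog table:

```python
import sqlite3

# Stand-in for the weblog table. The rows below are hypothetical sample
# data for illustration; in Databricks the RegexSerDe fills these columns
# from the raw log files.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE weblog (
        ipaddress TEXT, userid TEXT, endpoint TEXT,
        responseCode INTEGER, contentSize INTEGER
    )
""")
conn.executemany(
    "INSERT INTO weblog VALUES (?, ?, ?, ?, ?)",
    [
        ("10.0.0.127", "2696232", "/index.html",              200, 310),
        ("10.0.0.104", "2404465", "/Cascades/rss.xml",        200, 512),
        ("10.0.0.108", "2404465", "/Olympics/rss.xml",        404,   0),
        ("10.0.0.213", "2185662", "/Hurricane+Ridge/rss.xml", 200, 288),
        ("10.0.0.213", "2185662", "/index.html",              200, 310),
    ],
)

# Top endpoints by successful hits -- the same SELECT works as Spark SQL.
rows = conn.execute("""
    SELECT endpoint, COUNT(*) AS hits
    FROM weblog
    WHERE responseCode = 200
    GROUP BY endpoint
    ORDER BY hits DESC, endpoint
""").fetchall()
for endpoint, hits in rows:
    print(endpoint, hits)
# /index.html 2
# /Cascades/rss.xml 1
# /Hurricane+Ridge/rss.xml 1
```

Because the SerDe declared responseCode as INT and contentSize as BIGINT, numeric filters and aggregates like these work without any casting.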
https://cdn2.hubspot.net/hubfs/438089/notebooks/Samples/Data_Exploration/Data_Exploration_on_Databricks.html