Data Exploration on Databricks
Parsing weblogs with regular expressions to create a table
Original Format: %s %s %s [%s] \"%s %s HTTP/1.1\" %s %s
Example Web Log Row
10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] "GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288
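Before defining the table, it can help to confirm that the regular expression really decomposes a sample row into the nine fields we expect. Here is a minimal sketch in plain Python using the standard-library `re` module (the same pattern written with single backslashes; in the SQL table definition the backslashes are doubled because the pattern sits inside a string literal, and the variable names here are just for illustration):

```python
import re

# Single-backslash form of the SerDe pattern: ip, identd, userid,
# [timestamp], "method endpoint protocol", status code, content size.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)'
)

row = ('10.0.0.213 - 2185662 [14/Aug/2015:00:05:15 -0800] '
       '"GET /Hurricane+Ridge/rss.xml HTTP/1.1" 200 288')

match = LOG_PATTERN.match(row)
(ipaddress, clientidentd, userid, datetime, method,
 endpoint, protocol, response_code, content_size) = match.groups()
print(ipaddress, endpoint, response_code)
# 10.0.0.213 /Hurricane+Ridge/rss.xml 200
```

Each of the nine capture groups maps, in order, to one column of the table defined below.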
Create External Table
Create an external table against the weblog data, defining a regular expression as part of the
serializer/deserializer (SerDe) definition. Instead of writing separate ETL logic to parse each row, the table definition handles it.
> DROP TABLE IF EXISTS weblog;
CREATE EXTERNAL TABLE weblog (
  ipaddress STRING,
  clientidentd STRING,
  userid STRING,
  datetime STRING,
  method STRING,
  endpoint STRING,
  protocol STRING,
  responseCode INT,
  contentSize BIGINT
)
ROW FORMAT
  SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \\"(\\S+) (\\S+) (\\S+)\\" (\\d{3}) (\\d+)'
)
LOCATION
  "/mnt/my-data/apache"
OK
Note: You can run a CACHE TABLE statement to speed up queries against tables you access regularly.
> CACHE TABLE weblog;
OK
Query your weblogs using Spark SQL
Instead of parsing and extracting the datetime, method, endpoint, and protocol columns yourself, the external table has already
done this for you. Now you can treat your weblog data like any other structured dataset and write
Spark SQL against the table.
> select * from weblog limit 10;
ipaddress   clientidentd  userid   datetime                    method  endpoint
10.0.0.127  -             2696232  14/Aug/2015:00:00:26 -0800  GET     /index.html
10.0.0.104  -             2404465  14/Aug/2015:00:01:14 -0800  GET     /Cascades/rss.xml
10.0.0.108  -             2404465  14/Aug/2015:00:04:21 -0800  GET     /Olympics/rss.xml
10.0.0.213  -             2185662  14/Aug/2015:00:05:15 -0800  GET     /Hurricane+Ridge/rss.xml
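From here, typical exploration is plain SQL aggregation over the table. As a minimal offline sketch, the example below uses Python's standard-library sqlite3 module to stand in for Spark SQL; the inserted rows are hypothetical sample values invented for illustration (only the Hurricane+Ridge row's status and size come from the example log line above), and on Databricks you would run the same SELECT directly against the weblog table:

```python
import sqlite3

# Stand-in for the weblog table. The rows below are hypothetical sample
# data for illustration; in Databricks the RegexSerDe fills these columns
# from the raw log files.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE weblog (
        ipaddress TEXT, userid TEXT, endpoint TEXT,
        responseCode INTEGER, contentSize INTEGER
    )
""")
conn.executemany(
    "INSERT INTO weblog VALUES (?, ?, ?, ?, ?)",
    [
        ("10.0.0.127", "2696232", "/index.html",              200, 310),
        ("10.0.0.104", "2404465", "/Cascades/rss.xml",        200, 512),
        ("10.0.0.108", "2404465", "/Olympics/rss.xml",        404,   0),
        ("10.0.0.213", "2185662", "/Hurricane+Ridge/rss.xml", 200, 288),
        ("10.0.0.213", "2185662", "/index.html",              200, 310),
    ],
)

# Top endpoints by successful hits -- the same SELECT works as Spark SQL.
rows = conn.execute("""
    SELECT endpoint, COUNT(*) AS hits
    FROM weblog
    WHERE responseCode = 200
    GROUP BY endpoint
    ORDER BY hits DESC, endpoint
""").fetchall()
for endpoint, hits in rows:
    print(endpoint, hits)
# /index.html 2
# /Cascades/rss.xml 1
# /Hurricane+Ridge/rss.xml 1
```

Because the SerDe declared responseCode as INT and contentSize as BIGINT, numeric filters and aggregates like these work without any casting.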
https://cdn2.hubspot.net/hubfs/438089/notebooks/Samples/Data_Exploration/Data_Exploration_on_Databricks.html