Spark SQL
• Spark introduces a programming module for structured data
processing called Spark SQL.
• Spark SQL makes it easier and more efficient to load and query structured data.
• It can load data from a variety of structured sources (e.g., JSON, Hive).
• It lets you query the data using SQL, both inside a Spark program
and from external tools that connect to Spark SQL through standard
database connectors (JDBC/ODBC), such as business intelligence
tools like Tableau.
• When used within a Spark program, Spark SQL provides rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables and to expose custom functions in SQL.
Spark SQL
• It provides a programming abstraction called DataFrame and can
act as distributed SQL query engine.
• Spark SQL can be built with or without Apache Hive, the Hadoop
SQL engine.
• Spark SQL with Hive support allows us to access Hive tables and UDFs (user-defined functions).
• When programming against Spark SQL we have two entry points
depending on whether we need Hive support.
• The recommended entry point is HiveContext, which provides access to HiveQL and other Hive-dependent functionality.
• The more basic SQLContext provides a subset of the Spark SQL support that does not depend on Hive; both entry points are sketched below.
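As a sketch, both entry points are created from the SparkContext sc that spark-shell provides (the same constructors appear again on later slides):
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // basic entry point, no Hive dependency
scala> val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) // adds HiveQL, Hive tables, and Hive UDF support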
Spark SQL
• Support relational processing both within Spark programs (on native RDDs) and on external data sources through a programmer-friendly API.
• Provide high performance using established DBMS techniques.
• Easily support new data sources, including semi-structured data
and external databases amenable to query federation.
• Enable extension with advanced analytics algorithms such as
graph processing and machine learning.
Spark SQL Architecture
• Spark SQL runs as a library on
top of Spark
• It exposes SQL interfaces, which
can be accessed through
JDBC/ODBC or through a
command-line console, as well as
the DataFrame API integrated
into Spark’s supported
programming languages.
• The Catalyst optimizer is used to optimize query performance in Spark.
Spark SQL Architecture
• This architecture contains three layers, namely Language API, Schema RDD, and Data Sources.
• Language API − Spark SQL is compatible with different languages and exposes APIs in Python, Scala, Java, and HiveQL.
• Schema RDD − Spark Core is designed around a special data structure called the RDD. Spark SQL works on schemas, tables, and records, so a Schema RDD can be used as a temporary table. A Schema RDD is also called a DataFrame.
• Data Sources − Usually the data source for Spark Core is a text file, an Avro file, etc. The data sources for Spark SQL are different: Parquet files, JSON documents, Hive tables, and Cassandra databases.
Different types of data sources available in Spark SQL
1. JSON Datasets − Spark SQL can automatically capture the schema of a JSON dataset and load it as a DataFrame.
2. Hive Tables − Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext.
3. Parquet Files − Parquet is a columnar format supported by many data processing systems.
A DataFrame interface allows different data sources to work with Spark SQL. It is a temporary table and can be operated on as a normal RDD. Registering a DataFrame as a table allows you to run SQL queries over its data.
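As a rough sketch, each source type can be loaded through the DataFrame reader; the file and table names below are hypothetical, and the Hive line assumes a HiveContext named hiveCtx:
scala> val jsonDF = sqlContext.read.json("people.json") // JSON dataset, schema inferred automatically
scala> val parquetDF = sqlContext.read.parquet("people.parquet") // columnar Parquet file
scala> val hiveDF = hiveCtx.sql("SELECT * FROM some_hive_table") // existing Hive table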
DataFrame
• DataFrame is a distributed collection of data, which is organized
into named columns.
• Conceptually, it is equivalent to relational tables with good
optimization techniques.
• A DataFrame can be constructed from an array of different sources
such as Hive tables, Structured Data files, external databases, or
existing RDDs.
• DataFrames support all common relational operators, including
projection (select), filter (where), join, and aggregations (groupBy).
• A DataFrame is immutable; we cannot change it.
• If we want to delete some rows, we can apply the filter method (see the sketch below).
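A minimal sketch of removing rows from an immutable DataFrame (assuming a DataFrame dfs with an age column, as in the later examples): filter does not modify dfs, it returns a new DataFrame without the filtered-out rows.
scala> val adults = dfs.filter(dfs("age") > 18) // dfs itself is unchanged; adults is a new DataFrame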
SQLContext
• SQLContext is a class and is used for initializing the
functionalities of Spark SQL.
• Command to initialize the SparkContext (sc) through spark-shell:
C:/>spark-shell
• Use the following command to create SQLContext.
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Example
• Read employee records stored in a JSON file named employee.json.
• employee.json − Place this file in the directory where the scala> prompt is running (i.e., the spark-shell working directory).
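A plausible employee.json is sketched below, with one JSON object per line as the JSON reader expects; the records are purely illustrative, chosen only to match the schema and the age counts shown on the following slides.
{"id":"1201", "name":"satish", "age":"25"}
{"id":"1202", "name":"krishna", "age":"28"}
{"id":"1203", "name":"amith", "age":"39"}
{"id":"1204", "name":"javed", "age":"23"}
{"id":"1205", "name":"prudvi", "age":"23"}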
File Reading and Writing
• Read the JSON Document
scala> val dfs = sqlContext.read.json("employee.json")
• Show the Data
scala> dfs.show()
• We can see the employee data in a tabular format.
• scala> val df = spark.read.csv("pima_diabetes.csv")
• scala> df.select("*").show()
• It will display all rows and columns
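Note that dfs was loaded through sqlContext, while df above uses spark, the SparkSession object that newer (2.x) Spark shells provide; both expose the same read/write API. Since the slide title also mentions writing, a minimal sketch of writing a DataFrame back out (the output directory names are hypothetical):
scala> dfs.write.json("employee_json_out") // write as JSON
scala> dfs.write.parquet("employee_parquet_out") // write as Parquet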
Display Schema
• printSchema method
Display the Structure (Schema) of the DataFrame
scala> dfs.printSchema()
Output
root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
Select and filter Method
• Use the following command to fetch the name column from among the three columns of the DataFrame.
• scala> dfs.select("name").show()
• scala> dfs.filter(dfs("age") > 23).show()
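The two operations can also be chained; a sketch that keeps only the name and age columns of employees older than 23:
scala> dfs.filter(dfs("age") > 23).select("name", "age").show()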
groupBy method
• Use the following command to count the number of employees who are of the same age.
scala> dfs.groupBy("age").count().show()
Output − two employees have age 23.
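The same aggregation can be expressed in SQL after registering the DataFrame as a temporary table; a sketch, where the table name employee_tbl is an arbitrary choice:
scala> dfs.registerTempTable("employee_tbl")
scala> sqlContext.sql("SELECT age, COUNT(*) AS cnt FROM employee_tbl GROUP BY age").show()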
Reading Hive Tables
• employee.txt − a comma-separated text file of employee records (id, name, age); a sample is sketched after the HiveQL commands below.
• Start spark shell
#spark-shell
scala>
• Create HiveContext Object
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
SQL commands
• Create Table using HiveQL
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS employee(id
INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n’”)
• Load Data into Table using HiveQL
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 'employee.txt' INTO
TABLE employee")
• Select Fields from the Table
scala> val result = sqlContext.sql("FROM employee SELECT id, name,
age")
scala> result.show() // display the result
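For the LOAD DATA step to succeed, employee.txt must match the table definition (comma-separated fields, one record per line). A purely illustrative sample:
1201,satish,25
1202,krishna,28
1203,amith,39
1204,javed,23
1205,prudvi,23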
SQL Code in Scala
• import org.apache.spark.sql.hive.HiveContext
• import org.apache.spark.sql.SQLContext
// sc does not need to be imported or created; the Spark shell already provides it
• val hiveCtx = new HiveContext(sc)
// Constructing a HiveContext (a SQL context with Hive support) in Scala
• val input = hiveCtx.jsonFile("iris.json")
• // Register the input schema RDD as temporary table iris
• input.registerTempTable("iris")
// Select records based on petalLength
• val topRows = hiveCtx.sql("select petalLength,petalWidth from iris order by
petalLength limit 5")
• topRows.show()
GROUP BY
• scala> val res = hiveCtx.sql("SELECT SUM(petalLength),
SUM(sepalLength), species FROM iris GROUP BY species")
• scala> res.show()
Inner Join
• scala> val i = spark.sql("select * from emp join dept on (emp.id = dept.eid)")
• scala> i.show() // display the result of the join
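The join above assumes that emp and dept already exist as tables visible to spark.sql. One way they might be set up, sketched with hypothetical JSON files whose columns match the join condition:
scala> val emp = spark.read.json("emp.json") // hypothetical file, assumed to contain an id column
scala> val dept = spark.read.json("dept.json") // hypothetical file, assumed to contain an eid column
scala> emp.registerTempTable("emp") // makes emp visible to SQL queries
scala> dept.registerTempTable("dept")
After registration, the join query above runs against these temporary tables.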
User-Defined Functions in Spark SQL
• User-defined functions, or UDFs, allow you to register custom
functions in Python, Java, and Scala to call within SQL.
• scala> hiveCtx.udf.register ("strLenScala", (_: String).length)
• scala> val speciesLength = hiveCtx.sql("SELECT
strLenScala(iris.species) FROM iris where species = 'setosa' limit
1")