Anatomy of Data Source API
A deep dive into the Spark Data source API
https://github.com/phatak-dev/anatomy_of_spark_datasource_api
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Data Source API
● Schema discovery
● Build Scan
● Data type inference
● Save
● Column pruning
● Filter push-down
Data source API
● Universal API for loading/saving structured data (usage sketch below)
● Built-in support for Hive, Avro, JSON, JDBC and Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra (in the works)
etc.
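As a quick illustration of the user-facing side of this API, here is a minimal sketch using the DataFrameReader/Writer interface (Spark 1.4+); the format name com.example.csv and the file paths are purely illustrative, not part of the talk's repository.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataSourceUsage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("datasource-usage").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // loading: the format string points Spark at a package containing a DefaultSource
    val sales = sqlContext.read
      .format("com.example.csv")   // illustrative package name for a third-party csv source
      .load("sales.csv")

    // saving: the same API works in the other direction for any source that supports it
    sales.write.format("json").save("sales_json")

    sc.stop()
  }
}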
Data source API
Building CSV data source
● Ability to load and save CSV data
● Automatic schema discovery
● Support for user schema override
● Automatic data type inference
● Column pruning
● Filter push-down
Schema discovery
Tag v0.1
CsvSchemaDiscovery Example
Default Source
● Spark looks for a class named DefaultSource in the given data source package
● DefaultSource should extend the RelationProvider trait
● RelationProvider is responsible for taking user parameters and turning them into a BaseRelation
● The SchemaRelationProvider trait allows users to specify their own schema
● Ex : DefaultSource.scala (sketched below)
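A minimal sketch of such a DefaultSource, assuming a CsvRelation class like the one described on the next slide; the package name com.example.csv and the constructor shape are illustrative and may differ from the repository code.

package com.example.csv

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // called when the user does not supply a schema
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, null)

  // called when the user supplies a schema explicitly
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    CsvRelation(path, Option(schema))(sqlContext)
  }
}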
Base Relation
● Represents a collection of tuples with a known schema
● Methods that need to be overridden
○ def sqlContext
Returns the SQLContext used for building DataFrames
○ def schema: StructType
Returns the schema of the relation as a StructType (analogous to a Hive SerDe)
● Ex : CsvRelation.scala (sketched below)
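A minimal sketch of the corresponding CsvRelation, assuming the first line of the file is a header and every column is a String for now; names are illustrative, not the exact repository code.

package com.example.csv

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class CsvRelation(path: String, userSchema: Option[StructType])
                      (@transient val sqlContext: SQLContext) extends BaseRelation {

  // use the user supplied schema if present, otherwise discover it from the header line
  override def schema: StructType = userSchema.getOrElse {
    val header = sqlContext.sparkContext.textFile(path).first()
    StructType(header.split(",").map(name => StructField(name, StringType, nullable = true)))
  }
}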
Reading Data
Tag v0.2
TableScan
● TableScan is the trait to implement for reading data
● It is a BaseRelation that can produce all of its tuples as an RDD of Row objects
● Method to override
○ def buildScan(): RDD[Row]
● In the CSV example, we use sc.textFile to create the RDD and then Row.fromSeq to convert each line into a Row
● Ex : CsvTableScanExample.scala (sketched below)
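A minimal sketch of the buildScan added at this tag, mixing TableScan into the CsvRelation sketched above; it assumes no quoted commas and skips the header line.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// added to: case class CsvRelation(...) extends BaseRelation with TableScan
override def buildScan(): RDD[Row] = {
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)                                // drop the header line
       .map(line => Row.fromSeq(line.split(",", -1).toSeq))
}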
Data Type inference
Tag v0.3
Inferring data types
● Until now every value has been treated as a string
● Sample the data and infer a schema for each row
● Take the inferred schema of the first row
● Update the table scan to cast values to the right data type
● Ex : CsvSchemaDiscovery.scala (inference sketched below)
● Ex : SalesSumExample.scala
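A minimal sketch of the type inference step, trying Int and then Double before falling back to String; the helper names are illustrative, not the exact code behind CsvSchemaDiscovery.scala.

import scala.util.Try
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, StringType}

object TypeInference {
  // infer the narrowest type a single string value fits into
  def inferType(value: String): DataType =
    if (Try(value.toInt).isSuccess) IntegerType
    else if (Try(value.toDouble).isSuccess) DoubleType
    else StringType

  // cast the raw string so that buildScan produces properly typed Row values
  def castValue(value: String, dataType: DataType): Any = dataType match {
    case IntegerType => value.toInt
    case DoubleType  => value.toDouble
    case _           => value
  }
}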
Save As Csv
Tag v0.4
CreatableRelationProvider
● DefaultSource should implement CreatableRelationProvider in order to support the save call
● Override the createRelation method to implement the save mechanism
● Convert the RDD[Row] to strings and use saveAsTextFile to save
● Ex : CsvSaveExample.scala (sketched below)
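A minimal sketch of the save path, assuming DefaultSource also mixes in CreatableRelationProvider; SaveMode handling and error handling are omitted, and the details may differ from the repository code.

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// added to: class DefaultSource extends RelationProvider with CreatableRelationProvider
override def createRelation(
    sqlContext: SQLContext,
    mode: SaveMode,
    parameters: Map[String, String],
    data: DataFrame): BaseRelation = {
  val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
  // RDD[Row] -> RDD[String] -> text file
  data.rdd.map(row => row.toSeq.mkString(",")).saveAsTextFile(path)
  // return a relation over the data that was just written
  createRelation(sqlContext, parameters)
}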
Column Pruning
Tag v0.5
PrunedScan
● CsvRelation should implement the PrunedScan trait to optimize column access
● PrunedScan tells the data source which columns the query wants to access
● When we build the RDD[Row] we include only the needed columns
● There is no real performance benefit for CSV data, this is just for demo, but it brings great performance benefits for sources like JDBC
● Ex : SalesSumExample.scala (pruned scan sketched below)
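A minimal sketch of the pruned scan, assuming the string-typed CsvRelation sketched earlier now mixes in PrunedScan instead of TableScan; only the requested columns are put into each Row.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// added to: case class CsvRelation(...) extends BaseRelation with PrunedScan
override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
  val fieldNames = schema.fieldNames
  // positions of the requested columns within the original csv line
  val indices = requiredColumns.map(fieldNames.indexOf(_))
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header).map { line =>
    val values = line.split(",", -1)
    Row.fromSeq(indices.map(values(_)).toSeq)
  }
}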
Filter push-down
Tag v0.6
PrunedFilteredScan
● CsvRelation should implement the PrunedFilteredScan trait to optimize filtering
● PrunedFilteredScan pushes filters down to the data source
● When we build the RDD[Row] we include only the rows which satisfy the filters
● It is an optimization; Spark will evaluate the filters again
● Ex : CsvFilerExample.scala (sketched below)
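A minimal sketch of filter push-down with PrunedFilteredScan; only EqualTo filters are handled here and everything else is left to Spark, which re-evaluates the filters anyway, so correctness does not depend on this step.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter}

// added to: case class CsvRelation(...) extends BaseRelation with PrunedFilteredScan
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val fieldNames = schema.fieldNames
  val indices = requiredColumns.map(fieldNames.indexOf(_))

  // keep a line only if it satisfies every filter we know how to evaluate
  def matches(values: Array[String]): Boolean = filters.forall {
    case EqualTo(attribute, value) => values(fieldNames.indexOf(attribute)) == value.toString
    case _                         => true    // unsupported filters are left to Spark
  }

  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)
       .map(_.split(",", -1))
       .filter(matches)
       .map(values => Row.fromSeq(indices.map(values(_)).toSeq))
}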
