Anatomy of Data Source API
A deep dive into the Spark Data source API
https://github.com/phatak-dev/anatomy_of_spark_datasource_api
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Data Source API
● Schema discovery
● Build Scan
● Data type inference
● Save
● Column pruning
● Filter push-down
Data source API
● Universal API for loading/saving structured data (usage sketch below)
● Built-in support for Hive, Avro, JSON, JDBC and Parquet
● Third-party integration through spark-packages
● Support for smart sources
● Third parties already supporting
○ CSV
○ MongoDB
○ Cassandra (in the works)
etc.
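As a quick illustration of the user-facing side of this API, here is a minimal sketch using the DataFrameReader/Writer interface (Spark 1.4+); the format name com.example.csv and the file paths are purely illustrative, not part of the talk's repository.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataSourceUsage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("datasource-usage").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // loading: the format string points Spark at a package containing a DefaultSource
    val sales = sqlContext.read
      .format("com.example.csv")   // illustrative package name for a third-party csv source
      .load("sales.csv")

    // saving: the same API works in the other direction for any source that supports it
    sales.write.format("json").save("sales_json")

    sc.stop()
  }
}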
Data source API
Building CSV data source
● Ability to load and save CSV data
● Automatic schema discovery
● Support for user schema override
● Automatic data type inference
● Column pruning
● Filter push-down
Schema discovery
Tag v0.1
CsvSchemaDiscovery Example
Default Source
● Spark looks for a class named DefaultSource in the given data source package
● DefaultSource should extend the RelationProvider trait
● RelationProvider is responsible for taking user parameters and turning them into a BaseRelation
● The SchemaRelationProvider trait allows users to specify their own schema
● Ex : DefaultSource.scala (sketched below)
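A minimal sketch of such a DefaultSource, assuming a CsvRelation class like the one described on the next slide; the package name com.example.csv and the constructor shape are illustrative and may differ from the repository code.

package com.example.csv

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // called when the user does not supply a schema
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, null)

  // called when the user supplies a schema explicitly
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    CsvRelation(path, Option(schema))(sqlContext)
  }
}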
Base Relation
● Represents a collection of tuples with a known schema
● Methods that need to be overridden
○ def sqlContext
Returns the SQLContext used for building DataFrames
○ def schema: StructType
Returns the schema of the relation as a StructType (analogous to a Hive SerDe)
● Ex : CsvRelation.scala (sketched below)
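A minimal sketch of the corresponding CsvRelation, assuming the first line of the file is a header and every column is a String for now; names are illustrative, not the exact repository code.

package com.example.csv

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class CsvRelation(path: String, userSchema: Option[StructType])
                      (@transient val sqlContext: SQLContext) extends BaseRelation {

  // use the user supplied schema if present, otherwise discover it from the header line
  override def schema: StructType = userSchema.getOrElse {
    val header = sqlContext.sparkContext.textFile(path).first()
    StructType(header.split(",").map(name => StructField(name, StringType, nullable = true)))
  }
}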
Reading Data
Tag v0.2
TableScan
● TableScan is the trait to implement for reading data
● It is a BaseRelation that can produce all of its tuples as an RDD of Row objects
● Method to override
○ def buildScan(): RDD[Row]
● In the CSV example, we use sc.textFile to create the RDD and then Row.fromSeq to convert each line into a Row
● Ex : CsvTableScanExample.scala (sketched below)
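A minimal sketch of the buildScan added at this tag, mixing TableScan into the CsvRelation sketched above; it assumes no quoted commas and skips the header line.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// added to: case class CsvRelation(...) extends BaseRelation with TableScan
override def buildScan(): RDD[Row] = {
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)                                // drop the header line
       .map(line => Row.fromSeq(line.split(",", -1).toSeq))
}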
Data Type inference
Tag v0.3
Inferring data types
● Until now every value has been treated as a string
● Sample the data and infer a schema for each row
● Take the inferred schema of the first row
● Update the table scan to cast values to the right data type
● Ex : CsvSchemaDiscovery.scala (inference sketched below)
● Ex : SalesSumExample.scala
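A minimal sketch of the type inference step, trying Int and then Double before falling back to String; the helper names are illustrative, not the exact code behind CsvSchemaDiscovery.scala.

import scala.util.Try
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, StringType}

object TypeInference {
  // infer the narrowest type a single string value fits into
  def inferType(value: String): DataType =
    if (Try(value.toInt).isSuccess) IntegerType
    else if (Try(value.toDouble).isSuccess) DoubleType
    else StringType

  // cast the raw string so that buildScan produces properly typed Row values
  def castValue(value: String, dataType: DataType): Any = dataType match {
    case IntegerType => value.toInt
    case DoubleType  => value.toDouble
    case _           => value
  }
}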
Save As Csv
Tag v0.4
CreatableRelationProvider
● DefaultSource should implement CreatableRelationProvider in order to support the save call
● Override the createRelation method to implement the save mechanism
● Convert the RDD[Row] to strings and use saveAsTextFile to save
● Ex : CsvSaveExample.scala (sketched below)
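A minimal sketch of the save path, assuming DefaultSource also mixes in CreatableRelationProvider; SaveMode handling and error handling are omitted, and the details may differ from the repository code.

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider}

// added to: class DefaultSource extends RelationProvider with CreatableRelationProvider
override def createRelation(
    sqlContext: SQLContext,
    mode: SaveMode,
    parameters: Map[String, String],
    data: DataFrame): BaseRelation = {
  val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
  // RDD[Row] -> RDD[String] -> text file
  data.rdd.map(row => row.toSeq.mkString(",")).saveAsTextFile(path)
  // return a relation over the data that was just written
  createRelation(sqlContext, parameters)
}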
Column Pruning
Tag v0.5
PrunedScan
● CsvRelation should implement the PrunedScan trait to optimize column access
● PrunedScan tells the data source which columns the query wants to access
● When we build the RDD[Row] we include only the needed columns
● There is no real performance benefit for CSV data, this is just for demo, but it brings great performance benefits for sources like JDBC
● Ex : SalesSumExample.scala (pruned scan sketched below)
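A minimal sketch of the pruned scan, assuming the string-typed CsvRelation sketched earlier now mixes in PrunedScan instead of TableScan; only the requested columns are put into each Row.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// added to: case class CsvRelation(...) extends BaseRelation with PrunedScan
override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
  val fieldNames = schema.fieldNames
  // positions of the requested columns within the original csv line
  val indices = requiredColumns.map(fieldNames.indexOf(_))
  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header).map { line =>
    val values = line.split(",", -1)
    Row.fromSeq(indices.map(values(_)).toSeq)
  }
}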
Filter push-down
Tag v0.6
PrunedFilteredScan
● CsvRelation should implement the PrunedFilteredScan trait to optimize filtering
● PrunedFilteredScan pushes filters down to the data source
● When we build the RDD[Row] we include only the rows which satisfy the filters
● It is an optimization; Spark will evaluate the filters again
● Ex : CsvFilerExample.scala (sketched below)
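A minimal sketch of filter push-down with PrunedFilteredScan; only EqualTo filters are handled here and everything else is left to Spark, which re-evaluates the filters anyway, so correctness does not depend on this step.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{EqualTo, Filter}

// added to: case class CsvRelation(...) extends BaseRelation with PrunedFilteredScan
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val fieldNames = schema.fieldNames
  val indices = requiredColumns.map(fieldNames.indexOf(_))

  // keep a line only if it satisfies every filter we know how to evaluate
  def matches(values: Array[String]): Boolean = filters.forall {
    case EqualTo(attribute, value) => values(fieldNames.indexOf(attribute)) == value.toString
    case _                         => true    // unsupported filters are left to Spark
  }

  val lines = sqlContext.sparkContext.textFile(path)
  val header = lines.first()
  lines.filter(_ != header)
       .map(_.split(",", -1))
       .filter(matches)
       .map(values => Row.fromSeq(indices.map(values(_)).toSeq))
}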
