Unit Testing of Spark Applications
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Agenda
● What is Spark ?
● What is Unit Testing ?
● Why we need Unit Testing ?
● Unit Testing of Spark Applications
● Demo
What is Spark ?
● Distributed compute engine for large-scale data processing.
● Up to 100x faster than Hadoop MapReduce for in-memory workloads.
● Provides APIs in Python, Scala, Java and R (as of Spark 1.4).
● Combines SQL, streaming and complex analytics.
● Runs on Hadoop, Mesos, standalone, or in the cloud.
src: http://spark.apache.org/
What is Unit Testing ?
● Unit Testing is a software testing method in which individual units of source code are tested to determine whether they are fit for use.
● Unit tests ensure that code meets its design specification and behaves as intended.
● The goal is to isolate each part of the program and show that the individual parts are correct.
src: https://en.wikipedia.org/wiki/Unit_testing
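As a concrete illustration (not from the slides): the heart of a word count is a pure function, so it can be unit tested in complete isolation, with no framework and no Spark at all. A minimal sketch in plain Scala, using a hypothetical `WordCountLogic` helper:

    object WordCountLogic {
      // Pure function: the same transformation the Spark job applies,
      // expressed on an ordinary Scala collection so it is trivially testable.
      def wordCount(lines: Seq[String]): Map[String, Int] =
        lines
          .flatMap(_.split(" "))
          .filter(_.nonEmpty)
          .groupBy(identity)
          .map { case (word, occurrences) => (word, occurrences.size) }
    }

    object WordCountLogicDemo extends App {
      val counts = WordCountLogic.wordCount(Seq("a b a", "b c"))
      assert(counts == Map("a" -> 2, "b" -> 2, "c" -> 1))
    }

Isolating pure logic this way keeps most tests fast; the Spark-specific tests in the following slides then only need to cover the distributed wiring.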
Why we need Unit Testing ?
● Find problems early
- Finds bugs or missing parts of the specification early in the development cycle.
● Facilitates change
- Helps in refactoring and upgrading without worrying about breaking functionality.
● Simplifies integration
- Makes integration tests easier to write.
● Documentation
- Provides living documentation of the system.
● Design
- Can act as the formal design of a project.
src: https://en.wikipedia.org/wiki/Unit_testing
Unit Testing of Spark Applications
Unit to Test
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class WordCount {
  def get(url: String, sc: SparkContext): RDD[(String, Int)] = {
    val lines = sc.textFile(url)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  }
}
Method 1
import org.scalatest.{ BeforeAndAfterAll, FunSuite }
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

class WordCountTest extends FunSuite with BeforeAndAfterAll {

  private var sparkConf: SparkConf = _
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sparkConf = new SparkConf().setAppName("unit-testing").setMaster("local")
    sc = new SparkContext(sparkConf)
  }

  private val wordCount = new WordCount

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)
    assert(result.take(10).length === 10)
  }

  override def afterAll(): Unit = {
    sc.stop()
  }
}
Cons of Method 1
● Explicit management of SparkContext creation and destruction.
● Developer has to write more lines of code for testing.
● Code duplication, as the before and after steps have to be repeated in every test suite.
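The duplication described above is the classic setup/teardown problem, and in Scala it is usually factored out with the loan pattern: one helper owns creation and guaranteed destruction of the expensive resource. A framework-free sketch (using a stand-in `Resource` class rather than a real SparkContext, so it runs without Spark) of the idea that the trait in Method 2 automates:

    // Stand-in for an expensive resource such as a SparkContext.
    final class Resource {
      var stopped = false
      def stop(): Unit = stopped = true
    }

    object Loan {
      // Loan pattern: create the resource, lend it to the test body,
      // and guarantee teardown even if the body throws.
      def withResource[A](body: Resource => A): A = {
        val resource = new Resource
        try body(resource)
        finally resource.stop()
      }
    }

    object LoanDemo extends App {
      val answer = Loan.withResource { r => 42 }
      assert(answer == 42)
    }

Each test suite then borrows the resource instead of re-implementing beforeAll/afterAll, which is essentially what SharedSparkContext does for a SparkContext.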
Method 2 (Better Way)
"com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2"
Spark Testing Base
A spark package containing base classes to use when writing
tests with Spark.
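In an sbt project, the dependency string from the slide would typically be added test-scoped; a sketch of the relevant build.sbt lines (the version shown on the slide targets Spark 1.6.1):

    // build.sbt (sketch): spark-testing-base is only needed at test time.
    // The "1.6.1_0.3.2" version encodes the targeted Spark version (1.6.1).
    libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2" % "test"

    // Spark tests share JVM-wide state, so disabling parallel test
    // execution is the commonly recommended setting.
    parallelExecution in Test := false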
How ?
Method 2 (Better Way) contd...
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext

class WordCountTest extends FunSuite with SharedSparkContext {

  private val wordCount = new WordCount

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)
    assert(result.take(10).length === 10)
  }
}
Example 1
Method 2 (Better Way) contd...
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext
import com.holdenkarau.spark.testing.RDDComparisons

class WordCountTest extends FunSuite with SharedSparkContext {

  private val wordCount = new WordCount

  test("get word count rdd with comparison") {
    val expected =
      sc.textFile("file.txt")
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)

    val result = wordCount.get("file.txt", sc)

    assert(RDDComparisons.compare(expected, result).isEmpty)
  }
}
Example 2
Pros of Method 2
● Succinct code.
● Rich test API.
● Supports Scala, Java and Python.
● Provides an API for testing Streaming applications too.
● Has built-in RDD comparators.
● Supports both local & cluster mode testing.
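A note on the built-in RDD comparators used in Example 2: since RDDs have no guaranteed ordering, comparing them element-by-element in order would be wrong; the meaningful comparison treats each RDD as a multiset, i.e. every element must occur the same number of times in both. The underlying idea, sketched on plain Scala collections (a hypothetical `MultisetCompare` helper, not the library's actual implementation):

    object MultisetCompare {
      // Order-insensitive comparison, as appropriate for RDD contents:
      // tally occurrences of each element and compare the tallies.
      def sameElements[A](left: Seq[A], right: Seq[A]): Boolean =
        left.groupBy(identity).map { case (k, v) => (k, v.size) } ==
          right.groupBy(identity).map { case (k, v) => (k, v.size) }
    }

    object MultisetCompareDemo extends App {
      assert(MultisetCompare.sameElements(Seq(1, 2, 2), Seq(2, 1, 2)))
      assert(!MultisetCompare.sameElements(Seq(1, 2), Seq(1, 2, 2)))
    }

RDDComparisons performs this kind of count-based comparison distributedly, which is why Example 2 only checks that the comparison result is empty.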
When to use What ?
Method 1
● For small-scale Spark applications.
● No requirement for the extended capabilities of spark-testing-base.
● For sample applications.
Method 2
● For large-scale Spark applications.
● Requirement for cluster mode or performance testing.
● For production applications.
Demo
Questions & Option[A]
References
● https://github.com/holdenk/spark-testing-base
● Effective testing for spark programs Strata NY 2015
● Testing Spark: Best Practices
Thank you