This document introduces Spark, a cluster computing framework for applications that reuse a working set of data across multiple parallel operations. Spark's core abstraction, Resilient Distributed Datasets (RDDs), lets applications share data efficiently across jobs by caching datasets in memory across the cluster. RDDs provide fault tolerance through "lineage": if a partition is lost, it can be rebuilt by re-applying the transformations that produced it to the original data. Early results show Spark outperforming Hadoop by 10x on iterative machine learning jobs and enabling interactive querying of large datasets with sub-second response times.
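The lineage mechanism can be sketched in a few lines of Python. This is a toy model for illustration only, not Spark's actual API: each derived dataset records its parent and the transformation applied, so a partition evicted or lost from the in-memory cache can be recomputed on demand.

```python
# Toy sketch of lineage-based recovery (illustrative, not Spark's real API).
class ToyRDD:
    def __init__(self, partitions):
        # Base dataset: its "lineage" is simply returning the source partition.
        self.compute = lambda i: list(partitions[i])
        self.num_partitions = len(partitions)
        self.cache = {}  # in-memory cache of computed partitions

    def map(self, fn):
        # A derived RDD records its lineage: the parent plus the transformation.
        child = ToyRDD.__new__(ToyRDD)
        child.num_partitions = self.num_partitions
        child.cache = {}
        child.compute = lambda i, parent=self: [fn(x) for x in parent.get_partition(i)]
        return child

    def get_partition(self, i):
        # Serve from cache if present; otherwise recompute from lineage.
        if i not in self.cache:
            self.cache[i] = self.compute(i)
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.get_partition(i)]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6, 8]
doubled.cache.pop(0)       # simulate losing a cached partition
print(doubled.collect())   # same result, rebuilt transparently from lineage
```

The key property this illustrates is that fault tolerance comes from recording *how* a dataset was derived rather than replicating the data itself, which is what makes in-memory caching cheap to make resilient.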