Introduction to Apache Spark

I'm currently part of a team at a customer site that is using Apache Spark to process time series data.

Apache Spark is an open source project that has received lots of attention in the last couple of years since it emerged from the Berkeley AMPLab.  Spark is now shepherded, still as open source, by a commercial entity called Databricks, which was formed by the creator of Spark and several key members of the original team.

Spark is a distributed processing system that runs on commodity servers, much like Hadoop.  However, where Hadoop’s processing model is strictly MapReduce, Spark offers a more generalized approach to distributed programming.

Spark can tackle a wider variety of data processing problems than Hadoop’s MapReduce, and these are not limited to analytics.

Here are several key reasons Spark appears to be the platform of choice for many projects that would have otherwise chosen Apache Hadoop:

  • Spark can do MapReduce but it doesn’t force the programmer into the MapReduce paradigm.  When implemented in Spark, MapReduce-style code is much more concise, achievable in a few lines, compared with dozens of lines in traditional Java on Hadoop (see the word-count sketch after this list).

  • Spark can be substantially faster than Hadoop for many types of jobs.  Spark makes good use of memory and gets closer to real-time performance, whereas MapReduce on Hadoop is batch processing; even a short traditional MapReduce job typically takes more time than most users would be willing to wait in an interactive application.  If it runs out of RAM, Spark will gracefully spill to disk, with a commensurate performance penalty.

  • Spark has an SQL API, which can be used to query very large HDFS data holdings using SQL (a sketch follows this list).

  • Spark’s model was designed from the start to handle either batch or stream processing, with few code changes required to switch between the two (see the streaming sketch after this list).

  • Spark has APIs for Scala, Java and Python.  Spark is written in the Scala language and all new features appear in the Scala API first.  Scala is new to many teams but it offers a very concise and elegant functional programming approach that makes for tight and legible code.

  • Spark has a REPL (the spark-shell), which allows fast data exploration, experimentation and code development.

  • Spark can read from HDFS, the local file system, HBase and several other systems.

  • Unlike some open source projects, Spark has a relatively large collection of code committers from academia, industry and elsewhere.  This usually indicates broad adoption and helps foster rapid development of new features and bug fixes.

  • The three leading commercial distributions of Hadoop (Cloudera, MapR and Hortonworks) have all bundled Spark into their products, and have (in at least one case I know of) added some Spark-specific tooling.
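
To make the conciseness claim concrete, here is a minimal word-count sketch in Scala using Spark’s RDD API, the rough equivalent of the canonical Hadoop MapReduce example.  It is written as you would type it into the spark-shell REPL, where the SparkContext is already available as sc; the HDFS path is just a placeholder.

    import org.apache.spark.storage.StorageLevel

    // Classic word count with the RDD API.  `sc` is predefined in spark-shell;
    // the input path below is a placeholder.
    val lines  = sc.textFile("hdfs:///data/sample.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))     // split each line into words
      .map(word => (word, 1))       // pair each word with a count of 1
      .reduceByKey(_ + _)           // sum the counts per word across the cluster

    // Keep the result in memory, spilling to disk if it does not fit.
    counts.persist(StorageLevel.MEMORY_AND_DISK)
    counts.take(10).foreach(println)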
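
Along the same lines, here is a hedged sketch of the SQL API (Spark 1.x style, using an SQLContext).  The JSON path, the table name and the column names are hypothetical stand-ins for whatever data you actually have in HDFS.

    import org.apache.spark.sql.SQLContext

    // Query semi-structured data in HDFS with plain SQL.  Path, table and
    // column names are placeholders.
    val sqlContext = new SQLContext(sc)
    val events = sqlContext.read.json("hdfs:///data/events.json")
    events.registerTempTable("events")

    val topHosts = sqlContext.sql(
      "SELECT host, COUNT(*) AS hits FROM events GROUP BY host ORDER BY hits DESC")
    topHosts.show()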
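
And to illustrate how little changes between batch and stream processing, the same word-count logic can be pointed at a live stream with Spark Streaming.  This sketch reads from a TCP socket; the host, port and 5-second batch interval are arbitrary placeholders (nc -lk 9999 is a quick way to feed it for a test).

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batch word count over a socket stream; note that the core
    // transformation chain is identical to the batch version above.
    val ssc = new StreamingContext(sc, Seconds(5))
    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    words.print()                   // print a sample of each batch's counts

    ssc.start()
    ssc.awaitTermination()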

Things to keep in mind if you choose to evaluate Spark:

  • Cluster configuration and tuning are not trivial.  Systems like this have many parameters, and knowing which “knobs to turn” to get something running optimally, or even properly, is often not clear.  Likewise, we have found no good guides that explain this art form clearly.

  • While Spark is very powerful, it’s not magic: you have to understand its RDD model very well or you’ll find your program failing at runtime with serialization and other errors (a sketch of one common pitfall follows this list).

  • Your team should include individuals with experience in open source, a willingness to learn new languages (if you choose Scala), and preferably experience with clusters.
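
To make the serialization point concrete, here is a sketch of a very common pitfall (the class and field names are made up).  Referring to a field of a non-serializable enclosing object inside an RDD closure drags the whole object into the task, and Spark rejects the job at runtime with a “Task not serializable” error; copying the value into a local val first keeps the closure small and serializable.

    import org.apache.spark.rdd.RDD

    // Hypothetical example of the "Task not serializable" trap.
    class Scaler(factor: Int) {               // not Serializable
      def scaleBad(nums: RDD[Int]): RDD[Int] =
        nums.map(_ * factor)                  // closure captures `this`; fails at runtime

      def scaleGood(nums: RDD[Int]): RDD[Int] = {
        val f = factor                        // copy the field into a local val
        nums.map(_ * f)                       // closure now captures only an Int
      }
    }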

Applications for Spark:

  • General purpose implementations of parallelizable algorithms

  • Stream processing: Near real-time processing of data streams from social media, machine data & sensors, telecom, etc.

  • Machine learning:  Classification, recommendation, naive Bayes, etc.

  • Extract-Transform-Load pipelines

  • Large scale document indexing

Learning Spark:

Books:

Training, conferences and other resources in the DC/NY corridor: