Recent developments in Big Data analytics frameworks have been exciting, and Apache Spark™ has contributed to this excitement. But is the buzz Spark generates just white noise, or does it solve important business and technical problems? In short, does it deserve a place in your IT toolkit?
Yes. Part of the answer lies in why Spark was developed in the first place. Companies needed a single processing framework that could be used throughout what’s being called the data-driven enterprise. Until the release of Spark, nothing that fit these demanding requirements was available.
A Multi-Function Framework That’s a Toolkit in Disguise
Spark is an open source cluster computing framework designed for easy use, quick results and sophisticated analytics.
Apache Spark was built for easy access: anyone with database knowledge and scripting skills in Python or Scala can use it. Spark is written in the Scala programming language and runs in the Java Virtual Machine (JVM) environment. Spark processing is built around its Resilient Distributed Dataset (RDD) application programming interface (API).
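The RDD style of chained transformations can be previewed without a cluster. The following pure-Python sketch mimics the flatMap, map, and reduceByKey chain that a PySpark word count would use; the data and variable names are illustrative, not part of any Spark API:

```python
from collections import Counter

# Pure-Python sketch of the RDD transformation style (flatMap -> map ->
# reduceByKey). No Spark installation needed; in PySpark the same word
# count would chain the equivalent methods on an RDD.

lines = ["spark makes big data easy", "big data needs spark"]

# flatMap: split each line into individual words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["spark"])  # 2
```

In Spark, each of these steps would be a lazy transformation distributed across the cluster, with the final counts materialized only when an action is called.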
Versatility, power and speed. Spark’s value lies in its versatility and power. It enables rapid Big Data application development in Scala, Java, Python and R, with community bindings for other JVM languages such as Clojure. And it supports quick analysis and model development, even on very large data sets. Spark Streaming builds near-real-time models by dividing incoming data into more manageable mini-batches and processing them in parallel. Because intermediate results stay in memory, the result is performance many times faster than approaches that write to disk between steps.
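The mini-batch idea can be sketched in plain Python, with no Spark required. The sketch below, a hypothetical illustration rather than Spark's actual scheduler, divides a stream of records into fixed-size batches and processes them in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of mini-batch parallel processing: split a stream
# of records into batches, process the batches concurrently, then combine
# the partial results. Spark Streaming applies the same micro-batch idea
# at cluster scale.

def process_batch(batch):
    # Stand-in for a real transformation, e.g. updating a model
    return sum(batch)

records = list(range(100))
batch_size = 10
batches = [records[i:i + batch_size]
           for i in range(0, len(records), batch_size)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_batch, batches))

total = sum(partials)
print(total)  # 4950
```

Each batch is independent, which is what makes the parallelism safe; Spark exploits the same property when it schedules micro-batches across executors.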
Not a one-note wonder. Spark supports more than just map and reduce functions. It’s designed as an execution engine that works both in-memory and on-disk. It holds intermediate results in memory rather than writing them to disk, which is especially useful when you work on the same dataset repeatedly. When data does not fit in memory, Spark’s operators spill to disk, so Spark can process datasets that are larger than a cluster’s total memory.
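Why keeping intermediate results in memory pays off on repeated work can be shown with a simple pure-Python cache (a sketch only; in Spark, `rdd.cache()` plays this role, and Spark spills to disk when the data no longer fits):

```python
# Simplified sketch of in-memory reuse of intermediate results: compute
# once, cache, and every later pass reads the cached copy instead of
# recomputing. This is the benefit Spark's caching provides when the
# same dataset is used many times.

_cache = {}

def expensive_transform(key, data):
    if key in _cache:                # reuse the in-memory result
        return _cache[key]
    result = [x * x for x in data]   # pretend this is a costly multi-step job
    _cache[key] = result
    return result

data = list(range(5))
first = expensive_transform("squares", data)
second = expensive_transform("squares", data)  # served from memory
print(first is second)  # True: same cached object, no recomputation
```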
Built-In Spark Libraries Deliver Plenty of Processing Muscle
Beyond the Spark core API, the Spark ecosystem includes additional libraries:
- Spark Streaming, which uses Spark Core’s fast scheduling to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on them. This approach lets application code written for batch analytics be reused for streaming analytics on a single engine.
- Spark SQL, which supports structured and relational query processing. This library exposes Spark datasets over the JDBC API and lets you run SQL-like queries on Spark data with BI and visualization tools. Spark SQL enables tech-savvy users to extract data from different formats, transform it, and expose it for ad-hoc queries.
- Spark MLlib, Spark’s scalable machine learning library. MLlib is designed to make practical machine learning scalable and easy. It includes common learning algorithms and utilities: classification, regression, clustering, collaborative filtering and dimensionality reduction, along with the underlying optimization primitives.
- Spark GraphX, the Spark API for creating graphs and performing graph-parallel computation. The library exposes fundamental operators and an optimized variant of the Pregel API. GraphX also includes a growing collection of graph algorithms and builders that simplify graph analytics tasks.
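The Spark SQL workflow described above, loading structured data, exposing it as a table, and running ad-hoc SQL against it, can be previewed with Python's built-in `sqlite3` module as a stand-in engine. This is a sketch only; in Spark SQL the analogous steps are creating a DataFrame and registering it as a temporary view on a SparkSession:

```python
import sqlite3

# Stand-in sketch of the Spark SQL workflow using the built-in sqlite3
# module: load structured rows, expose them as a table, and run an
# ad-hoc aggregation query. The table and column names are illustrative.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)],
)

rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 17.5), ('bob', 5.0)]
```

The appeal of Spark SQL is that this same query style runs at cluster scale over distributed datasets, and BI tools can issue such queries through the JDBC interface.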
But why not rely on Hadoop and Apache™ Storm? Don’t they collectively provide enough firepower to perform Big Data analysis tasks?
Apache Spark Versus Hadoop and Storm: Alternatives, Not Replacements
As a Big Data processing technology, Hadoop has been around for 10 years and has proven to be the solution of choice for high-volume batch data analysis. MapReduce is a great solution for one-pass computations, but it’s not very efficient for use cases that require multi-pass computations and iterative algorithms.
- Each step in the data processing workflow has one map phase and one reduce phase. To use this model, you must convert each use case into a MapReduce pattern. The job output between steps must be written to the distributed file system before the next step can begin, so this approach tends to be slow because of replication and disk I/O.
- Typically, Hadoop solutions include clusters that are hard to set up and manage. A complete solution also requires integrating several tools for different Big Data use cases (like Mahout for machine learning and Storm for streaming data processing).
- Spark, in contrast, lets programmers develop complex, multi-step data pipelines that use directed acyclic graph (DAG) patterns. It also supports in-memory data sharing across DAGs, so different jobs can work with the same data.
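The DAG pattern can be sketched in a few lines of pure Python. In this illustrative sketch (not Spark's scheduler), each step names its dependencies, steps run in dependency order, and intermediate results stay in an in-memory dict so multiple downstream steps reuse them without recomputation or disk round-trips:

```python
# Minimal sketch of a DAG-shaped pipeline: each step declares the steps
# it depends on, execution follows dependency order, and intermediate
# results are shared in memory across branches. Spark's DAG execution
# and in-memory sharing provide this property at cluster scale.

steps = {
    "load":    ([],                   lambda: list(range(10))),
    "evens":   (["load"],             lambda load: [x for x in load if x % 2 == 0]),
    "squares": (["load"],             lambda load: [x * x for x in load]),
    "total":   (["evens", "squares"], lambda evens, squares: sum(evens) + sum(squares)),
}

results = {}  # intermediate results shared in memory across steps

def run(name):
    if name not in results:
        deps, fn = steps[name]
        results[name] = fn(*(run(d) for d in deps))
    return results[name]

print(run("total"))  # 305: "load" ran once, feeding both branches
```

Note that `load` is computed once even though two branches consume it; in MapReduce, each branch would typically re-read that intermediate output from the distributed file system.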
Nevertheless, it’s more helpful to look at Spark as an alternative to Hadoop rather than a replacement for it. That’s because Spark was designed to provide a comprehensive, unified solution for managing different Big Data use cases and requirements.
About Vishwas: Vishwas provides solutions to big data problems such as real-time streaming data, traditional SQL vs. NoSQL, Hadoop vs. Spark, and Amazon Web Services (AWS) vs. a personal cluster. He focuses on analyzing a use case and providing the optimum solution for it. Prior to Syntelli, Vishwas was a Research Assistant at the University of North Carolina at Charlotte, where he worked on integrating Big Data with mobile devices (IoT) and deploying a language classifier on a pseudo-distributed Spark cluster. Vishwas received an M.S. in Electrical Engineering from the University of North Carolina at Charlotte. His research interests are Spark development, visual analytics and Android devices.
 Excerpts taken from http://insidebigdata.com/2015/11/09/an-insiders-guide-to-apache-spark/.