Spark Framework for Big Data Analysis
SPARK FRAMEWORK FOR BIG DATA ANALYSIS ON PSEUDO-DISTRIBUTED CLUSTERS WHITE PAPER
Many "big data" applications have been built to track page-view statistics in real time, train machine learning models, and automatically detect anomalies. These applications, however, often require a separate set of tools, such as MapReduce on Hadoop (MR), Hive, Hadoop Streaming, Weka, and Mahout, to create models and classifiers.
This white paper describes how streaming data is processed across several layers of the Spark stack: Spark Streaming, Spark SQL, and the Spark machine learning library (MLlib).
In this paper, our data scientist Vishwas Subramanian discusses:
- transforming a stream of live Twitter data into datasets,
- carrying out feature extraction,
- constructing a model and analyzing the data,
- improving the language classification, and
- applying the model back in real time on a pseudo-distributed system (LXCs).
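
To make the shape of that pipeline concrete, here is a minimal sketch in plain Python (standard library only). It is not the code from the white paper and does not use any Spark API: the hashed character-bigram features, the nearest-centroid classifier, the vector size, the sample sentences, and every function name below are assumptions chosen purely for illustration. A Python generator stands in for the live Twitter stream.

```python
import math
import zlib
from collections import defaultdict

NUM_FEATURES = 1000  # hypothetical size of the hashed feature vector

# Hypothetical, hand-written training sentences (not real tweets).
TRAINING = [
    ("en", "the quick brown fox jumps over the lazy dog"),
    ("en", "this is the best day of the week"),
    ("en", "what are you doing this weekend with the family"),
    ("es", "el rápido zorro marrón salta sobre el perro perezoso"),
    ("es", "este es el mejor día de la semana"),
    ("es", "qué vas a hacer este fin de semana con la familia"),
]


def featurize(text, num_features=NUM_FEATURES):
    """Feature extraction: hash character bigrams into a unit-length vector."""
    vec = [0.0] * num_features
    lowered = text.lower()
    for i in range(len(lowered) - 1):
        bigram = lowered[i:i + 2]
        # crc32 is used instead of hash() so buckets are stable across runs.
        vec[zlib.crc32(bigram.encode("utf-8")) % num_features] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def train_centroids(labelled_texts):
    """Model construction: average the feature vectors of each language."""
    sums = {}
    counts = defaultdict(int)
    for label, text in labelled_texts:
        vec = featurize(text)
        total = sums.setdefault(label, [0.0] * NUM_FEATURES)
        for i, v in enumerate(vec):
            total[i] += v
        counts[label] += 1
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}


def classify(text, centroids):
    """Return the language whose centroid is most similar to the text."""
    vec = featurize(text)
    return max(centroids,
               key=lambda label: sum(a * b for a, b in zip(vec, centroids[label])))


def classify_stream(stream, centroids):
    """Apply the trained model to each incoming message as it arrives."""
    for text in stream:
        yield text, classify(text, centroids)


if __name__ == "__main__":
    model = train_centroids(TRAINING)
    incoming = iter(["this is the best game of the year",
                     "este es el mejor partido del año"])
    for message, language in classify_stream(incoming, model):
        print(language, "<-", message)
```

In a Spark deployment, each of these stages would presumably map onto the corresponding layer of the stack (ingestion in Spark Streaming, dataset handling in Spark SQL, feature extraction and model training in MLlib), but that mapping is an assumption here and the details are left to the paper itself.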