Big data has been around for several years now. If you want to escape the big data drumbeat, we understand. But Hortonworks hosted a series of 18 engaging and informative open house events designed to spice up the data message.

On February 4th, the final event landed in Charlotte, North Carolina. Customer evidence was the heart of the proceedings. Hortonworks customers told how Open Enterprise Hadoop (OEH) transformed their businesses, and OEH partners explained how Hortonworks technologies enrich the value of their data operations.

Changing architecture of enterprise big data analytics

Big data architecture is evolving rapidly, with many new components in the marketplace, each catering to a specific data need. For handling real-time data feeds, we have Kafka, and for distributed processing of streaming data we have Storm. Spark is increasingly being used to receive high-velocity data and process it. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. HDFS provides the distributed storage layer that the computation runs on. The processed data is then stored in Teradata servers, where it is pulled out for reports or visualizations with Tableau. Zeppelin, a web-based interactive data analytics and visualization notebook (currently still incubating), is used to perform analytics.

Insurance and transportation use cases

The Hadoop system keeps enabling new applications that were virtually impossible to envision without the distributed processing platform it offers. For example, pay as you drive, pay how you drive and mile-based auto insurance are different names for an old product, car insurance, that’s measured in a new way. In this new approach, costs depend on the type of vehicle you drive, the time you’re behind the wheel, how far you go, and how you behave behind the wheel.
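
To make the pricing idea concrete, here is a minimal sketch of how those four factors could feed a monthly bill. Everything in it is an assumption for illustration: the `MonthlyUsage` fields, the `VEHICLE_FACTOR` table and all the rates are made up, and no insurer's actual formula is implied.

```python
from dataclasses import dataclass

# Hypothetical multipliers per vehicle type (not taken from any real insurer).
VEHICLE_FACTOR = {"sedan": 1.0, "suv": 1.2, "truck": 1.5}


@dataclass
class MonthlyUsage:
    vehicle_type: str    # key into VEHICLE_FACTOR
    hours_driven: float  # time behind the wheel
    miles_driven: float  # how far the driver went
    hard_brakes: int     # stand-in for "how you behave behind the wheel"


def monthly_premium(usage, base_fee=30.0, per_hour=0.5,
                    per_mile=0.05, per_hard_brake=2.0):
    """Return an illustrative premium: a vehicle-scaled base fee plus usage charges."""
    cost = base_fee * VEHICLE_FACTOR[usage.vehicle_type]
    cost += usage.hours_driven * per_hour
    cost += usage.miles_driven * per_mile
    cost += usage.hard_brakes * per_hard_brake
    return round(cost, 2)


if __name__ == "__main__":
    print(monthly_premium(MonthlyUsage("sedan", 10, 400, 3)))
```

A linear formula like this is only the simplest possible choice; the point is that each charge needs per-driver data, which is where the storage and processing load comes from.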

Driver data is stored and analyzed to determine the cost of the insurance. As you might expect, each driver generates a lot of data. Progressive Auto Insurance uses Hortonworks as a platform to manage the massive data sets.

And let’s not forget the truckers. During the technology demonstration with Hortonworks, there was a presentation on using real-time analytics to monitor driver behavior behind the wheel. The program uses drivers’ driving histories, weather data and current driving conditions to predict which drivers are more likely to violate the law. The predictive analytics flags those drivers and alerts them before they become violators.
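
The presentation did not spell out the model behind those predictions, so the following is only a rough sketch of the flag-and-alert step. It swaps in a hand-written scoring rule where the real system would use a trained predictive model; the `DriverSnapshot` fields, the weights and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DriverSnapshot:
    driver_id: str
    past_violations: int         # from the driver's history
    speed_over_limit_mph: float  # current driving conditions
    bad_weather: bool            # current weather


def risk_score(snapshot):
    """Combine history, speed and weather into a score between 0 and 1.

    The weights are invented for illustration, not learned from data.
    """
    score = 0.2 * min(snapshot.past_violations, 3)
    score += 0.03 * max(snapshot.speed_over_limit_mph, 0.0)
    if snapshot.bad_weather:
        score += 0.2
    return min(score, 1.0)


def drivers_to_alert(snapshots, threshold=0.7):
    """Return the IDs of drivers whose score reaches the alert threshold."""
    return [s.driver_id for s in snapshots if risk_score(s) >= threshold]


if __name__ == "__main__":
    fleet = [
        DriverSnapshot("truck-17", past_violations=2, speed_over_limit_mph=10, bad_weather=True),
        DriverSnapshot("truck-42", past_violations=0, speed_over_limit_mph=5, bad_weather=False),
    ]
    print(drivers_to_alert(fleet))
```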

What’s new in the Apache Hadoop ecosystem

Here are a few newcomers for Apache Hadoop users.

Data security with an emphasis on governance. Apache Knox and Apache Ranger make up the stack used for authentication and data access security.

  • The Ranger framework enables users to monitor and manage comprehensive data security across the Hadoop platform.
  • Apache Knox extends Hadoop’s REST/HTTP services by encapsulating Kerberos within the cluster, providing perimeter security.

Users can get answers to these questions:

  • Who accessed the data?
  • Which data was accessed?
  • Where, when and how was the data accessed?
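
As a rough illustration of what answering those questions looks like, here is a small sketch that filters a list of audit records. The `AuditEvent` fields and helper names are assumptions made for this example; they are not Ranger's actual audit schema or API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditEvent:
    user: str            # who accessed the data
    resource: str        # which data was accessed
    client_ip: str       # where the access came from
    timestamp: datetime  # when it happened
    access_type: str     # how it was accessed, e.g. "read" or "write"


def who_accessed(events, resource):
    """Return the sorted, de-duplicated users that touched a resource."""
    return sorted({e.user for e in events if e.resource == resource})


def accesses_by(events, user):
    """Return (resource, client_ip, timestamp, access_type) tuples for one user."""
    return [(e.resource, e.client_ip, e.timestamp, e.access_type)
            for e in events if e.user == user]


if __name__ == "__main__":
    log = [
        AuditEvent("alice", "/data/claims", "10.0.0.5", datetime(2016, 2, 4, 9, 30), "read"),
        AuditEvent("bob", "/data/claims", "10.0.0.9", datetime(2016, 2, 4, 10, 0), "write"),
        AuditEvent("alice", "/data/trips", "10.0.0.5", datetime(2016, 2, 4, 11, 15), "read"),
    ]
    print(who_accessed(log, "/data/claims"))
    print(accesses_by(log, "alice"))
```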

Data Governance and Metadata framework:
Apache Atlas is the program of choice. Apache Atlas governs components that are part of the native Hadoop stack.

  • Works as a good citizen and should snap into the existing framework.
  • Has a strong REST-based API that provides flexible access to data.
  • Processes metadata from components within or outside the Hadoop system.
  • Gets end-to-end lineage from the source to the system, with graph view.
  • Can define any taxonomy model.
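
End-to-end lineage is essentially a graph walk, which can be sketched in a few lines. This is not the Atlas API; the `lineage` dictionary and the dataset names are hypothetical, and the sketch only shows how every upstream source of a dataset could be collected.

```python
def upstream(lineage, dataset):
    """Return every dataset that feeds, directly or indirectly, into `dataset`.

    `lineage` maps a dataset name to the list of datasets it was derived from.
    """
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            # Keep walking back towards the original sources.
            stack.extend(lineage.get(node, []))
    return sorted(seen)


if __name__ == "__main__":
    lineage = {
        "premium_report": ["cleansed_trips"],
        "cleansed_trips": ["raw_gps_feed", "raw_weather_feed"],
    }
    print(upstream(lineage, "premium_report"))
```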

Data lifecycle management.  Apache Falcon is the framework of this platform.

  • Lifecycle steps. The lifecycle uses standard steps: ingest, cleanse, transform, mirror, archive and remove.
  • Data replication. Replication across on-premise and cloud-based storage targets.
  • System support. The system supports Amazon S3 and Microsoft Azure.
  • Data retention. Users can retain data for up to 5 years.
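
To show how a retention rule fits into that lifecycle, here is a minimal sketch. It is not Falcon's configuration format; the function name and the choice to count 5 years as 5 × 365 days are assumptions made for the example.

```python
from datetime import date, timedelta

# The lifecycle steps named above, in order.
LIFECYCLE_STEPS = ("ingest", "cleanse", "transform", "mirror", "archive", "remove")

# Assumption: treat "5 years" as 5 * 365 days for simplicity.
DEFAULT_RETENTION_DAYS = 5 * 365


def is_past_retention(created, today, retention_days=DEFAULT_RETENTION_DAYS):
    """Return True when a dataset is older than its retention window."""
    return today - created > timedelta(days=retention_days)


if __name__ == "__main__":
    today = date(2016, 2, 4)
    print(is_past_retention(date(2010, 1, 1), today))  # older than five years
    print(is_past_retention(date(2015, 1, 1), today))  # still inside the window
```

A dataset that fails this check would be a candidate for the final "remove" step.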

Interactive Analytics: Apache Zeppelin, a web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala, Python and more.


Interested in learning more about Hadoop and how it can be applied to grow your business?

Let’s chat!

Contact us today!