Understanding Apache Spark in three quick questions.
1) What is Apache Spark?
According to Apache, it is an open-source cluster-computing framework for fast and flexible large-scale data analysis. To put it more simply, it's a scalable data processing engine, in the words of Loraine Lawson of IT Business Edge. It works with the filesystem to distribute your data across the cluster and processes it in parallel, following a set of instructions from an application written by a developer.
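The distribute-and-process-in-parallel idea can be pictured with an ordinary map/reduce word count. Here is a minimal single-machine sketch in plain Python (the partition names and helper functions are illustrative, not Spark's API); Spark's contribution is running these same steps across the machines of a cluster:

```python
from functools import reduce

# Toy "dataset" split into partitions, the way Spark splits a file
# across the nodes of a cluster.
partitions = [
    ["spark is fast", "spark scales"],
    ["hadoop stores data"],
]

# Map step: each partition is processed independently
# (on a cluster, these would run in parallel on different nodes).
def count_words(lines):
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Reduce step: merge the per-partition results into one answer.
def merge(acc, part):
    for word, n in part.items():
        acc[word] = acc.get(word, 0) + n
    return acc

totals = reduce(merge, map(count_words, partitions), {})
print(totals["spark"])  # 2
```

The application supplies the instructions (`count_words`, `merge`); the engine decides where the partitions live and which machine runs each step.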
2) What’s the use of Apache Spark?
Databricks, the company founded by the creators of Apache Spark, lists the following applications:
1. Data integration and ETL
2. Interactive analytics or business intelligence
3. High-performance batch computation
4. Machine learning and advanced analytics
5. Real-time stream processing
Most users today run data integration and ETL on MapReduce, along with batch computation, machine learning, and batch analytics. Spark makes these workloads much faster, and it also makes interactive analytics and BI practical; the same goes for real-time stream processing. Although some of these things could not run at an acceptable rate on MapReduce, Spark optimizes them: old workloads run in faster ways, and there is room for innovation too.
3) Will Apache Spark replace Hadoop?
It's not the most likely scenario: Hadoop is centered on storage, while Spark is about processing data. What Spark will most likely replace is MapReduce, the processing model that ships with Hadoop. Apache reports that Spark runs programs up to 100 times faster than MapReduce in memory, or 10 times faster on disk. Databricks used Spark to sort 100 terabytes of records in 23 minutes in 2014. That's a significant difference in a world where speed is a priority.
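A large part of that speed gap comes from where intermediate results live: MapReduce writes them to disk between jobs, while Spark can keep them in memory and reuse them (this is what `cache()` on an RDD or DataFrame expresses). A rough single-machine sketch of the idea, with a `sleep` standing in for a heavy distributed stage:

```python
import time

# Simulated expensive stage. In MapReduce, its output would be written
# to disk and re-read by every downstream job; Spark can keep it in memory.
def expensive_transform(data):
    time.sleep(0.1)  # stand-in for a heavy distributed computation
    return [x * 2 for x in data]

data = list(range(5))

# Without caching: each downstream computation repeats the work.
start = time.perf_counter()
total = sum(expensive_transform(data))
maximum = max(expensive_transform(data))
uncached_time = time.perf_counter() - start

# With caching: compute once, reuse the in-memory result.
start = time.perf_counter()
cached = expensive_transform(data)
total, maximum = sum(cached), max(cached)
cached_time = time.perf_counter() - start

print(cached_time < uncached_time)  # True: the cached run did the work once
```

The more downstream steps reuse the same intermediate data (as in iterative machine learning or interactive analytics), the larger the advantage grows.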