Spark
Apache Spark is an open-source distributed computing framework. It was originally developed in 2009 at the University of California, Berkeley's AMPLab and later donated to the Apache Software Foundation. It belongs to a broader ecosystem of tools, alongside Apache Hadoop and other open-source projects used in today's analytics community.
Advantages
- Lightning-fast processing – Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk
- Support for sophisticated analytics – Spark supports SQL queries, streaming data, and complex analytics such as graph algorithms and machine learning, and users can combine all of these capabilities in a single workflow
- Real-time Stream Processing
- Ability to integrate with Hadoop and existing Hadoop Data
- Active and expanding community
Disadvantages
- Data arriving out of time order is a problem for batch-based processing – real-world streams are often of poor quality: records may be missing and events can arrive out of time order
- Batch length restricts window-based analytics – window durations are tied to the batch interval, which limits how windows can be defined
- It offers limited per-server performance by modern stream-processing standards, and instead scales out across large numbers of servers to achieve overall system throughput
- Writing stream processing operations from scratch is not easy – Spark Streaming offers only a limited set of built-in stream functions
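The out-of-order-data problem above can be illustrated without Spark at all. The following is a minimal pure-Python sketch (all function names are hypothetical, chosen for this illustration) contrasting windowing by arrival order, which is effectively what a batch-based engine does, with windowing by each event's own timestamp:

```python
from collections import defaultdict

def tumbling_windows_by_arrival(events, window_size):
    """Group events into tumbling windows by arrival position,
    mimicking a batch system that windows on processing order."""
    windows = defaultdict(list)
    for pos, (event_time, value) in enumerate(events):
        windows[pos // window_size].append((event_time, value))
    return dict(windows)

def tumbling_windows_by_event_time(events, window_size):
    """Group events into tumbling windows by their own timestamps,
    which is what a true event-time engine would do."""
    windows = defaultdict(list)
    for event_time, value in events:
        windows[event_time // window_size].append((event_time, value))
    return dict(windows)

# (event_time, value) pairs; the event stamped 1 arrives late, after 2.
events = [(0, "a"), (2, "b"), (1, "late"), (3, "c")]

# With windows of size 2, arrival-based grouping puts the late event
# in the wrong window; event-time grouping places it where it belongs.
print(tumbling_windows_by_arrival(events, 2))    # late event in window 1
print(tumbling_windows_by_event_time(events, 2)) # late event in window 0
```

The late event ends up aggregated with the wrong neighbors under arrival-based windowing, which is why out-of-order streams are troublesome for batch-oriented processing.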
Components
- Types of cluster managers:
- Standalone: a simple cluster manager that makes it easy to set up a cluster
- Apache Mesos: a general cluster manager that can run service applications
- Hadoop YARN: the resource manager in Hadoop 2.0
- Shipping code to the cluster – dynamically adding new files to be sent to executors
- Monitoring – offers information about running executors and tasks
- Job scheduling – resource allocation can be controlled both across applications and within an application
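The cluster manager is chosen at launch time through the `--master` URL passed to `spark-submit`. A sketch of typical invocations is shown below; the host names, ports, and `app.py` are placeholders:

```shell
# Standalone cluster manager (host and port are placeholders)
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py

# Hadoop YARN (cluster location is read from the Hadoop configuration)
spark-submit --master yarn --deploy-mode cluster app.py

# Local mode for development, using 4 worker threads
spark-submit --master local[4] app.py
```

The same application code runs unchanged under any of these managers; only the submission command differs.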