Introduction to Apache Spark

Updated: Aug 24, 2019

What is Apache Spark?

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is used for iterative computing, interactive SQL and high speed stream processing. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.

Features of Apache Spark

Speed − Spark helps to run an application in the Hadoop cluster, up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk. It stores the intermediate processing data in memory.

Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes up with 80 high-level operators for interactive querying.

Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.

Runs Everywhere: Integrates well with the Hadoop ecosystem and other data sources

(HDFS, Amazon S3, Hive, HBase, Cassandra, etc.). It Can run clusters managed by

Hadoop YARN or Apache Mesos, and can also run standalone.

Apache Spark follows a master/slave architecture with two daemons and a cluster manager :

Master Daemon – (Master/Driver Process)

Worker Daemon –(Slave Process)

A spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors run their individual Java processes and users can run them on the same horizontal spark cluster or on separate machines i.e. in a vertical spark cluster or in mixed machine configuration.

Role of Driver in Spark Architecture:

Spark Driver – Master Node of a Spark Application

It is the central point and the entry point of the Spark Shell (Scala, Python, and R). The driver program runs the main () function of the application and is the place where the Spark Context is created. Spark Driver contains various components – DAGScheduler, TaskScheduler, BackendScheduler and BlockManager responsible for the translation of spark user code into actual spark jobs executed on the cluster. The driver program that runs on the master node of the Spark cluster schedules the job execution and negotiates with the cluster manager. It translates the RDD’s into the execution graph and splits the graph into multiple stages. Driver stores the metadata about all the Resilient Distributed Databases and their partitions. Cockpits of Jobs and Tasks Execution -Driver program converts a user application into smaller execution units known as tasks. Tasks are then executed by the executors i.e. the worker processes which run individual tasks. Driver exposes the information about the running spark application through a Web UI at port 4040.

Role of Executor in Spark Architecture

Executor is a distributed agent responsible for the execution of tasks. Every spark applications has its own executor process. Executors usually run for the entire lifetime of a Spark application and this phenomenon is known as “Static Allocation of Executors”. However, users can also opt for dynamic allocations of executors wherein they can add or remove spark executors dynamically to match with the overall workload. Executor performs all the data processing. Reads from and writes data to external sources. Executor stores the computation results in data in-memory, cache or on hard disk drives.

Interacts with the storage systems.

Role of Cluster Manager in Spark Architecture

An external service responsible for acquiring resources on the spark cluster and allocating them to a spark job. There are 3 different types of cluster managers a Spark application can leverage for the allocation and deallocation of various physical resources such as memory for client spark jobs, CPU memory, etc. Hadoop YARN, Apache Mesos or the simple standalone spark cluster manager either of them can be launched on-premise or in the cloud for a spark application to run.

Choosing a cluster manager for any spark application depends on the goals of the application because all cluster managers provide different set of scheduling capabilities. To get started with apache spark, the standalone cluster manager is the easiest one to use when developing a new spark application.

How to Run Apache Spark Application on a cluster

  • Using spark-submit, the user submits an application.

  • it invokes the main() method that the user specifies in spark-submit. It also launches the driver program.

  • The driver program asks for the resources to the cluster manager that we need to launch executors.

  • The cluster manager launches executors on behalf of the driver program.

  • The driver process runs with the help of the user application. Based on the actions and transformation on RDDs, the driver sends work to executors in the form of tasks.

  • The executors process the task and the result sends back to the driver through the cluster manager.

Spark Data Structures - RDD

Resilient Distributed Dataset:

RDDs are Immutable and partitioned collection of records, which can only be created by coarse-grained operations such as map, filter, group by, etc. By coarse-grained operations, it means that the operations are applied to all elements in datasets. RDDs can only be created by reading data from stable storage such as HDFS or by transformations on existing RDDs.

Now, How Is That Helping for Fault Tolerance?

Since RDDs are created over a set of transformations , it logs those transformations, rather than actual data.Graph of transformations to produce one RDD is called as Lineage Graph.

Spark RDD Lineage Graph

In case of we lose some partition of RDD , we can replay the transformation on that partition in lineage to achieve the same computation, rather than doing data replication across multiple nodes.This characteristic is biggest benefit of RDD , because it saves a lot of efforts in data management and replication and thus achieves faster computations

Apache Spark RDD supports two types of Operations-




Transformations are kind of operations which will transform your RDD data from one form to another. And when you apply this operation on any RDD, you will get a new RDD with transformed data (RDDs in Spark are immutable). Operations like map, filter, flatMap are transformations.

There are two types of transformations:

Narrow transformation — In Narrow transformation, all the elements that are required to compute the records in single partition live in the single partition of parent RDD. A limited subset of partition is used to calculate the result. Narrow transformations are the result of map(), filter().

Wide transformation — In wide transformation, all the elements that are required to compute the records in the single partition may live in many partitions of parent RDD. The partition may live in many partitions of parent RDD. Wide transformations are the result of groupbyKey and reducebyKey.


Transformations create RDDs from each other, but when we want to work with the actual dataset, at that point action is performed. When the action is triggered after the result, new RDD is not formed like transformation. Thus, Actions are Spark RDD operations that give non-RDD values. The values of action are stored to drivers or to the external storage system. It brings laziness of RDD into motion.

An action is one of the ways of sending data from Executer to the driver.Executors are agents that are responsible for executing a task. While the driver is a JVM process that coordinates workers and execution of the task.

Spark Dataframe

In Spark, Dataframes are distributed collections of data, organized into rows and columns. Each column in a Dataframe has a name and an associated type. Dataframes are similar to traditional database tables, which are structured and concise. We can say that, Dataframes are relational databases with better optimization techniques.

Spark Dataframes can be created from various sources, such as hive tables, log tables, external databases, or existing RDDs. They allow processing of huge amounts of data.

Features of Dataframes

The main reason why Dataframes were created is to overcome the difficulties faced while using RDDs. Some of the features of Dataframes are:

Use of Input Optimization Engine: Dataframes make use of input optimization engines like Catalyst Optimizer to process the data efficiently. We can use the same engine for all Python, Java, Scala, and R Dataframe APIs.

Handling Structured Data: Dataframes provide a schematic view of data. The data has some meaning to it when it is being stored.

Custom Memory Management: In RDDs, the data is stored in memory, whereas Dataframes store data off-heap (outside the main Java Heap space, but still inside RAM), which in turn reduces the garbage collection overload.

Flexibility: Dataframes, like RDDs, can support various formats of data, such as csv, Cassandra, etc.

Scalability: Dataframes can be integrated with various other Big Data tools, and they allow processing megabytes to petabytes of data at once.

Creating Dataframes

There are many ways to create Dataframes. Here are three of the most commonly used methods used to create Dataframes:

1).Creating Dataframes from JSON Files

JSON, or JavaScript Object Notation, is a type of file that stores simple data structure objects in the .JSON format. It is mainly used to transmit data between web servers. This is how a simple .JSON file looks like:

The above JSON is a simple employee database file which contains two records/rows.

When it comes to Spark, .JSON files which are being loaded are not the typical JSON file. You cannot load a normal JSON file into a Dataframe. The JSON file which you want to load should be of the format given below:

JSON files can be loaded onto the Dataframes using the read.JSON function, with the file name you want to upload it

Example: Here, we are loading an Olympic medal count sheet onto a Dataframe. There are 10 fields in total. The function printSchema() prints the schema of the Dataframe.

Creating Dataframes from Existing RDDs

Dataframes can also be created from the existing RDDs. First, you create an RDD and then load that RDD on to a Dataframe using the createDataframe(Name_of_the_rdd_file) function. In the below figure, we are creating an RDD first, which contains numbers from 1 to 10 and their cubes and then loading that RDD onto a Dataframe.

3).Creating Dataframes from .csv Files

You can also create Dataframes by loading .csv files. Here is an example for loading a csv file onto a Dataframe.

Spark Datasets

Datasets are an extension of the Dataframe APIs in Spark. In addition to the features of Dataframes and RDDs, datasets provide various other functionalities. Datasets provide an object-oriented programming interface, which includes the concepts of classes and objects.

Datasets were introduced when Spark 1.6 was released. Datasets provide the convenience of RDDs, static typing of Scala, and the optimization features of Dataframes.

Datasets are a collection of Java Virtual Machine (JVM) objects, which uses Spark’ Catalyst Optimizer to provide efficient processing.

Components of spark :

Apache Spark Core

Spark Core consists of a general execution engine for spark platform that all required by other functionality which is built upon as per the requirement approach. It provides in-built memory computing and referencing data sets stored in external storage systems.

Spark allows the developers to write code quickly with the help of a rich set of operators. While it takes a lot of lines of code, it takes fewer lines to write the same code in Spark Scala.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new set of data abstraction called Schema RDD, which provides support for both the structured and semi-structured data.

Spark Streaming

This spark component allows Spark to process real-time streaming data. It provides an API to manipulate data streams that match with RDD API. It allows the programmers to understand the project and switch through the applications that manipulate the data and giving outcomes in real-time. Similar to Spark Core, Spark Streaming strives to make the system fault-tolerant and scalable.

MLlib (Machine Learning Library)

Apache Spark is equipped with a rich library known as MLlib. This library contains a wide array of machine learning algorithms, classification, clustering and collaboration filters, etc. It also includes a few lower-level primitives. All these functionalities help Spark scale out across a cluster.


Spark also comes with a library to manipulate the graphs and performing computations, called GraphX. Just like Spark Streaming and Spark SQL, GraphX also extends Spark RDD API which creates a directed graph. It also contains numerous operators in order to manipulate the graphs along with graph algorithms.

In total, we’ve found over 3,000 companies using Apache Spark, including top players like Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, Databricks and Amazon. Spark made waves in the past year as the Big Data product with the shortest learning curve, popular with SMBs and Enterprise teams alike. Another reason for its prominence could be Spark’s embrace of several different programming languages, making production and scale much easier for diverse dev teams.

About Data Science Authority

Data Science Authority is a company engaged in Training, Product Development and Consulting in the field of Data science and Artificial Intelligence. It is built and run by highly qualified professionals with more than 10 years of working experience in Data Science. DSA’s vision is to inculcate data thinking in to individuals irrespective of domain, sector or profession and drive innovation using Artificial Intelligence.

Data Science Authority | Data Science Training in Hyderabad



  • Facebook Social Icon
  • Twitter Social Icon
  • LinkedIn Social Icon


Gachibowli, Hyderabad, Telangana, India

©2020  Data Science Authority