This Apache Spark tutorial explains what Apache Spark is, including the installation process and writing Spark applications with examples:
We believe that learning the basics and core concepts correctly is the foundation for a good understanding of any subject, especially if you are new to it.
Here, we will give you the idea and the core concept of Apache Spark in simple terms, which will help you understand the complex operations and computations in Spark easily.
After going through this tutorial, you will know what Spark is, why we need Spark, Spark architecture, how to execute a Spark application, and the essentials to write a Spark application.
So we invite all of you to read the tutorial till the end, learn the story behind Apache Spark, and get to know it.
What is Apache Spark
Apache Spark is a data processing engine for distributed environments. Assume you have a large amount of data to process. By writing an application using Apache Spark, you can complete that task quickly.
Nowadays, a large amount of data or big data is stored in clusters of computers. If you want to process data in such an environment, you can get the help of Apache Spark. Spark can work on a cluster.
Spark is easy to use: you need not worry about the cluster of computers you are working on. You can simply work with it as if you were using a single machine.
Spark has APIs for both Scala and Python. You can choose the language according to your preference. It also provides packages and libraries to enhance its functionality. Using these, you can run SQL, machine learning, R, and graph computations in the Spark environment.
Apache Spark is a better alternative to Hadoop’s MapReduce, which is also a framework for processing large amounts of data. Apache Spark is ten to a hundred times faster than MapReduce, and unlike MapReduce, Spark can process data both in real-time and in batches.
=> Visit Official Spark Website
History of Big Data
Big data
Nowadays, big data is a very popular term. Everyone is searching for it and its related topics, and people are more aware of the term than ever before.
So what is big data? As its name suggests, the term “big data” refers to huge amounts of data.
Here, we are talking about data files with millions of records in them, and hundreds of thousands of such data files.
The development of technology, especially the internet, caused data to grow exponentially. With that growth, storing the data also became a problem: this much data cannot be stored on a normal computer, because it doesn’t have enough resources to do so.
Therefore, the world needed a better solution.
If you are running out of storage capacity on your desktop or laptop, what would you do? Simple: you attach external storage to your machine. That was the approach first taken to solve this storage problem. The storage capacity of the machines was scaled up. This approach is known as vertical scaling, scaling up the capacity of a single system.
But when data is growing rapidly day by day, it is hard to keep scaling up vertically. The storage capacity had to be scaled up again and again within a short space of time, and data processing on such machines was time-consuming as well.
Then came the horizontal scaling approach: instead of scaling up a single large machine, a network of smaller machines is used. It was easy to implement: get a few small machines and connect them as a network to pool their storage. If you run out of storage, get another machine and connect it to the network. It was as simple as that.
Handling a network of machines is difficult, though. These horizontally scaled systems needed a mechanism to handle the storage as a single unit and a mechanism to process the data in such an environment. Hence, Hadoop arrived.
Hadoop
Hadoop is a framework to develop data processing applications in a distributed environment.
In the beginning, it comprised 2 main components:
- HDFS (Hadoop Distributed File System)
- MapReduce
Later, the YARN resource manager was added to Hadoop as the resource management layer.
These three main components help Hadoop address the filesystem and processing issues in distributed environments.
HDFS
HDFS stands for Hadoop Distributed File System. It is a network-based file system.
With HDFS, we don’t need to worry about every single computer in the cluster. HDFS helps us to see the distributed system as a single file system. We can create or store big data files in the system without worrying about the cluster of machines.
MapReduce
MapReduce is a programming model used to perform parallel processing. It helps us easily process data that is distributed across a cluster, and it allows us to process data in batches.
Arrival of Spark
Even though MapReduce is a good programming model for processing data in a cluster, it has a few drawbacks.
These are listed below:
- First, programmers found it hard to use. Simple batch-processing tasks are manageable, but as tasks become more complex, so does the effort of expressing them in the MapReduce model.
- Also, as the need for real-time processing arose, a way to cope with it was needed. Since MapReduce can only do batch processing, a different solution was required.
Then Apache Spark was introduced: a framework that can do both batch processing and real-time processing, much faster than MapReduce.
Now we know Spark is a data processing engine, and we needed Spark to overcome the drawbacks of MapReduce. Let’s learn how Spark works.
Spark Architecture
Apache Spark uses master-slave architecture.
Just like in the real world, the master gets the job done by using its slaves. This means that you have a master process and multiple slave processes which are controlled by that dedicated master process.
The master manages, maintains, and monitors the slaves, while the slaves are the actual workers who perform the processing tasks. You tell the master what needs to be done, and the master takes care of the rest, completing the task using its slaves.
In the Spark environment, master nodes are referred to as drivers, and slaves are referred to as executors.
Drivers
The driver is the master process in the Spark environment. It holds all the metadata about the slaves or, in Spark terms, the executors.
The driver is responsible for analyzing the work, distributing and scheduling it across the executors, and monitoring them.
Executors
Executors are the slave processes in the Spark environment. They perform the data processing which is assigned to them by their master.
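As a rough sketch, the application that builds the Spark session becomes the driver, and configuration settings tell it how many executors to ask for. The values below are only placeholders and only take effect when the application is submitted to a real cluster manager such as YARN:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("driver-executor-demo")
  .config("spark.executor.instances", "3")   // ask the cluster manager for 3 executor processes
  .config("spark.executor.memory", "2g")     // memory for each executor
  .config("spark.executor.cores", "2")       // CPU cores for each executor
  .getOrCreate()                             // the process running this code is the driver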
Executing a Spark program
To execute a Spark application, first, you need to install Spark on your machine or in your cluster.
According to the Spark documentation, the only thing you need as a prerequisite to installing Spark is Java. Install Java on your computer and you are ready to install Spark on your computer.
However, when you are developing real-world Spark applications with more complex computations, you might need to install Scala or Python on your computer, compatible with the Spark version you are going to use.
Spark applications are written in Scala or Python, and you have the luxury of choosing whichever of the two you are comfortable with.
How to Install Spark
Installing Apache Spark is very easy. Spark supports both Windows and UNIX-like systems, such as Linux, Mac OS, etc. You just have to download Apache Spark and extract the files to your local machine.
You can download the latest Spark version from the official Spark website, pre-built with Hadoop. If you already have a Hadoop cluster, there are Hadoop-free binaries as well.
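Once the archive is downloaded, extracting it is enough to have a working installation. For example, the commands might look like this (the file name below is the 2.4.0 release pre-built for Hadoop 2.7; use whichever release you downloaded):

tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
cd spark-2.4.0-bin-hadoop2.7
./bin/spark-shell        # starts the interactive Spark shell described below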
There are two main ways to execute Spark applications.
- Using Spark shell
- Using the Spark submit method
#1) Spark shell
Spark shell is an interactive way to execute Spark applications. Just like in the Scala shell or Python shell, you can interactively execute your Spark code on the terminal.
It is a better way to learn Spark as a beginner. Spark-shell prints information about your last action as you enter it into the shell. This makes programmers’ lives easy. They use this to test their lines of code before adding them to the Spark application.
For example, if you are not sure which datatype a function or method will return when you execute it on a variable, you can just try it in the Spark shell, and it will show you the return datatype.
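A quick session might look like this sketch (the numbers are only for illustration, and the exact output format depends on your Spark version):

$ spark-shell
scala> val numbers = spark.range(1, 6)
numbers: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> numbers.count()
res0: Long = 5

The shell immediately tells you that spark.range returns a Dataset and that count() returns a Long, so you can confirm the datatypes before adding the lines to your application.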
#2) Spark-submit
In a real-world scenario, you cannot execute programs interactively. More often than not, you have to process data periodically or in real-time using jar files, which contain the code to process data.
Spark-submit is the method used to run application jars in a Spark environment. Just like running a Java jar file on a terminal, you can run Spark applications using spark-submit.
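A typical invocation might look like the following sketch; the class name, master URL, and jar path are placeholders you would replace with your own (myFirstSparkApplication is the example application written later in this tutorial):

spark-submit \
  --class myFirstSparkApplication \
  --master local[*] \
  /path/to/your-application.jar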
Writing a Spark application
RDDs and Dataframes
There are 3 main data structures in Spark:
- RDDs
- Dataframes
- Data sets
These data structures in Spark help to store the extracted data until it is loaded into a data source after processing.
Everything in Spark is an RDD. Dataframes and data sets are abstractions built on top of RDDs.
RDD stands for Resilient Distributed Dataset. When you load the data into a Spark application, it creates an RDD that stores the loaded data.
Spark takes the data file, splits it into small chunks, and distributes those chunks across the cluster. This distributed collection of data is referred to as an RDD.
They are immutable, meaning that you can only create an RDD by loading data into a Spark application or transforming an RDD into another one. You can’t change or update an existing RDD.
As mentioned earlier, data frames are an abstraction of RDDs. They handle structured data. When you load data into a data frame, it creates a table structure, like a database table, so you can easily access the structured data.
So when you load data into a Spark application, you have a choice: you can pick the data structure in which to store your data, depending on your requirements.
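As a rough sketch, assuming an active Spark session named spark, the same file could be loaded either way:

// Load the file as an RDD of raw text lines
val studentsRDD = spark.sparkContext.textFile("/home/user/desktop/students.csv")

// Load the same file as a data frame with a table-like structure
val studentsDF = spark.read.csv("/home/user/desktop/students.csv")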
To process the extracted data, you need to execute functions on them.
Transformation and Action functions
There are 2 types of functions in Spark.
- Some are used to transform data from one form into another.
- Some are used to collect the transformed data into the driver process from the executors.
Functions that are used to transform data are called transformations, and functions that are used to collect the distributed, transformed results and create non-distributed data are called actions.
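For example, in the sketch below (assuming a Spark session named spark and made-up numbers), map and filter are transformations, while count is an action that triggers the actual computation and returns a result to the driver:

val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = numbers.map(_ * 2)        // transformation: only builds the plan, nothing runs yet
val bigOnes = doubled.filter(_ > 4)     // transformation: still nothing runs
val howMany = bigOnes.count()           // action: executes on the executors and returns 3 to the driver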
In simple terms, what a Spark application does is,
- It extracts the data from a data source and stores them in a data frame or in an RDD.
- Then it processes the data using transformation functions and collects the results using action functions.
- Then it loads the processed data again to a data source according to the requirement.
Simple Spark Programming Example
A Spark application can be written in 3 steps. All you need is:
- Code to extract data from a data source.
- Code to process the extracted data.
- Code to load the data again into a data source.
On top of those three steps, we need code to create a Spark session in our application.
We know that when we submit an application to Spark, Spark creates a driver process, and that driver process executes our application using the executors it creates.
Therefore, to create the connection between the driver process and our application, we need a Spark session in our application.
Code structure might look roughly as follows (the names and paths in this sketch are only placeholders):
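import org.apache.spark.sql.SparkSession

object MySparkApplication {
  def main(args: Array[String]): Unit = {
    // Create the Spark session that connects the application to the driver process
    val spark = SparkSession.builder().master("local").appName("My Spark Application").getOrCreate()

    // 1. Extract data from a data source
    val inputDF = spark.read.csv("/path/to/input.csv")

    // 2. Process the extracted data (transformations and actions go here)
    val processedDF = inputDF

    // 3. Load the processed data back into a data source
    processedDF.write.csv("/path/to/output")

    spark.stop()
  }
}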
Let’s take a real-world example. Assume that we have to process a CSV data file, which contains students’ records.
In this CSV file, data is recorded in short forms. This means that if a student is from New York, it is recorded as “NY”. If it is California, it is recorded as “CA” and so on. Gender is recorded as “M” or “F” instead of male or female and the result of a test is stated as “P” or “F” instead of “pass” or “fail”.
So the students.csv file is something as follows:
- John,P,NJ,M
- James,P,CA,M
- Tom,P,NY,M
- Peter,P,NJ,M
- Colin,F,NY,M
These records do not make much sense on their own. To get meaningful information, we have to translate these short forms into long forms.
Let’s solve this using Spark. Keep in mind that some of you may find this example a bit difficult, but don’t worry, you are still new to Spark. Just observe the code structure.
Here we are using Scala with SBT to write the Spark application, so you need to have Scala and SBT installed on your computer. You can do the same with Java or Python, using Maven or pip, respectively.
Since we are going to use Spark API, we should mention that in our build.sbt file. So your build.sbt file contains the following:
name := "spark-application-example"
version := "0.0.1"
scalaVersion := "2.12.7"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
You can specify the name and version as you prefer, and set scalaVersion to the Scala version installed on your computer.
Create a folder and put your build.sbt file into that folder. Then create the folder structure, ./src/main/scala, in the folder you just created.
After that, assuming you named the project folder spark-application-example, it might have the following folder structure:
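spark-application-example/
    build.sbt
    src/
        main/
            scala/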
Inside the Scala directory, create a scala file named myFirstSparkApplication.scala to write our application. Let’s write our application:
/**
 * Extract data from a csv file and process the records into a more
 * meaningful form.
 */
import org.apache.spark.sql.SparkSession

object myFirstSparkApplication {

  /** defining a case class */
  case class students_case_class(Name: String, Result: String, State: String, Gender: String)

  def main(args: Array[String]): Unit = {

    /** Creating a Spark session */
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("Spark Application")
      .getOrCreate()

    /** For implicit conversions like converting RDDs to DataFrames */
    import spark.implicits._

    /** Extract data into the Spark application */
    val filePath = "/home/user/desktop/students.csv"
    val fileDF = spark.read.csv(filePath)

    /** Iterate through the records and map them into the case class */
    val processed_data = fileDF.map(row =>
      students_case_class(
        row(0).toString,
        if (row(1).toString == "P") "Pass" else "Fail",
        row(2).toString match {
          case "NJ" => "New Jersey"
          case "NY" => "New York"
          case "CA" => "California"
          case "CO" => "Colorado"
          case "NE" => "Nebraska"
          case "TX" => "Texas"
          case other => other // keep any unknown state code as it is
        },
        if (row(3).toString == "M") "Male" else "Female"
      )
    )

    processed_data.show()

    /** Load the processed data to a csv file */
    processed_data.write.mode("overwrite").csv("/home/user/desktop/output")

    spark.stop()
  }
}
The filePath variable should contain the path to the students.csv file, and when you write the processed data to a CSV file, specify a directory, like /home/user/desktop/output, to gather the output files.
Now you can go to your project’s root directory, where the build.sbt file is, and run the command sbt run.
The .show() method prints the output to the terminal, while the write step saves the output in CSV format in the /home/user/desktop/output folder.
The output produced by the .show() method will look roughly as follows:
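+-----+------+----------+------+
| Name|Result|     State|Gender|
+-----+------+----------+------+
| John|  Pass|New Jersey|  Male|
|James|  Pass|California|  Male|
|  Tom|  Pass|  New York|  Male|
|Peter|  Pass|New Jersey|  Male|
|Colin|  Fail|  New York|  Male|
+-----+------+----------+------+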
Other Components of Spark
Spark is a computing engine. It helps you process data in a cluster environment. However, it has neither a cluster manager nor a distributed storage system of its own.
This means that Spark can run computations for you, but it doesn’t know how to manage cluster resources and it has no way of its own to store the data it processes.
Therefore, you plug in a cluster resource manager, such as Apache YARN, Mesos, or Kubernetes, and a distributed storage system, such as HDFS, Amazon S3, Google Cloud Storage, or the Cassandra file system. On top of this core computing engine, Spark provides a set of packages and libraries to enhance its capabilities.
Those packages and libraries are as follows:
- Spark SQL => Allows using SQL queries on top of structured data
- Spark Streaming => Allows processing continuous data streams
- MLlib => Provides a set of machine learning algorithms
- GraphX => Allows graph computations in the Spark environment
- SparkR => R package for Spark environment
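For example, with Spark SQL you can register a data frame as a temporary view and query it with plain SQL. A small sketch, reusing the students.csv file from the earlier example (with no header row, Spark names the columns _c0, _c1, and so on):

val studentsDF = spark.read.csv("/home/user/desktop/students.csv")
studentsDF.createOrReplaceTempView("students")

// Select the name and state of every student whose result column (_c1) is "P"
spark.sql("SELECT _c0 AS name, _c2 AS state FROM students WHERE _c1 = 'P'").show()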
Frequently Asked Questions
Q #1) How do I learn Apache Spark?
Answer: You can always refer to the Apache Spark documentation, but it is sometimes hard to learn directly from the documentation. So you can try tutorials on websites like this one and on YouTube. Always try to implement what you learn; it will teach you so much more.
=> Visit Official Spark documentation
Q #2) What is Apache Spark in simple words?
Answer: Spark is a toolkit for processing data. That means if you have a large amount of data or a stream of data to process, you can do so by writing a Spark application.
Q #3) Is it easy to learn about Apache Spark?
Answer: Yes, it is. Unlike earlier processing techniques, like MapReduce, it is easy to learn. You just have to learn the basics right and you are ready.
Q #4) When should I use Spark?
Answer: You can use Spark to process both structured and unstructured data. Spark can also be used to process data streams.
Q #5) Should I learn Apache Spark?
Answer: If you are interested in big data, sure you should. Also if you are trying to build up a career in data analytics, knowing Spark will surely help you.
Q #6) Is Apache Spark a programming language?
Answer: No, it’s not. It is a processing engine. You can use programming languages like Scala and Python to build Spark applications.
Q #7) How long does it take to learn Apache Spark?
Answer: It depends on how fast you can learn the basics. If you learn the fundamentals fast enough, it won’t take long. Of course, you can never learn it completely; there is always some part you pick up while you are using it.
Therefore, our suggestion is to get a solid understanding of the basics and practice them, then learn other things while using them.
Q #8) What is the difference between Hadoop and Spark?
Answer: Hadoop is a data processing framework for distributed systems. Hadoop has its own storage system, processing method, and resource manager. Spark is only a processing engine. It has neither a storage system nor a resource manager.
You can use Spark instead of Hadoop’s processing method, on top of Hadoop’s storage system and resource manager.
Conclusion
So we have come to the end of the tutorial. If you got this far, you now know a lot about Apache Spark. As mentioned earlier, this is a high-level overview of Spark. We tried to keep things as simple as possible, so you can understand the concepts of Spark better.
Here we talked about big data: how it started and the storage and processing problems that arose because of its exponential growth.
Then we learned how Hadoop was created as a solution, along with the core Hadoop components, HDFS and MapReduce. We discussed the issues with MapReduce and how Spark addresses those problems.
Then we went through what Spark is, how it is built, the architecture of Spark, and how to write and execute a simple Spark application. Finally, we learned about the other packages and libraries Spark provides, what they are, and what they are used for.
We hope this tutorial has given you a fundamental understanding of Apache Spark. As mentioned earlier, our aim was to give you the basic knowledge of Spark and set up a strong foundation, so you can build on top of it.
Now that you have a solid understanding of Spark fundamentals, you can start working with Spark.
Good luck with the Spark journey. See you soon.