Introduction to Apache Spark Part 1
By Fadi Maalouli and Rick Hightower
Overview
Apache Spark, an open source cluster computing system, is growing fast. Apache Spark has a growing ecosystem of libraries and frameworks that enable advanced data analytics. Apache Spark's rapid success is due to its power and ease of use. It is more productive and has faster runtimes than typical MapReduce-based analysis. Apache Spark provides in-memory, distributed computing. It has APIs in Java, Scala, Python, and R. The Spark ecosystem is shown below.
The entire ecosystem is built on top of the core engine. The core enables in-memory computation for speed, and its API has support for Java, Scala, Python, and R. Spark Streaming enables processing streams of data in real time. Spark SQL enables users to query structured data, and you can do so with your favorite language. A DataFrame resides at the core of Spark SQL; it holds data as a collection of rows, and each column in a row is named. With DataFrames you can easily select, plot, and filter data. MLlib is a machine learning framework. GraphX is an API for graph-structured data. This was a brief overview of the ecosystem.
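To give a small taste of the DataFrame API mentioned above, here is a minimal sketch in the Scala shell. It assumes a hypothetical people.json file with name and age fields; the sqlContext is provided by the shell.
// Read a JSON file into a DataFrame (people.json is a hypothetical example file)
val people = sqlContext.read.json("people.json")
// Select just the name column and display it
people.select("name").show()
// Filter the rows where age is greater than 21 and display them
people.filter(people("age") > 21).show()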
A little history about Apache Spark:
- Originally developed in 2009 in the UC Berkeley AMP lab, it became open source in 2010, and it is now a top-level Apache Software Foundation project.
- Has about 12,500 commits made by about 630 contributors (as seen on the Apache Spark GitHub repo).
- Mostly written in Scala.
- Google search interest for Apache Spark has skyrocketed recently, indicating a wide range of interest. (108,000 searches in July according to Google AdWords tools, about ten times more than for Microservices.)
- Some of Spark's distributors: IBM, Oracle, DataStax, BlueData, Cloudera...
- Some of the applications that are built using Spark: Qlik, Talen, Tresata, AtScale, Platfora...
The reason people are so interested in Apache Spark is that it puts the power of Hadoop in the hands of developers. It is easier to set up an Apache Spark cluster than a Hadoop cluster. It runs faster. And it is a lot easier to program. It puts the promise and power of Big Data and real-time analysis in the hands of the masses. With that in mind, let's introduce Apache Spark in this quick tutorial.
Download Spark and How to Use the Interactive Shell
A great way to experiment with Apache Spark is to use the available interactive shells. There is a Python Shell and a Scala shell.
To download Apache Spark, go to the Spark downloads page and get the latest pre-built version so we can run the shell out of the box.
Right now Apache Spark is at version 1.4.1, released on July 15, 2015.
Unzip Spark
tar -xvzf ~/spark-1.4.1-bin-hadoop2.4.tgz
To run the Python shell
cd spark-1.4.1-bin-hadoop2.4
./bin/pyspark
We won't use the Python shell in this section.
The Scala interactive shell runs on the JVM, so it enables you to use Java libraries.
To run the Scala shell
cd spark-1.4.1-bin-hadoop2.4
./bin/spark-shell
You should see something like this:
The Scala shell welcome message
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.1
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/08/24 21:58:29 INFO SparkContext: Running Spark version 1.4.1
The following is a simple exercise just to get you started with the shell. You might not understand what we are doing right now but we will explain in detail later. With the Scala shell, do the following:
Create a textFile RDD from the README file in Spark
val textFile = sc.textFile("README.md")
Get the first element in the RDD textFile
textFile.first()
res3: String = # Apache Spark
You can filter the RDD textFile to return a new RDD that contains all the lines with the word Spark, then count its lines.
Create the filtered RDD linesWithSpark and count its lines
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()
res10: Long = 19
To find the line with the most words in the RDD textFile, do the following. Using the map method, map each line in the RDD to the number of words it contains by splitting on spaces. Then use the reduce method to find the line that has the most words.
Find the line in the RDD textFile that has the most words
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res11: Int = 14
So the longest line has 14 words.
You can also call Java methods, for example Math.max(), because the arguments to map and reduce are Scala function literals.
Importing Java methods in the Scala shell
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res12: Int = 14
We can easily cache data in memory. Let's cache the filtered RDD linesWithSpark:
linesWithSpark.cache()
res13: linesWithSpark.type = MapPartitionsRDD[8] at filter at <console>:23
linesWithSpark.count()
res15: Long = 19
This was a brief overview of how to use the Spark interactive shell.
RDDs
Spark enables users to execute tasks in parallel on a cluster. This parallelism is made possible by one of the main components of Spark, the RDD. An RDD (Resilient Distributed Dataset) is a representation of data. An RDD is data that can be partitioned on a cluster (sharded data, if you will). The partitioning enables the execution of tasks in parallel. The more partitions you have, the more parallelism you can get. The diagram below is a representation of an RDD:
Think of each column as a partition; you can easily assign these partitions to nodes on a cluster.
In order to create an RDD, you can read data from external storage; for example, from Cassandra, Amazon Simple Storage Service, HDFS, or any source that offers a Hadoop input format. You can also create an RDD by reading a text file, an array, or JSON. On the other hand, if the data is local to your application, you just need to parallelize it; then you will be able to apply all the Spark features to it and do analysis in parallel across the Apache Spark cluster. To test it out, with a Scala Spark shell:
Make an RDD thingsRDD from a list of words
val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle"))
thingsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:24
Count the words in the RDD thingsRDD
thingsRDD.count()
res16: Long = 5
In order to work with Spark, you need to start with a Spark Context. When you are using a shell, the Spark Context already exists as sc. When we call the parallelize method on the Spark Context, we get an RDD that is partitioned and ready to be distributed across nodes.
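Outside the shell you create the Spark Context yourself. Here is a minimal sketch of a standalone Scala program, assuming the spark-core dependency is on the classpath; the app name and local master are just example settings.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Name the application and run it locally, using all available cores
    val conf = new SparkConf().setAppName("Simple Spark App").setMaster("local[*]")
    // Create the Spark Context; in the shell this is done for you and exposed as sc
    val sc = new SparkContext(conf)
    // Parallelize a local collection into a partitioned RDD and count it
    val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle"))
    println(thingsRDD.count())
    // Stop the context when the application is done
    sc.stop()
  }
}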
What can we do with an RDD?
With an RDD, we can either transform the data or take actions on that data. With a transformation we can change its format, search for something, filter the data, and so on. With actions you pull data out, collect data, and even count().
For example, let's create an RDD textFile from the text file README.md available in Spark; this file contains lines of text. When we read the file into the RDD with textFile, the data gets partitioned into lines of text, which can be spread across the cluster and operated on in parallel.
Create RDD textFile from README.md
val textFile = sc.textFile("README.md")
Count the lines
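A minimal sketch (the actual count depends on the README.md in your Spark distribution):
// Count the number of lines in the textFile RDD
textFile.count()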
Create the filtered RDD linesWithSpark
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Using the previous diagram, where we showed what a textFile RDD looks like, the RDD linesWithSpark will look like the following:
It is worth mentioning that we also have what is called a pair RDD; this kind of RDD is used when we have key/value paired data. For example, if we have data like the following table, matching fruits to their colors:
We can execute a groupByKey() transformation on the fruit data to get:
pairRDD.groupByKey()
Banana [Yellow]
Apple [Red, Green]
Kiwi [Green]
Figs [Black]
This transformation grouped two values (Red and Green) with one key (Apple). These have been examples of transformations so far.
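Here is a minimal sketch of how such a pair RDD could be built and grouped in the shell; the fruit/color pairs are just the example data from the table above.
// Build a pair RDD of (fruit, color) key/value pairs
val pairRDD = sc.parallelize(Seq(("Apple", "Red"), ("Apple", "Green"), ("Banana", "Yellow"), ("Kiwi", "Green"), ("Figs", "Black")))
// Group the colors by fruit key and bring the result back to the driver
pairRDD.groupByKey().collect().foreach(println)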
Once we have filtered an RDD, we can collect/materialize its data and make it flow into our application; this is an example of an action. Once we do this, all the data in the RDD is gone, but we can still call some operations on the RDD's data since it is still in memory.
Collect or materialize the data in the linesWithSpark RDD
linesWithSpark.collect()
It is important to note that every time we call an action in Spark, for example a count() action, Spark will go over all the transformations and computations done up to that point and then return the count; this will be somewhat slow. To fix this problem and increase performance, you can cache an RDD in memory. This way, when you call an action time after time, you won't have to start the process from the beginning; you just get the results of the cached RDD from memory.
Caching the RDD linesWithSpark
linesWithSpark.cache()
If you would like to delete the RDD linesWithSpark from memory, you can use the unpersist() method:
Deleting linesWithSpark from memory
linesWithSpark.unpersist()
Otherwise, Spark automatically deletes the oldest cached RDDs using least-recently-used (LRU) logic.
Here is a list summarizing the Spark process from start to finish (a short end-to-end sketch follows the list):
- Create an RDD of some sort of data.
- Transform the RDD's data, by filtering for example.
- Cache the transformed or filtered RDD if it needs to be reused.
- Do some actions on the RDD, like pulling the data out, counting, storing the data to Cassandra, etc.
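As a sketch, those four steps look something like this in the shell; saveAsTextFile stands in for the storage step, and the output path is hypothetical.
// 1. Create an RDD from a text file
val textFile = sc.textFile("README.md")
// 2. Transform it by filtering for lines that mention Spark
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// 3. Cache the filtered RDD so repeated actions reuse the in-memory data
linesWithSpark.cache()
// 4. Run actions: count the lines and store them (hypothetical output directory)
linesWithSpark.count()
linesWithSpark.saveAsTextFile("/tmp/lines-with-spark")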
Here is a list of some of the transformations that can be used on an RDD (a brief sketch of a few of them follows the list):
- filter()
- map()
- sample()
- union()
- groupByKey()
- sortByKey()
- combineByKey()
- subtractByKey()
- mapValues()
- keys()
- values()
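A brief sketch of a few of these transformations, reusing the fruit pairRDD from the earlier sketch; the numbers RDD is made up just for this example.
// mapValues transforms only the values, keeping the keys as they are
val upperColors = pairRDD.mapValues(color => color.toUpperCase)
// sortByKey sorts a pair RDD by its keys
val sortedFruit = upperColors.sortByKey()
// keys and values pull out just the keys or just the values
val fruitNames = pairRDD.keys
val colors = pairRDD.values
// union combines two RDDs, and sample takes a random subset of one
val numbers = sc.parallelize(1 to 10)
val moreNumbers = numbers.union(sc.parallelize(11 to 20))
val someNumbers = moreNumbers.sample(withReplacement = false, fraction = 0.5)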
Here is a list of some of the actions that can be performed on an RDD (a brief sketch of a few of them follows the list):
- collect()
- count()
- first()
- countByKey()
- saveAsTextFile()
- reduce()
- take(n)
- collectAsMap()
- lookup(key)
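And a brief sketch of a few of these actions, again using the fruit pairRDD; the output path is hypothetical.
// take(n) returns the first n elements to the driver
pairRDD.take(2)
// countByKey returns a Map of each key to the number of elements with that key
pairRDD.countByKey()
// collectAsMap collects a pair RDD as a Map (only one value per key is kept)
pairRDD.collectAsMap()
// lookup(key) returns all the values for a given key
pairRDD.lookup("Apple")
// saveAsTextFile writes the RDD out as text files (hypothetical output directory)
pairRDD.saveAsTextFile("/tmp/fruit-colors")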
For the full lists with their descriptions, check out the Spark documentation.
Have a team who wants to get started with Apache Spark?
Check out our Apache Spark QuickStart Course for real-time analytics.
This two-day course introduces experienced developers and architects to Apache Spark™. Developers will be enabled to build real-world, high-speed, real-time analytics systems. This course has extensive hands-on examples. The idea is to introduce the key concepts that make Apache Spark™ such an important technology. This course should prepare architects, development managers, and developers to understand the possibilities with Apache Spark™.