Blog 0.3: Spark Resilient Distributed Dataset (RDD) Transformations
At a high level, a Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDD concept
Spark revolves around the concept of the RDD, which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create an RDD (both are sketched after this list):
- Parallelizing an existing collection in your driver program
- Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
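A minimal sketch of both routes, assuming the SparkSession named spark that is built in the next section; the file name data.txt is a hypothetical placeholder:

# Route 1: parallelize an existing collection in the driver program
numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Route 2: reference a dataset in external storage (data.txt is a placeholder path)
lines = spark.sparkContext.textFile("data.txt")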
Creating an RDD
RDDs are created by starting with a file in the Hadoop file system (or an existing Scala collection in the driver program) and transforming it.
We can also persist an RDD in memory, which allows it to be reused efficiently across operations (an example appears later in this post). The following PySpark code creates a Spark session, which is the entry point for the examples below.
from pyspark.sql import SparkSession

# Build (or retrieve) a Spark session for the examples in this post
spark = SparkSession \
    .builder \
    .appName("RDD ex") \
    .getOrCreate()
spark

output:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.0.1
Master: local[*]
AppName: RDD ex
Parallelizing Data into an RDD
Raw data held in the driver program can be parallelized into an RDD and then transformed, for example into a DataFrame that can be combined with other DataFrames. In this example we use Row to create a DataFrame from a parallelized collection.
from pyspark.sql import Row

spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()
output:
DataFrame[_1: bigint]
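Since no column names were supplied, Spark auto-generates the name _1. Passing explicit names to toDF() replaces it; a small sketch (the column name "id" is an arbitrary choice):

# Supplying a column name replaces the auto-generated _1
spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF(["id"])
# DataFrame[id: bigint]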
Using a Lambda Function with an RDD
We can also build key-value pairs using a lambda function. Here, map transforms each number into a pair whose value repeats the letter "a" that many times.
rdd = spark.sparkContext.parallelize(range(1, 4)).map(lambda x: (x, "a" * x))
We can use the collect() method to bring the elements of the RDD back to the driver program as a list. Note that map is a lazy transformation: nothing is computed until an action such as collect() is called.

rdd.collect()

output:
[(1, 'a'), (2, 'aa'), (3, 'aaa')]
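As mentioned earlier, an RDD can be kept in memory for efficient reuse. A minimal sketch using the cache() method on the rdd built above; the commented results are what this small example returns:

# Persist the RDD in memory so later actions reuse the computed partitions
rdd.cache()

# Both actions reuse the cached data instead of recomputing the lineage
rdd.count()   # 3
rdd.first()   # (1, 'a')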
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
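For example, map is a transformation and reduce is an action; a short sketch (the variable name squares is my own):

# map is a transformation: it is recorded lazily and returns a new RDD
squares = spark.sparkContext.parallelize([1, 2, 3, 4]).map(lambda x: x * x)

# reduce is an action: it runs the computation and returns a value to the driver
squares.reduce(lambda a, b: a + b)   # 30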
I will cover transformations and actions in more detail in upcoming blogs.