
Apache Spark and PySpark

In the previous blog, I talked about installing the required binary files and setting up the Path to run Apache Spark on Windows. Apache Spark is an analytics engine and parallel computation framework with Scala, Python, and R interfaces.

The Spark framework can load data directly from disk, memory, and other data storage technologies such as Amazon S3, Hadoop Distributed File System (HDFS), HBase, and others.

Advantages of Using Spark

  • Scalability

Spark works with almost every data source out there. It runs on Hadoop, on Kubernetes, standalone, or in the cloud. We can run Spark using its standalone cluster mode on EC2, on Hadoop YARN, or on Kubernetes.

We can also deploy Spark on a cluster if we'd like to run it in distributed mode.
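
The deployment target is usually picked through the master URL when a Spark session is created. The snippet below is only a sketch to illustrate this; the app name and cluster URLs are placeholders, not part of this setup:

from pyspark.sql import SparkSession

# Local mode, using all available cores (what this blog series uses).
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("demo-app") \
    .getOrCreate()

# For a cluster, only the master URL changes, for example:
#   "yarn"                    -> Hadoop YARN
#   "spark://host:7077"       -> Spark standalone cluster
#   "k8s://https://host:port" -> Kubernetes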

In the current and upcoming blogs, I will demonstrate the usage of Spark locally.

  • Speed

In terms of execution speed, unlike Hadoop MapReduce, Spark keeps intermediate output in memory rather than writing every intermediate result to disk. This hugely cuts down the execution time of an operation, making tasks run up to 100x faster than a standard MapReduce job.
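
As a small illustration of this in-memory behaviour, PySpark lets you cache a DataFrame that will be reused. The sketch below assumes a SparkSession named spark is already available, and the data is generated purely for the example:

# Assumes an existing SparkSession named `spark`; numbers are generated for illustration.
df = spark.range(0, 1_000_000)        # simple DataFrame of ids
df_cached = df.cache()                # ask Spark to keep it in memory after first use

# Both actions reuse the in-memory data instead of recomputing it from scratch.
print(df_cached.count())
print(df_cached.filter("id % 2 = 0").count())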

Setup Steps

I have already discussed setting up the following environment variables and the Path in my previous blog (a sketch for setting them from a notebook session follows the list):

  • SPARK_HOME
  • JAVA_HOME
  • HADOOP_HOME
  • winutils.exe
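
If any of these are missing from the Windows environment, they can also be set for the current notebook session before initializing findspark. This is only a sketch; the paths are placeholders and must match your own installation:

import os

# Placeholder paths for illustration; replace them with your own install locations.
os.environ["SPARK_HOME"] = r"C:\path\to\spark-3.0.1-bin-hadoop3.2"
os.environ["JAVA_HOME"] = r"C:\path\to\jdk1.8.0"
os.environ["HADOOP_HOME"] = r"C:\path\to\hadoop"   # folder whose bin contains winutils.exe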

Spark Env. and PATH setup

In this blog, I will provide the steps for the PySpark installation process using Anaconda and a Jupyter Notebook.

conda install -c conda-forge findspark
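
findspark is also published on PyPI, so if you are not using conda the equivalent pip command should work as well:

pip install findspark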

After performing the above step, you can import the module and check the PySpark location from the Jupyter Notebook.

import findspark

findspark.find()  # returns the Spark installation folder that findspark located

output:

'C:\\Users\\limbu\\spark-3.0.1-bin-hadoop3.2'
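
With findspark pointing at the right folder, a quick sanity check (a minimal sketch, not part of the original steps) is to initialize it and import pyspark:

import findspark

findspark.init()            # puts SPARK_HOME's Python libraries on sys.path

import pyspark
print(pyspark.__version__)  # should report 3.0.1 for this download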

Determining the PATH variables

Before working with PySpark, we need to make sure the path variables are correctly initialized. Since the Apache Spark framework with Python is a fast-evolving ecosystem, this requires a few careful steps, and having the wrong path setup can cause errors while running PySpark.

We can run the following commands to check the proper PATH setup:

import os

print(os.environ['SPARK_HOME'])
print(os.environ['JAVA_HOME'])
print(os.environ['HADOOP_HOME']) 

output:

C:\Users\limbu\spark-3.0.1-bin-hadoop3.2
C:\Program Files\Java\jdk1.8.0_281
C:\Users\limbu\spark-3.0.1-bin-hadoop3.2
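
Note that os.environ[...] raises a KeyError if a variable is missing; a slightly more forgiving check is sketched below:

import os

# Print each expected variable, or a warning if it has not been set yet.
for name in ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME"):
    value = os.environ.get(name)
    if value:
        print(f"{name} = {value}")
    else:
        print(f"{name} is NOT set - fix this before running PySpark")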