Blog 0.1: PySpark Installation and Setup
Apache Spark and PySpark
In the previous blog, I talked about installing the required binary files and setting up the Path to run Apache Spark on Windows. Apache Spark is an analytics engine and parallel computation framework with Scala, Python, and R interfaces.
The Spark framework can load data directly from disk, memory, and other data storage technologies such as Amazon S3, Hadoop Distributed File System (HDFS), HBase, and others.
Advantages of Using Spark
- Scalability
Spark works with almost every data source out there. It runs on Hadoop, Kubernetes, standalone, or in the cloud: we can use its standalone cluster mode on EC2, run it on Hadoop YARN, or run it on Kubernetes.
We can also deploy Spark on a cluster if we'd like to run it in distributed mode.
In the current and upcoming blogs, I will demonstrate the usage of Spark locally.
- Speed
In terms of execution speed, unlike Hadoop, Spark keeps intermediate output in memory rather than writing every intermediate result to disk. This hugely cuts down the execution time of an operation, making tasks run up to 100x faster than a standard MapReduce job.
Setup Steps
I have already discussed setting up the following environment variables and the Path in my previous blog:
- SPARK_HOME
- JAVA_HOME
- HADOOP_HOME
- winutils.exe
In this blog, I will provide the steps for the PySpark installation process using Anaconda and Jupyter Notebook. Start by installing the findspark package from the conda-forge channel:
conda install -c conda-forge findspark
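findspark simply locates the Spark installation pointed to by SPARK_HOME and adds it to the Python path, so that pyspark can be imported from a regular Jupyter kernel without any extra configuration.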
After performing the above step, you can import the module in a Jupyter notebook and check where your Spark installation is located:
import findspark
findspark.find()
output:
'C:\\Users\\limbu\\spark-3.0.1-bin-hadoop3.2'
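Once findspark can locate the Spark folder, we can call findspark.init() before importing pyspark so that the Spark libraries are added to the Python path, and then start a local session to confirm everything works. Below is a minimal sketch; the application name and the local[*] master are just illustrative choices:
import findspark
findspark.init()  # uses SPARK_HOME to locate the Spark installation

from pyspark.sql import SparkSession

# Start a local SparkSession and print its version to verify the setup
spark = SparkSession.builder.master("local[*]").appName("install-check").getOrCreate()
print(spark.version)
spark.stop()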
Determining the PATH Variables
Before working with PySpark, we need to make sure the path variables are correctly initialized. Since the Apache Spark framework with Python is a fast-evolving infrastructure, it requires several setup steps, and a wrong path setup can cause errors while running PySpark.
We can run the following commands to check the proper PATH setup:
import os
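# Each print below should show the installation directory the variable points to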
print(os.environ['SPARK_HOME'])
print(os.environ['JAVA_HOME'])
print(os.environ['HADOOP_HOME'])
output:
C:\Users\limbu\spark-3.0.1-bin-hadoop3.2
C:\Program Files\Java\jdk1.8.0_281
C:\Users\limbu\spark-3.0.1-bin-hadoop3.2
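Note that os.environ['...'] raises a KeyError if a variable has not been set. A slightly more defensive check (just a sketch) reports which variable is missing instead of failing:
import os

# Print each required variable, or a warning if it is not set
for var in ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set - check your environment variables>"))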