PySpark-0-Installation

Introduction

This article introduces how to install PySpark on Linux using different methods (via pip and manually). Then I will talk about some issues I met during installation. Note: before installing PySpark, we need to install Java and set the JAVA_HOME environment variable first.

Install PySpark via pip

Install PySpark with pip

The simplest way to install PySpark on Linux is to use the pip tool. The command is as follows:

pip install pyspark

To install the extra dependencies for specific components, we can append [component_name, ...]. For example, if we want the SQL component:

pip install pyspark[sql]

The default distribution of PySpark uses Hadoop 3.2 and Hive 2.3. If we want a different Hadoop/Hive version or a specific distribution URL, we can set environment variables such as PYSPARK_HADOOP_VERSION before running pip:

PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=2.7 pip install pyspark

Setting Environment Variables

After installing PySpark using pip, we need to add the following environment variables to the ~/.bashrc file to tell PySpark about the settings we want.

export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

Note: if we have multiple Python versions, like python2 and python3, we need to set these two environment variables to tell PySpark which Python to use.
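After setting these variables, we can run a quick smoke test to confirm that the pip installation works. A minimal sketch (assuming Java and the variables above are already configured):

# quick smoke test after `pip install pyspark`
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
spark.range(5).show()  # prints a small DataFrame with a single `id` column
spark.stop()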

Install PySpark Manually

Install PySpark from a downloaded distribution

  1. First, we need to download a suitable version of PySpark from the official website: https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz and then extract it with this command (here I choose PySpark 3.1.2 built with Hadoop 3.2).

    tar xf spark-3.1.2-bin-hadoop3.2.tgz
  2. Install the following dependencies with pip: pandas, numpy, pyarrow, py4j. We should also check the minimum versions of these packages (see the version-check sketch below): http://spark.apache.org/docs/latest/api/python/getting_started/install.html

    • pandas: optional, for Spark SQL
    • NumPy: required
    • pyarrow: optional, for Spark SQL
    • Py4J: required; it bridges Python code and the JVM
    • findspark: it tells the Python program where to find PySpark

Note that when we use pip to install PySpark, those dependencies are installed and configured automatically.
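To compare the installed versions of these dependencies against the minimum versions listed on the official page above, a small sketch like the following can help (assuming Python 3.8+ for importlib.metadata; it only prints what is installed, the minimum versions still need to be read from that page):

# print the installed versions of PySpark's Python dependencies
from importlib.metadata import version, PackageNotFoundError

for pkg in ["pandas", "numpy", "pyarrow", "py4j", "findspark"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")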

Setting Environment Variables

We also need to tell the system where to find PySpark, along with a few PySpark settings, so we do the following:

cd spark-3.1.2-bin-hadoop3.2  # the path to the root directory of pyspark
export SPARK_HOME=`pwd`
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export JAVA_HOME=directory_to_your_jdk

Alternatively, we can add the following lines to the ~/.bashrc file so that we do not have to repeat these commands every time before running the program.

# in ~/.bashrc
export SPARK_HOME=/path/to/spark-3.1.2-bin-hadoop3.2  # the path to the root directory of pyspark
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export JAVA_HOME=directory_to_your_jdk

Then

source ~/.bashrc
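After sourcing ~/.bashrc, a quick way to confirm that PYTHONPATH points at the right place is to import pyspark directly (a minimal check; with PYTHONPATH set, findspark is not strictly needed):

# confirm that the manually installed PySpark can be imported
import pyspark
print(pyspark.__version__)  # should match the version we downloaded, e.g. 3.1.2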

Setting Environment Variables during Runtime

We can also use the os.environ[...] = "..." method to set up the environment variables at runtime:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # path to the Java JDK
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"  # path to the Spark home
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

import findspark
findspark.init("/content/spark-3.1.1-bin-hadoop2.7")  # use findspark to locate the root of PySpark (SPARK_HOME)

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
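Once the session is created, a tiny DataFrame can confirm that everything is wired up correctly (a minimal example continuing from the code above; the data is arbitrary):

# continuing from the `spark` session created above
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()  # this is the call that fails if PYSPARK_PYTHON is misconfigured (see Problems below)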

Template when using PySpark with Google Colab

Here is a template for using PySpark at runtime in Google Colab.
Please also update the download link with the URL of the latest version of PySpark. We can check it here: https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz

!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q http://apache.forsale.plus/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install py4j

# note: each "!" command runs in its own shell, so an "export" here would not persist;
# JAVA_HOME is set with os.environ in the next cell instead
!ls /usr/lib/jvm/java-8-openjdk-amd64  # check that the JDK is where we expect
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

import findspark
findspark.init("/content/spark-3.1.1-bin-hadoop2.7")  # path to the home of PySpark (SPARK_HOME)

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Install Spark in Distributed Cluster

In industry, we usually need to install Spark on a distributed cluster rather than a local machine, so that we can leverage the power of distributed computing. To install Spark on a distributed cluster, please check this link: https://www.hadoopdoc.com/spark/spark-install-distribution
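For reference, once such a cluster is running, using it from PySpark mostly comes down to pointing the master URL at the cluster instead of local[*]. A hedged sketch, assuming a standalone master reachable at spark://master-host:7077 (the host name and memory setting are placeholders):

from pyspark.sql import SparkSession

# connect to a (hypothetical) standalone cluster master instead of running locally
spark = SparkSession.builder \
    .master("spark://master-host:7077") \
    .appName("cluster-example") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()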

Problems during Installation

  1. Not specifying the path to Python for PySpark
    If we have multiple versions of Python, like python2 and python3, and we do not add the following environment variables to specify which one to use, PySpark will raise an exception when we call the show() method to display a DataFrame (see the sketch after this list for a programmatic workaround).

    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

    Error when using df.show() to display a Spark DataFrame

    Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions. 
    Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
    ...
  2. Setting the environment variables for PySpark
    When we use pip to install PySpark, there is no need to specify SPARK_HOME or PYTHONPATH. However, when we install it manually, we do need to specify them and also use findspark.init(…) to locate the home directory of PySpark.

  3. For other common issues, check this: https://towardsdatascience.com/pyspark-debugging-6-common-issues-8ab6e7b1bde8
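For issue 1, besides exporting the variables in ~/.bashrc, a programmatic workaround is to point both variables at the interpreter that runs the driver before creating the session. A minimal sketch:

import os, sys

# make the workers use the same interpreter as the driver to avoid the
# "Python in worker has different version ..." exception
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()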


Reference

[1] PySpark Org: http://spark.apache.org/docs/latest/api/python/getting_started/install.html

[2] StackOverflow: https://stackoverflow.com/questions/48260412/environment-variables-pyspark-python-and-pyspark-driver-python

[3] How to use PySpark on your computer: https://towardsdatascience.com/how-to-use-pyspark-on-your-computer-9c7180075617

[4] Issues in PySpark: https://towardsdatascience.com/pyspark-debugging-6-common-issues-8ab6e7b1bde8

[5] Tutorial of Spark (using Scala): https://www.hadoopdoc.com/spark/spark-sparkcontext
