PySpark-0-Installation
Introduction
This article introduces how to install PySpark on Linux using two different methods: pip and a manual installation. I will then discuss some issues I ran into during installation. Note: before installing PySpark, we need to install Java and set the JAVA_HOME environment variable first.
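As a rough sketch of that prerequisite on a Debian/Ubuntu-style system (the package name and the JAVA_HOME path below are assumptions; adjust them to your distribution and JDK version):

sudo apt-get install openjdk-8-jdk
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path; check where your JDK actually lives
java -version   # verify that Java is available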
Install PySpark via pip
Install PySpark with pip
The simplest way to install PySpark on Linux is to use the pip tool. The command is as follows:
pip install pyspark
To pull in the extra dependencies for particular components, we can append [component_name, ...]. For example, if we want to install the dependencies for the SQL component, then:
pip install pyspark[sql]
The default distribution uses Hadoop 3.2 and Hive 2.3. If we want a different Hadoop version or a different distribution mirror, we can set the corresponding environment variables before running pip:
PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
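Once the installation finishes, a quick sanity check (just a sketch) is to import the package and print its version:

python3 -c "import pyspark; print(pyspark.__version__)"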
Setting Environment Variables
After installing PySpark with pip, we need to add the following environment variables to the ~/.bashrc file to tell PySpark about the settings we want.
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: if we have multiple Python versions installed, such as Python 2 and Python 3, we need to set these two environment variables to tell PySpark which Python to use.
Install PySpark Manually
Install PySpark from a downloaded release
First, we need to download a suitable version of Spark from the official website: https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz, and then extract it with the following command (here I choose Spark 3.1.2 built for Hadoop 3.2):
tar xf spark-3.1.2-bin-hadoop3.2.tgz
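The extracted directory ships its own launcher scripts, so as a quick sanity check (a sketch that assumes Java is already configured) we can start the bundled PySpark shell directly:

./spark-3.1.2-bin-hadoop3.2/bin/pyspark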
Next, install the following dependencies with pip:

pip install pandas numpy pyarrow py4j findspark

We should also check the minimum versions of those packages (see the snippet after this list): http://spark.apache.org/docs/latest/api/python/getting_started/install.html
- pandas: optional, for Spark SQL
- NumPy: required
- pyarrow: optional, for Spark SQL
- Py4J: required; it lets the Python program drive the JVM
- findspark: it tells the Python program where to find PySpark
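As a small sketch for that version check, the snippet below just prints whatever is installed so the output can be compared against the minimum versions in the link above:

# print the installed versions of the packages listed above
import importlib

for name in ["pandas", "numpy", "pyarrow", "py4j", "findspark"]:
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "version attribute not found"))
    except ImportError:
        print(name, "is not installed")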
Note that when we use pip to install PySpark, it installs these dependencies automatically and the system already knows where to find PySpark, so the steps in the rest of this section are only needed for a manual installation.
Setting Environment Variables
We also need to tell the system where to find PySpark, together with a few settings for it, so we do the following:
cd spark-3.1.2-bin-hadoop3.2   # the root directory of the extracted Spark distribution
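# The exports below are a sketch of what this step typically sets; the paths and the
# py4j zip name are assumptions based on the 3.1.2 distribution above, so adjust them.
export SPARK_HOME=$(pwd)                  # the directory we just cd'ed into
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3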
Alternatively, we can add the same lines to the ~/.bashrc file so that we do not have to repeat them before every run in the future.
# in ~/.bashrc
Then reload the file:
source ~/.bashrc
Setting Environment Variables during Runtime
We can also use the os.environ[..] = ".." approach to set up the environment variables at runtime:
import os
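# The lines below sketch how this block typically continues; the paths are
# examples (assumptions), not the only valid values.
import findspark

os.environ["SPARK_HOME"] = "/path/to/spark-3.1.2-bin-hadoop3.2"   # hypothetical path to the extracted distribution
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

findspark.init()   # uses SPARK_HOME to locate PySpark before we import pyspark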
Template when using PySpark with Google Colab
Here is a template for setting up PySpark at runtime in Google Colab.
Please also update the download link to the URL of the latest PySpark version; we can check it here: https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!apt-get update
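# The commands below are a sketch of how this setup cell typically continues; the
# package name, Spark version and download URL are assumptions, so update them to
# the latest release as noted above.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark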
import os
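# A sketch of the runtime configuration; the paths are assumptions that match a
# default Colab runtime and the 3.1.2 archive downloaded above.
import findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()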
Install Spark on a Distributed Cluster
In industry, we usually need to install Spark on a distributed cluster rather than a local machine, so that we can leverage the power of distributed computing. To install Spark on a distributed cluster, please check this link: https://www.hadoopdoc.com/spark/spark-install-distribution
Problems during Installation
Not specifying the path to Python for PySpark
If we don't add the following environment variables to specify the path to Python when we have multiple Python versions installed, such as Python 2 and Python 3, an exception pops up when we use the show() method to display a DataFrame.

export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

Error when using df.show() to display a Spark DataFrame:

Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.
Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
...

Set the environment variables for PySpark
When we use pip to install PySpark, there is no need to specify SPARK_HOME or PYTHONPATH. However, when we install it manually, we do need to specify them and also call findspark.init(...) to locate the PySpark home directory. For other common issues, check this: https://towardsdatascience.com/pyspark-debugging-6-common-issues-8ab6e7b1bde8
Reference
[1] PySpark installation guide: http://spark.apache.org/docs/latest/api/python/getting_started/install.html
[2] StackOverflow: https://stackoverflow.com/questions/48260412/environment-variables-pyspark-python-and-pyspark-driver-python
[3] How to use PySpark on your computer: https://towardsdatascience.com/how-to-use-pyspark-on-your-computer-9c7180075617
[4] Common issues in PySpark: https://towardsdatascience.com/pyspark-debugging-6-common-issues-8ab6e7b1bde8
[5] Tutorial of Spark (using Scala): https://www.hadoopdoc.com/spark/spark-sparkcontext