Starting with Spark 2.2, it is now super easy to set up pyspark.
- Download Spark
Download the Spark tarball from the Spark website and untar it:
$ tar zxvf spark-2.2.0-bin-hadoop2.7.tgz
- Install pyspark
If you use conda, simply do:
$ conda install pyspark
or if you prefer pip, do:
$ pip install pyspark
Note that the py4j library will be included automatically.
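To double-check the install before moving on, a quick import from a plain Python prompt is enough. This is just a minimal sanity check; the printed versions will depend on what you installed:
# Verify that pyspark and the bundled py4j are importable
import pyspark
import py4j

print(pyspark.__version__)   # e.g. 2.2.0
print(py4j.__version__)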
- Set up environment variables
Point to where the Spark directory is and where your Python executable is; here I am assuming Spark and Anaconda Python are both under my home directory. Set the following environment variables:
export SPARK_HOME=~/spark-2.2.0-bin-hadoop2.7
export PYSPARK_PYTHON=~/anaconda/bin/python
You can additionally set up ipython as your pyspark prompt as follows:
export PYSPARK_DRIVER_PYTHON=~/anaconda/bin/ipython
That's it! There is no need to mess with $PYTHONPATH or do anything special with py4j as you had to prior to Spark 2.2.
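In fact, since pyspark is now a regular Python package, you can also create a SparkSession from an ordinary Python script instead of the pyspark shell. Here is a minimal sketch; the file name, app name, and local[*] master are illustrative choices, not requirements:
# standalone_example.py -- assumes pyspark was installed via conda/pip as above
from pyspark.sql import SparkSession

# Start a local SparkSession; "local[*]" uses all available cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("standalone-example")
         .getOrCreate())

print(spark.sparkContext.parallelize(range(100)).sum())  # prints 4950

spark.stop()
Running it with the Anaconda Python configured above should print the same 4950 as the shell test below.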
You can run a simple line of code to test that pyspark is installed correctly:
$ pyspark
Python 3.5.2 |Anaconda custom (x86_64)| (default, Jul 2 2016, 17:52:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Python version 3.5.2 (default, Jul 2 2016 17:52:12)
SparkSession available as 'spark'.
In [1]: sc.parallelize(range(100)).reduce(lambda x, y: x + y)
Out[1]: 4950
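Since the SparkSession is already available as 'spark', you can also do a quick DataFrame check from the same prompt. A small illustrative example; the data and column names are arbitrary:
In [2]: df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])

In [3]: df.count()
Out[3]: 3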