3 Easy Steps to Set Up PySpark

Starting with Spark 2.2, it is now super easy to set up pyspark.

  1. Download Spark

    Download the Spark tarball from the Spark website and untar it:

    $ tar zxvf spark-2.2.0-bin-hadoop2.7.tgz

  2. Install pyspark

    If you use conda, simply do:

    $ conda install pyspark

    or if you prefer pip, do:

    $ pip install pyspark

    Note that the py4j dependency is automatically included.

  3. Set up environment variables

    Tell Spark where its directory is and which Python executable to use; here I am assuming Spark and Anaconda Python are both installed under my home directory. Set the following environment variables:

    export SPARK_HOME=~/spark-2.2.0-bin-hadoop2.7
    export PYSPARK_PYTHON=~/anaconda/bin/python

    You can additionally set up IPython as your pyspark prompt as follows:

    export PYSPARK_DRIVER_PYTHON=~/anaconda/bin/ipython

That's it! There is no need to mess with $PYTHONPATH or do anything special with py4j, as you had to prior to Spark 2.2.
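Because pip/conda install pyspark as an ordinary Python package, a quick sanity check (the version printed is simply whatever you installed, 2.2.0 here) is to import it from a plain Python interpreter with no extra path setup:

$ python -c "import pyspark; print(pyspark.__version__)"
2.2.0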

You can run a simple line of code to test that pyspark is installed correctly:

$ pyspark
Python 3.5.2 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:52:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 3.5.2 (default, Jul  2 2016 17:52:12)
SparkSession available as 'spark'.

In [1]: sc.parallelize(range(100)).reduce(lambda x, y: x + y)
Out[1]: 4950
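Since pyspark is installed as a regular package, the same computation also runs outside the pyspark shell. Here is a minimal sketch of a standalone script (the file name sum_example.py and the app name are my own choices), runnable with plain python sum_example.py or with spark-submit:

# sum_example.py -- minimal standalone sketch (file and app names are illustrative)
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the pyspark shell does this for you and
# exposes it as `spark`.
spark = SparkSession.builder.appName("sum_example").getOrCreate()
sc = spark.sparkContext

# Same test as the interactive session above: sum the integers 0..99.
total = sc.parallelize(range(100)).reduce(lambda x, y: x + y)
print(total)  # 4950

spark.stop()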
