Random Points

3 Easy Steps to Set Up Pyspark

Starting with Spark 2.2, it is now super easy to set up pyspark.

  1. Download Spark

    Download the spark tarball from the Spark website and untar it:

    $ tar zxvf spark-2.2.0-bin-hadoop2.7.tgz

  2. Install pyspark

    If you use conda, simply do:

    $ conda install pyspark

    or if you prefer pip, do:

    $ pip install pyspark

    Note that the py4j library is installed automatically as a dependency.

  3. Set up environment variables

    Point Spark at its install directory and at your Python executable; here I am assuming Spark and Anaconda Python are both under my home directory. Set the following environment variables:

    export SPARK_HOME=~/spark-2.2.0-bin-hadoop2.7
    export PYSPARK_PYTHON=~/anaconda/bin/python

    You can additionally set up ipython as your pyspark prompt as follows:

    export PYSPARK_DRIVER_PYTHON=~/anaconda/bin/ipython

That's it! There is no need to mess with $PYTHONPATH or do anything special with py4j, as you had to prior to Spark 2.2.
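Before launching pyspark itself, the setup can be sanity-checked from plain Python. This is just a sketch; the helper names below are mine, not part of Spark:

```python
import importlib.util
import os

def missing_spark_vars(env=os.environ):
    """Return the Spark-related environment variables that are not set."""
    required = ("SPARK_HOME", "PYSPARK_PYTHON")
    return [v for v in required if v not in env]

def importable(modules=("pyspark", "py4j")):
    """Map each module name to whether Python can locate it."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

if __name__ == "__main__":
    print("Missing env vars:", missing_spark_vars() or "none")
    print("Importable:", importable())
```

If either check fails, revisit steps 2 and 3 above.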

You can run a simple line of code to test that pyspark is installed correctly:

$ pyspark
Python 3.5.2 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:52:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 3.5.2 (default, Jul  2 2016 17:52:12)
SparkSession available as 'spark'.

In [1]: sc.parallelize(range(100)).reduce(lambda x, y: x + y)
Out[1]: 4950
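The smoke test sums the integers 0 through 99; the same fold in plain Python (no Spark needed) confirms the expected answer:

```python
from functools import reduce

# Same fold as sc.parallelize(range(100)).reduce(lambda x, y: x + y),
# but run locally: add up 0 + 1 + ... + 99.
total = reduce(lambda x, y: x + y, range(100))
print(total)  # 4950, matching the pyspark output above
```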