These are the steps required to get PySpark transformations working on an EMR cluster (release 5.29).
Step 1: Ensure PySpark is installed on the worker (Core and Task) nodes
pip-3.6 install pyspark
pip-3.6 install pandas wheel
Check they’re installed properly with:
pip-3.6 show pandas pyspark
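A quick way to double-check that both packages are importable by the interpreter Spark will use (a minimal sanity check; the versions printed depend on what pip resolved):
python3 -c "import pyspark, pandas; print(pyspark.__version__, pandas.__version__)"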
Step 2: Add environment variables (all nodes):
Edit /etc/bashrc (as root) and add:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip:/usr/lib/python3.6/site-packages
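Reload the file in the current shell and confirm the variables took effect (new SSH sessions pick them up automatically; this only verifies the current one):
source /etc/bashrc
echo $PYSPARK_PYTHON    ## should print /usr/bin/python3
$PYSPARK_PYTHON -c "import pyspark"    ## exits silently when PySpark is importable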
Step 3: On the Master node, edit environment variables
sudo su
chmod +x /etc/spark/conf/spark-env.sh
## Add the below to /etc/spark/conf/spark-env.sh
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export SPARK_DIST=/usr/lib/spark
export PYTHONPATH=/usr/lib/python3.6/site-packages:$SPARK_DIST/python/lib/py4j-0.10.7-src.zip:$SPARK_DIST/python:$PYTHONPATH
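One way to append these lines in a single step, assuming you are still in the root shell from the sudo su above (the quoted EOF keeps $SPARK_DIST literal so it expands when Spark sources the file, not when it is written):
cat <<'EOF' >> /etc/spark/conf/spark-env.sh
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export SPARK_DIST=/usr/lib/spark
export PYTHONPATH=/usr/lib/python3.6/site-packages:$SPARK_DIST/python/lib/py4j-0.10.7-src.zip:$SPARK_DIST/python:$PYTHONPATH
EOF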
## Add to /etc/bashrc
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export SPARK_DIST=/usr/lib/spark
export PYTHONPATH=/usr/lib/python3.6/site-packages:$SPARK_DIST/python/lib/py4j-0.10.7-src.zip:$SPARK_DIST/python:$PYTHONPATH
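Finally, a minimal end-to-end check that the whole cluster picks up the new settings. This is a sketch: the script path /tmp/pyspark_smoke_test.py and the app name are arbitrary, and --deploy-mode cluster is used deliberately so the driver runs on a worker node, which exercises the Step 1 and Step 2 setup as well:
cat <<'EOF' > /tmp/pyspark_smoke_test.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-smoke-test").getOrCreate()
# count 0..99 on the executors; 100 should appear in the driver log
print(spark.range(100).count())
spark.stop()
EOF
spark-submit --master yarn --deploy-mode cluster /tmp/pyspark_smoke_test.py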