
Transformer: Configuring EMR Cluster to run PySpark transformations

  • November 26, 2021

AkshayJadhav
StreamSets Employee

These are the steps to follow to get PySpark transformations working on an EMR cluster (version 5.29).

 

Step 1: Ensure PySpark is installed on the worker and core nodes

pip-3.6 install pyspark
pip-3.6 install pandas wheel

Check they’re installed properly with:

pip-3.6 show pandas pyspark
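pip-3.6 show only confirms the package metadata exists; to confirm the interpreter can actually import the modules, a quick check (this assumes /usr/bin/python3 is the same Python 3.6 that pip-3.6 installs into):

python3 -c "import pyspark, pandas; print(pyspark.__version__, pandas.__version__)"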

Step 2: Add environment variables (all nodes)

Edit /etc/bashrc and add:

export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.6/dist-packages
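The new values only take effect in shells started after the edit; to load and spot-check them in the current session:

source /etc/bashrc
echo "$PYSPARK_PYTHON"
echo "$PYTHONPATH"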

Step 3: On the master node, edit environment variables

sudo su
## Make the Spark environment script executable
chmod +x /etc/spark/conf/spark-env.sh
## Add the below to /etc/spark/conf/spark-env.sh
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export SPARK_DIST=/usr/lib/spark
## The py4j filename below matches the Spark build shipped with EMR 5.29 (Spark 2.4.4);
## adjust it if your cluster bundles a different py4j version
export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_DIST/python/lib/py4j-0.10.7-src.zip:$SPARK_DIST/python:/usr/bin/pyspark:$PYTHONPATH
## Add the same variables to /etc/bashrc
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export SPARK_DIST=/usr/lib/spark
export PYTHONPATH=/usr/lib/python3.6/dist-packages:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip:/usr/lib/spark/python:/usr/bin/pyspark
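With all three steps done, a quick end-to-end check is to submit a trivial PySpark job from the master node. This is a minimal sketch, not part of the original steps, and /tmp/smoke_test.py is just a hypothetical file name:

## Write a one-line PySpark job and submit it on YARN
cat > /tmp/smoke_test.py <<'EOF'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-smoke-test").getOrCreate()
print(spark.range(100).count())  # should print 100
spark.stop()
EOF
spark-submit --master yarn --deploy-mode client /tmp/smoke_test.py

If the job prints 100 and exits cleanly, the Python 3 interpreter and PYTHONPATH are being picked up on both the driver and the executors.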

 

