Product: StreamSets Transformer
Issue:
The following error occurs when running a PySpark pipeline in Transformer:
PYSPARK_01 - Python Command failed with error:
Traceback (most recent call last):
  File "/disk/sdi/yarn/nm/usercache/anand.gadhiraju/appcache/application_1600456150942_29299/container_e212_1600456150942_29299_01_000001/tmp/1600796756381-0/python_code_runner.py", line 16, in <module>
    from pyspark import SparkContext, SparkConf, SQLContext
ImportError: No module named 'pyspark'
Solution:
This error occurs when the pipeline cannot locate the Python interpreter and the PySpark libraries. To resolve it, add the following properties to the pipeline cluster configuration, updating the values to match your environment:
spark.home=/opt/cloudera/parcels/CDH/lib/spark
spark.executorEnv.PYSPARK_DRIVER_PYTHON=/usr/local/python3/bin/python3
spark.executorEnv.PYSPARK_PYTHON=/usr/local/python3/bin/python3
spark.submit.pyFiles=/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-*-src.zip,/opt/cloudera/parcels/CDH/lib/
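Before saving the properties, it may help to confirm that the paths they point to actually exist on the cluster node. The following is a minimal sketch, not part of Transformer: the function name check_pyspark_paths is hypothetical, and it assumes the usual layout in which the py4j archive sits under python/lib inside the Spark home directory. The example paths are the ones shown above and should be replaced with your own.

```python
import glob
import os


def check_pyspark_paths(spark_home, python_bin):
    """Return a list of problems found; an empty list means the paths look usable.

    Hypothetical helper for illustration only. Assumes the py4j archive
    lives in <spark_home>/python/lib, as in the properties above.
    """
    problems = []

    # The PySpark sources are expected under <spark_home>/python.
    python_dir = os.path.join(spark_home, "python")
    if not os.path.isdir(python_dir):
        problems.append("no python directory under spark home: " + spark_home)

    # The py4j archive name includes a version, so match it with a wildcard.
    py4j_pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    if not glob.glob(py4j_pattern):
        problems.append("no file matches " + py4j_pattern)

    # The interpreter must exist and be executable by the current user.
    if not (os.path.isfile(python_bin) and os.access(python_bin, os.X_OK)):
        problems.append("python interpreter missing or not executable: " + python_bin)

    return problems


if __name__ == "__main__":
    # Example values taken from the properties above; adjust for your environment.
    for problem in check_pyspark_paths(
        "/opt/cloudera/parcels/CDH/lib/spark",
        "/usr/local/python3/bin/python3",
    ):
        print(problem)
```

Run the script on a node where the pipeline executes. If it prints nothing, the Spark home, the py4j archive, and the Python interpreter were all found at the given locations.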
