
ImportError: No module named 'pyspark' while running a PySpark job in Transformer

  • January 3, 2022

subashini
StreamSets Employee

Product: StreamSets Transformer

 

Issue:

The following error occurs when running a PySpark pipeline in Transformer:

PYSPARK_01 - Python Command failed with error:
Traceback (most recent call last):
  File "/disk/sdi/yarn/nm/usercache/anand.gadhiraju/appcache/application_1600456150942_29299/container_e212_1600456150942_29299_01_000001/tmp/1600796756381-0/python_code_runner.py", line 16, in <module>
    from pyspark import SparkContext, SparkConf, SQLContext
ImportError: No module named 'pyspark'


Solution:

 

This error occurs when PySpark cannot locate the Python interpreter and the PySpark libraries. To resolve it, add the following properties to the pipeline's cluster configuration, updating the values to match your environment.
 

spark.home=/opt/cloudera/parcels/CDH/lib/spark
spark.executorEnv.PYSPARK_DRIVER_PYTHON=/usr/local/python3/bin/python3
spark.executorEnv.PYSPARK_PYTHON=/usr/local/python3/bin/python3
spark.submit.pyFiles=/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-*-src.zip,/opt/cloudera/parcels/CDH/lib/
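When debugging this outside Transformer, you can reproduce and work around the same ImportError by putting Spark's bundled Python sources on the interpreter path yourself. A minimal sketch, assuming the spark.home value from the config above (adjust the path for your cluster); the py4j-*-src.zip glob mirrors the spark.submit.pyFiles entry:

```python
import glob
import os
import sys

# Example value taken from the cluster config above; change for your install.
spark_home = "/opt/cloudera/parcels/CDH/lib/spark"

# $SPARK_HOME/python holds the pyspark package itself; the py4j zip under
# python/lib supplies the JVM bridge that pyspark imports at startup.
py_paths = [os.path.join(spark_home, "python")]
py_paths += glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))

# Prepend so these copies win over anything else on the path.
for p in py_paths:
    if p not in sys.path:
        sys.path.insert(0, p)

# After this, `from pyspark import SparkContext, SparkConf, SQLContext`
# should resolve, provided spark_home actually points at a Spark install.
```

Setting spark.submit.pyFiles in the pipeline config does effectively the same thing for the driver and executor containers, so the standalone check above is only a way to confirm the paths are correct.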