Skip to main content

Transformer: How to Configuring Dataproc Cluster to run PySpark transformations

  • November 26, 2021
  • 0 replies
  • 36 views

Sami
StreamSets Employee
  • StreamSets Employee

Product:  StreamSets Transformer

 

Issue:

Steps which need to be followed to get PySpark transformations to work on DataProc cluster

 

Solution:

PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.

 

You can configure your desire python version by following the below document from Dataproc Guide

Configure the cluster's Python environment 

 

Note: The PySpark processor can use any Python 3.x version. However, StreamSets recommends installing the latest version.

 

Now you have python available to dataproc cluster, you need to change your spark configuration to point to this python.

 

 

Or Rest equivalent option for dataproc 

 

Did this topic help you find an answer to your question?

0 replies

Be the first to reply!

Reply