Transformer: How to Configuring Dataproc Cluster to run PySpark transformations

  • 26 November 2021
  • 0 replies
  • 34 views

Userlevel 3
  • StreamSets Employee
  • 0 replies

Product:  StreamSets Transformer

 

Issue:

Steps which need to be followed to get PySpark transformations to work on DataProc cluster

 

Solution:

PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.

 

You can configure your desire python version by following the below document from Dataproc Guide

Configure the cluster's Python environment 

 

Note: The PySpark processor can use any Python 3.x version. However, StreamSets recommends installing the latest version.

 

Now you have python available to dataproc cluster, you need to change your spark configuration to point to this python.

 

 

Or Rest equivalent option for dataproc 

 


0 replies

Be the first to reply!

Reply