Transformer: How to Configure a Dataproc Cluster to Run PySpark Transformations

November 26, 2021

Sami
StreamSets Employee

Product: StreamSets Transformer

Issue:

This article describes the steps required to get PySpark transformations to work on a Dataproc cluster.

Solution:

PySpark jobs on Dataproc are run by a Python interpreter on the cluster. Job code must be compatible at runtime with the Python interpreter's version and dependencies.

You can configure your desired Python version by following this document from the Dataproc guide:

Configure the cluster's Python environment

Note: The PySpark processor can use any Python 3.x version. However, StreamSets recommends installing the latest version.
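As a quick sanity check, a small script like the one below could be run with the cluster's interpreter to confirm that it is a Python 3.x release. This is only a sketch: the function name `is_supported_python` is hypothetical and not part of StreamSets or Dataproc.

```python
import sys


def is_supported_python(version_info=sys.version_info):
    """Return True when the given version tuple is a Python 3.x release."""
    # Only the major version matters here, per the note above (any 3.x works).
    return version_info[0] == 3


if __name__ == "__main__":
    # Print which interpreter is running, its version, and whether it is 3.x.
    print(sys.executable, sys.version.split()[0], is_supported_python())
```

Running it with the interpreter you intend Spark to use shows the exact path and version that the PySpark processor would see.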

Once Python is available on the Dataproc cluster, change your Spark configuration to point to that Python interpreter.

Alternatively, use the equivalent REST API option for Dataproc.
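A minimal sketch of what such a configuration could look like, assuming the Spark properties `spark.pyspark.python` and `spark.pyspark.driver.python` are set as Dataproc cluster properties (with the `spark:` file prefix), either through the `--properties` flag of `gcloud dataproc clusters create` or through `config.softwareConfig.properties` in the REST request body. The interpreter path `/opt/conda/default/bin/python` is an assumption and must be replaced with the actual location on your cluster; the helper function names are hypothetical.

```python
import json

# Assumed interpreter location; verify the real path on your cluster nodes.
PYTHON_PATH = "/opt/conda/default/bin/python"


def pyspark_python_properties(python_path):
    """Build Dataproc cluster properties that point Spark at one interpreter."""
    return {
        "spark:spark.pyspark.python": python_path,
        "spark:spark.pyspark.driver.python": python_path,
    }


def as_gcloud_flag(props):
    """Format the properties as a --properties flag value for gcloud."""
    pairs = ",".join(f"{key}={value}" for key, value in sorted(props.items()))
    return f"--properties={pairs}"


def as_rest_fragment(props):
    """Wrap the properties in the part of a REST cluster body that holds them."""
    return {"config": {"softwareConfig": {"properties": props}}}


if __name__ == "__main__":
    props = pyspark_python_properties(PYTHON_PATH)
    print(as_gcloud_flag(props))
    print(json.dumps(as_rest_fragment(props), indent=2))
```

The printed flag can be appended to a cluster creation command, and the JSON fragment can be merged into a REST cluster creation request. Setting both the executor and driver properties to the same path keeps the two sides of a PySpark job on the same interpreter.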
