In this article, we will review two basic scenarios:
- Impersonation in Transformer without Kerberos
- Impersonation in Transformer with Kerberos
Before we jump into either scenario, there are a few prerequisite steps we need to follow.
Transformer Proxy User Prerequisites
When using a Hadoop YARN cluster manager, the following directories must exist:
1. Spark node local directories
The yarn.nodemanager.local-dirs configuration parameter in the yarn-site.xml file defines one or more directories that must exist on each Spark node. The value of the configuration parameter should be available in the cluster manager user interface. By default, the property is set to ${hadoop.tmp.dir}/nm-local-dir.
The specified directories must meet all of the following requirements:
- Exist on each node of the cluster.
- Be owned by YARN.
- Have read permission granted to the Transformer proxy user.
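For example, you could verify these requirements on a node like this (a rough sketch: the directory below assumes the default value, so substitute the actual paths from your yarn-site.xml, and rishi is a placeholder proxy user):
# should show the directory as owned by the yarn user
ls -ld /tmp/hadoop-yarn/nm-local-dir
# the proxy user needs read access, so this should succeed without a permission error
sudo -u rishi ls /tmp/hadoop-yarn/nm-local-dir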
2. HDFS application resource directories
Spark stores resources for all Spark applications started by Transformer in the HDFS home directory of the Transformer proxy user. Home directories are named after the Transformer proxy user, as follows:
/user/<Transformer proxy user name>
Ensure that both of the following requirements are met:
- Each resource directory exists on HDFS.
- Each Transformer proxy user has read and write permission on their resource directory.
For example, you might use the following command to add a Transformer user, rishi, to a spark user group:
usermod -aG spark rishi
Then, you can use the following commands to create the /user/rishi directory and ensure that the spark user group has the correct permissions to access the directory:
sudo -u hdfs hdfs dfs -mkdir /user/rishi
sudo -u hdfs hdfs dfs -chown rishi:spark /user/rishi
sudo -u hdfs hdfs dfs -chmod -R 775 /user/rishi
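As a quick sanity check (not part of the official steps), you can confirm the proxy user can actually read and write the directory:
sudo -u rishi hdfs dfs -put /etc/hostname /user/rishi/perm-test
sudo -u rishi hdfs dfs -ls /user/rishi
sudo -u rishi hdfs dfs -rm /user/rishi/perm-test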
Now, let's explore each scenario.
- Impersonation without Kerberos: This one is very straightforward and doesn't require many changes. Transformer uses the Hadoop user defined in the pipeline properties: when configured, Transformer uses the specified Hadoop user to launch the Spark application and to access files in the Hadoop system.
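Conceptually, this is the same as passing the --proxy-user argument to spark-submit yourself (a simplified sketch; rishi and the jar path are placeholders):
spark-submit --master yarn --deploy-mode cluster --proxy-user rishi /path/to/your-app.jar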

- Impersonation with Kerberos
This is the scenario where it becomes a bit complex. We will take a deep dive and try to understand how this works.
Transformer gives you two options for specifying the keytab/principal:

1. Properties file:
When a pipeline uses the properties file as the keytab source, the pipeline uses the same Kerberos keytab and principal configured for Transformer in the Transformer configuration file, $TRANSFORMER_DIST/etc/transformer.properties.
For information about specifying the Kerberos keytab in the Transformer configuration file, see Enabling the Properties File as the Keytab Source.
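For illustration, the relevant entries in transformer.properties look something like the following (a sketch only; the kerberos.client.* property names and values here are assumptions based on the Data Collector convention, so confirm them against the documentation for your Transformer version):
kerberos.client.enabled=true
kerberos.client.principal=transformer/_HOST@EXAMPLE.COM
kerberos.client.keytab=/etc/security/keytabs/transformer.keytab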
2. Pipeline configuration:
When a pipeline uses the pipeline configuration as the keytab source, you define a specific Kerberos keytab file and principal to use for the pipeline. Store the keytab file on the Transformer machine. In the pipeline properties, you define the absolute path to the keytab file and the Kerberos principal to use for that keytab.
Default Behaviour:
When using a keytab, Transformer uses the Kerberos principal to launch the Spark application and to access files in the Hadoop system.
Note: Transformer ignores the Hadoop user defined in the pipeline properties.

So now we know the default behavior. But how do we run a Transformer job as the logged-in user? For example, suppose you are logged in to Transformer as the desired proxy user (i.e., rishi), and the end goal is to run the job as that user.
- Transformer Side Changes
1. Enable the hadoop.always.impersonate.current.user property in the Transformer configuration file, $TRANSFORMER_DIST/etc/transformer.properties.
2. Before pipelines can use proxy users with Kerberos authentication, you must install the required Kerberos client packages on the Transformer machine and then configure the environment variables used by the K5start program.
Tip: Spark recommends using a Kerberos principal and keytab rather than a proxy user. To require that pipelines be configured with a Kerberos principal and keytab, do not enable proxy users.
- On Linux, install the following Kerberos client packages on the Transformer machine:
  - krb5-workstation
  - krb5-client
  - K5start, also known as kstart
- To install kstart:
# Add the epel repo to yum
$ yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm
$ yum install kstart.x86_64
- Copy the keytab file that contains the credentials for the Kerberos principal to the Transformer machine.
- Add the following environment variables to the Transformer environment configuration file (see the example after this list). Modify the environment variables using the method required by your installation type.
| Environment Variable | Description |
|---|---|
| TRANSFORMER_K5START_CMD | Absolute path to the K5start program on the Transformer machine. |
| TRANSFORMER_K5START_KEYTAB | Absolute path and name of the Kerberos keytab file copied to the Transformer machine. |
| TRANSFORMER_K5START_PRINCIPAL | Kerberos principal to use. Enter a service principal. |

- Restart Transformer.
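For example, you might add something like the following to the Transformer environment configuration file (the paths and the principal below are placeholders; use the values for your environment):
export TRANSFORMER_K5START_CMD=/usr/bin/k5start
export TRANSFORMER_K5START_KEYTAB=/etc/security/keytabs/transformer.keytab
export TRANSFORMER_K5START_PRINCIPAL=transformer/node1.example.com@EXAMPLE.COM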
If you are wondering why we are doing all this, let's break it down.
Background:
The --proxy-user argument to spark-submit allows you to run a Spark job as a different user, besides the one whose keytab you have.
That said, the recommended way of submitting jobs is always to use the --principal and --keytab options. This allows Spark to both keep the tickets updated for long-running applications, and also seamlessly distribute the keytab to any nodes that happen to need it (i.e. if they are spinning up an executor).
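For reference, that recommended form looks like this (the principal, keytab path, and jar are placeholders):
spark-submit --master yarn --deploy-mode cluster \
  --principal transformer@EXAMPLE.COM \
  --keytab /etc/security/keytabs/transformer.keytab \
  /path/to/your-app.jar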
It is possible to use --proxy-user with Transformer, and the way we achieve this is via k5start.
k5start: It takes care of obtaining a ticket from the KDC (like kinit), but it also refreshes the ticket periodically based on its parameters, similar to what SecurityContext in Data Collector does. However, unlike SecurityContext, it writes to a ticket cache and keeps that cache updated as tickets expire, so that child processes (like spark-submit) can always get the latest version.
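Under the hood, a k5start invocation looks roughly like this (a sketch based on the k5start man page; the keytab path and cache location are placeholders):
# -b: run as a daemon in the background
# -f: obtain tickets from this keytab
# -U: use the first principal found in the keytab
# -K 10: wake every 10 minutes and refresh the ticket when needed
# -k: the ticket cache that child processes (like spark-submit) will read
k5start -b -f /etc/security/keytabs/transformer.keytab -U -K 10 -k /tmp/krb5cc_transformer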
Validation Test:
1. Log in to Transformer as the desired proxy user.
2. Create a test pipeline (Dev Raw Data --> Trash).
3. No need to change anything at the pipeline level now.
4. Run this job and check the corresponding jobs in YARN; it should be submitted as the proxy user now.
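A quick way to check from the command line (look for the proxy user, e.g. rishi, in the User column):
yarn application -list -appStates RUNNING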
A Few Caveats
Pipelines fail to start when the following things are all true:
- The cluster is Kerberized and uses YARN.
- Transformer has been configured to use impersonation, and the user has configured the pipeline without specifying a keytab (meaning, impersonate the currently logged-in user, provided that hadoop.always.impersonate.current.user=true).
- The pipeline is run in CLUSTER deploy mode.
Workaround: Run in CLIENT mode. (This issue has been fixed in 3.15.)
How to Troubleshoot?
- One quick way to verify is by using the proxy-user argument with cluster mode, for SparkPi (or any example job), as a sanity check.
first, run kinit to initialize the user from the keytab (the principal that would run Transformer), as shown below
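For example (the keytab path and principal are placeholders):
kinit -kt /etc/security/keytabs/transformer.keytab transformer/node1.example.com@EXAMPLE.COM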
next, for CDH 6.x:
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1G \
  --proxy-user rishi \
  --num-executors 1 \
  --conf spark.hadoop.ipc.client.connect.max.retries=1 \
  /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples_*.jar 1000
In the example command above, replace "rishi" with the actual logged-in user name you are trying to impersonate. This triggers the same flow that Transformer performs when doing impersonation.