I have the following JDBC URL in my Hive Metadata and Hive Metastore stages:
jdbc:hive2://node1.domain.com:10000/default;principal=hive/node1.domain.com@domain.com;
But when I hit Preview on the pipeline, I get the following error:
com.streamsets.pipeline.api.base.OnRecordErrorException: HIVE_23 - TBL Properties 'com.streamsets.pipeline.stage.lib.hive.exceptions.HiveStageCheckedException: HIVE_20 - Error executing SQL: DESCRIBE DATABASE `my_db`, Reason:Error while compiling statement: FAILED: SemanticException No valid privileges
User sdc does not have privileges for DESCDATABASE
The required privileges: Server=server1->Db=my_db->action=select;Server=server1->Db=my_db->action=insert;' Mismatch: Actual: {} , Expected: {}
at com.streamsets.pipeline.stage.processor.hive.HiveMetadataProcessor.process(HiveMetadataProcessor.java:595)
at com.streamsets.pipeline.api.base.RecordProcessor.process(RecordProcessor.java:52)
at com.streamsets.pipeline.configurablestage.DProcessor.process(DProcessor.java:35)
at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:246)
at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:236)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:194)
at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:263)
at com.streamsets.datacollector.runner.StagePipe.process(StagePipe.java:225)
at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.lambda$runSourceLessBatch$0(PreviewPipelineRunner.java:343)
at com.streamsets.datacollector.runner.PipeRunner.executeBatch(PipeRunner.java:132)
at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.runSourceLessBatch(PreviewPipelineRunner.java:339)
at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.runPollSource(PreviewPipelineRunner.java:321)
at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.run(PreviewPipelineRunner.java:215)
at com.streamsets.datacollector.runner.Pipeline.run(Pipeline.java:514)
at com.streamsets.datacollector.runner.preview.PreviewPipeline.run(PreviewPipeline.java:51)
at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.start(SyncPreviewer.java:207)
at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer.lambda$start$0(AsyncPreviewer.java:94)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:227)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:223)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:227)
at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:223)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Now, the above stack trace says that the user 'sdc' does not have the necessary privileges on the database I am trying to access. True. But I am logged in to the SDC UI as 'user1' and have also enabled impersonation in SDC via 'stage.conf_hadoop.always.impersonate.current.user'. So why is the client calling out to HiveServer2 as 'sdc' and not as 'user1', who has the needed privileges?
So, how do I configure a pipeline accessing a JDBC endpoint (HiveServer2 in this case) so that the user logged in to the SDC UI is the one that eventually reaches HiveServer2, rather than the SDC service account, which is the default behavior?
Answer:
I wish the answer were simple and straightforward, but I learned a couple of things myself while unraveling this can of worms. Hence this post, to try and save another soul.
Note: This is not meant to be a step-by-step guide, but a resource to give the reader a sense of direction, along with references to consult when delving into this area.
Misconception 1 - Enabling Hadoop impersonation in SDC via 'stage.conf_hadoop.always.impersonate.current.user' enables impersonation in HiveServer2 as well.
False.
Enabling this property only affects the HDFS world, not HiveServer2, because it applies only to the Hadoop components that depend on the UserGroupInformation (UGI) mechanism for impersonation. Hive follows a different route. Yet it is very common for users to assume that impersonation will be in place across the board once this property has been configured in SDC.
Most likely, this confusion arises from the following statement in the same documentation:
You can configure Data Collector to only use the currently logged in Data Collector user by enabling the stage.conf_hadoop.always.impersonate.current.user property in the Data Collector configuration file. When enabled, configuring a user within a stage is not allowed.
The above does not make evident the difference in behaviour that currently exists between the two worlds: HDFS and HiveServer2.
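For reference, here is a minimal sketch of how that property is set in the Data Collector configuration file (the comments are mine, summarizing the documented behavior):

```properties
# sdc.properties (the Data Collector configuration file)
#
# When enabled, UGI-based Hadoop stages (HDFS, etc.) always impersonate
# the currently logged-in Data Collector user, and configuring a user
# within a stage is disallowed. HiveServer2 JDBC connections are NOT
# affected by this property.
stage.conf_hadoop.always.impersonate.current.user=true
```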
So, for a use case involving trusted delegation with HiveServer2 (which is different from HiveServer2 impersonation), you need to specify the desired user in the JDBC URL. The property to use differs based on which JDBC driver you are using:
- For the bundled Hive JDBC Driver, specify the JDBC property hive.server2.proxy.user, which is covered in the docs here.
- For the Cloudera JDBC Driver, use the DelegationUID property instead, which is covered in a different doc here.
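To illustrate, here are two example URLs. The host, realm, and user names are placeholders mirroring the URL from the question, and the Cloudera driver's Kerberos settings (AuthMech, KrbRealm, KrbHostFQDN, KrbServiceName) are shown as a sketch to be checked against that driver's own documentation:

```
# Bundled Apache Hive JDBC driver: add hive.server2.proxy.user
jdbc:hive2://node1.domain.com:10000/default;principal=hive/node1.domain.com@domain.com;hive.server2.proxy.user=user1

# Cloudera JDBC driver for Hive: use DelegationUID instead
jdbc:hive2://node1.domain.com:10000/default;AuthMech=1;KrbRealm=domain.com;KrbHostFQDN=node1.domain.com;KrbServiceName=hive;DelegationUID=user1
```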
Since impersonation in Hive depends on the JDBC driver used, we do not provide the ability to impersonate the current user for Hive (Processor, Destination) in a way similar to that done for HDFS.
Nevertheless, as of this writing, there is an enhancement JIRA, SDC-8066, open to improve this part of the user experience.
Misconception 2 - Specifying the HiveServer2 principal in the JDBC URL (jdbc:...;principal=hive/node1.domain.com@domain...) when connecting to HiveServer2 will make HiveServer2 run query calls as the user 'hive'.
False.
Firstly, the principal is simply a requirement for all Hive JDBC URLs that use Kerberos. HiveServer2 still creates the session as the SDC service account user 'sdc'. Most likely, since the 'sdc' user lacks the necessary privileges, the error stack above is thrown:
User sdc does not have privileges for DESCDATABASE
Do make a note that specifying the principal in the URL has nothing to do with impersonation; this is yet another common misunderstanding. Though the principal parameter is a stated requirement for Kerberized connections, it does not affect which user the queries run as.
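To make the distinction concrete, here is a small sketch (plain Java, no Hive cluster required) that just parses a delegation-style URL. It shows that the principal and the proxy user are independent parameters: one names the HiveServer2 service to authenticate against with Kerberos, the other names the user the queries should run as. The class and method names are mine, for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HiveProxyUrl {
    // Extracts the ';key=value' session parameters from a hive2 JDBC URL,
    // skipping the leading 'jdbc:hive2://host:port/db' part.
    static Map<String, String> sessionParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = url.split(";");
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2) params.put(kv[0], kv[1]);
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "jdbc:hive2://node1.domain.com:10000/default;"
                + "principal=hive/node1.domain.com@domain.com;"
                + "hive.server2.proxy.user=user1";
        Map<String, String> p = sessionParams(url);
        // 'principal' identifies the HiveServer2 *service* for Kerberos
        // authentication; it does not choose the query user.
        System.out.println("principal  = " + p.get("principal"));
        // 'hive.server2.proxy.user' is what asks HiveServer2 to run the
        // queries as this user instead (trusted delegation).
        System.out.println("proxy user = " + p.get("hive.server2.proxy.user"));
    }
}
```

The session itself is still authenticated as the client's own Kerberos identity (the 'sdc' service account in this case); only the proxy-user parameter changes who the queries execute as, and only if HiveServer2 trusts 'sdc' to delegate.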
Also, the following thread gives a quick overview of how to configure a Hive/Impala JDBC driver for SDC.
Hope that helps.