Pipeline design consideration - Data Drift in cluster streaming mode.



Question:

We are trying to run a Hive/Impala Data Drift pipeline in cluster streaming mode. However, the pipeline stalls on new partitions and often fails on new columns when multiple jobs hit the Hive/Impala Executor simultaneously.

Should we stream all the Data Drift events to a single pipeline and de-dupe them there to manage the Hive/Impala changes?

What is the recommended way to proceed here?

Answer:

The general recommendation is not to hit the Hive Metastore from multiple pipelines, because two pipelines could try to create a table or add a column at the same time. The optimal design in this scenario is to place the Hive Metastore target in a separate pipeline and funnel all the metadata records to it via Kafka or SDC RPC.

The Hive Metastore target can de-duplicate events, and it manages that with just one query when everything goes through a single instance of the stage. This applies to cluster pipelines and to multiple standalone pipelines. For multi-threaded pipelines it is not needed, because the stage instances share a cache.
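The de-duplication idea can be illustrated with a small, self-contained Python sketch. This is not SDC code and the event shape is hypothetical; it only shows why funneling duplicate drift events (as emitted concurrently by many cluster workers) through a single consumer lets each distinct schema change be applied exactly once.

```python
# Illustrative sketch only: the event structure is hypothetical, not an
# actual SDC API. Many cluster workers can emit the same metadata (drift)
# event; a single consumer de-duplicates them so each DDL change would be
# applied to the metastore exactly once.

def dedupe_drift_events(events):
    """Return drift events with duplicates removed, preserving order."""
    seen = set()
    unique = []
    for event in events:
        key = (event["table"], event["change"], event["detail"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique

# Duplicate events, as several cluster workers might emit them concurrently.
incoming = [
    {"table": "orders", "change": "add_partition", "detail": "dt=2024-01-01"},
    {"table": "orders", "change": "add_partition", "detail": "dt=2024-01-01"},
    {"table": "orders", "change": "add_column", "detail": "discount DOUBLE"},
    {"table": "orders", "change": "add_column", "detail": "discount DOUBLE"},
]

for event in dedupe_drift_events(incoming):
    print(event["table"], event["change"], event["detail"])
```

Routing all workers' events to one consumer is what makes this possible: with multiple independent pipelines, no single point sees all the duplicates, so conflicting DDL can still race.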

Note: If you use a Kafka stage to pass records from one pipeline to the pipeline containing the Hive Metastore destination, make sure you select the SDC Record data format.

