
I am trying to write Avro format as whole file format from Kafka to Local FS, but I got an error. Below I share the pipeline; can you please suggest how to do this?

 

Hi @Praveen P 

 

I am a bit confused. You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file. What you are trying to do is read Kafka messages in Avro, so you cannot write them as whole files.

 

Try writing it as a .avro file in the Local FS destination.


But my use case is to read from Kafka and then store the data as a Parquet file. In between, I want to generate an Avro file and then convert it into a Parquet file. The problem is that the Local FS destination seems to support generating text files only, so how can I generate the .avro file format?


 

@Praveen P 

 

you can use events + executors.

 

The approach is very well documented here: https://docs.streamsets.com/portal/platform-datacollector/latest/datacollector/UserGuide/Solutions/Parquet.html


 


@saleempothiwala can you please guide me on how to configure Hadoop in a StreamSets pipeline? I installed and ran Hadoop as a single node locally on my desktop, but I don't know how to configure it in the StreamSets destination. Can you guide me through the configuration parameters?


@lakshmi_narayanan_t 

The sample pipeline uses Hadoop FS as the destination; you can use Local FS if your use case needs to write to a directory rather than Hadoop.

If you want to use Hadoop, then the first requirement would be to make sure that your SDC has access to your Hadoop file system. The other properties are pretty straightforward, as shown here: https://docs.streamsets.com/portal/platform-datacollector/latest/datacollector/UserGuide/Destinations/HadoopFS-destination.html#concept_awl_4km_zq



My use case: is it possible to go from a SQL Server origin to Kafka, writing in Avro format, then from a Kafka consumer to Local FS as whole file format, and then convert it from Avro to Parquet in whole file format with another pipeline? If that is possible...


I usually do it this way: create a Kafka consumer instance to consume messages from a Kafka topic. Set the consumer's properties, including the Kafka bootstrap servers, group ID, and other relevant configuration properties. Then use the consumer's poll() method to retrieve messages from the topic. For each message retrieved, extract the Avro payload from the Kafka message and write it to a file on the local filesystem using the whole file format. Finally, commit the offset of the last message processed to Kafka to ensure that the consumer does not process the same message again.
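If it helps, here is a rough plain-Java sketch of that loop (not a StreamSets stage). The bootstrap servers, group ID, topic name, and output directory are placeholders, and it assumes each Kafka message body is a complete Avro payload that can be written to disk as-is:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.io.FileOutputStream;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AvroToLocalFs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder connection settings -- replace with your own.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "avro-to-localfs");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        props.put("enable.auto.commit", "false"); // commit manually after the file is written

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-avro-topic")); // hypothetical topic

            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> record : records) {
                    // Write the raw Avro payload of each message to its own file
                    // under an output directory that is assumed to already exist.
                    String fileName = "/tmp/avro-out/" + record.topic() + "-"
                            + record.partition() + "-" + record.offset() + ".avro";
                    try (FileOutputStream out = new FileOutputStream(fileName)) {
                        out.write(record.value());
                    }
                }
                // Commit offsets only after the payloads are safely on disk.
                consumer.commitSync();
            }
        }
    }
}
```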


@Janeer can you share the exact configuration, or how to implement the poll() method in the Kafka consumer?

Below I share my screenshot of the Kafka configuration and the pipeline.

 


@lakshmi_narayanan_t 

 

The first couple of paragraphs of the Whole File Transformer processor documentation say this:

 

The Whole File Transformer processor transforms fully written Avro files to highly efficient, columnar Parquet files. Use the Whole File Transformer in a pipeline that reads Avro files as whole files and writes the transformed Parquet files as whole files.

Origins and destinations that support whole files include cloud storage stages such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, as well as local and remote file systems such as Local FS, SFTP/FTP/FTPS Client, and Hadoop FS. For a full list of whole file origins and destinations, or more information, see the whole file data format documentation.

 

Kafka does not support whole files. You will have to write the Kafka messages to a file and then create an event that fires when the file is finished. There you can use this transformer to convert your Avro file to Parquet.


Also, StreamSets automatically reads the next message, and offset management is handled by Kafka, so you don't have to worry about it.
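For reference, the conversion the Whole File Transformer performs on a closed Avro file is conceptually similar to this standalone sketch with the Avro and parquet-avro libraries. This is not the StreamSets implementation, just an illustration; the input and output paths are placeholders, and the avro, parquet-avro, and hadoop-client dependencies are assumed to be on the classpath:

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.File;

public class AvroFileToParquet {
    public static void main(String[] args) throws Exception {
        File avroFile = new File("/tmp/avro-out/input.avro");           // placeholder input
        Path parquetPath = new Path("/tmp/parquet-out/output.parquet"); // placeholder output

        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
            // The schema does not need to be supplied separately;
            // it is embedded in the Avro container file.
            Schema schema = reader.getSchema();

            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(parquetPath)
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build()) {
                // Copy each Avro record into the columnar Parquet file.
                for (GenericRecord record : reader) {
                    writer.write(record);
                }
            }
        }
    }
}
```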


 

 


Can you please provide a sample pipeline to achieve this use case?

My use case is: SQL Server origin to Kafka, written in Avro format; then Kafka consumer to Local FS as whole file format; and then convert it from Avro to Parquet in whole file format with another pipeline, or within the same two pipelines.


Hi Praveen P

 

Did you get a proper solution for your query? I have a similar kind of requirement.

Can you share the solution with us?


Thanks in advance. 



Hi Shirin

 

No, I have not received an answer yet.

 

