Solved

Data Transformer: AWS S3 Origin read all the files under S3 bucket recursively

  • October 18, 2021
  • 8 replies
  • 808 views

shivu2483
Fan

Hi Team,

I am trying to read files from AWS S3 using the Amazon S3 origin. It reads the files at the first level but doesn't read files recursively.

In the below image, for bucket “s3a://sotero-transformer/input/”, it reads the files userdata1.parquet and userdata2.parquet but doesn’t read the files under “dirone”.  

What configuration should I use to read all the files under the S3 bucket recursively?


Best answer by wilsonshamim

@shivu2483: What's the Transformer version? Also, can you confirm that you have parquet files in dirone as well?


8 replies

wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • October 19, 2021

Hi @shivu2483:

Can you try using Bucket URL s3a://sotero-transformer/input/* and Object Name Pattern *?
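For readers wondering why a pattern like input/* can still miss files under dirone: a single-level * glob does not descend into subdirectories, while a recursive ** pattern does. Below is a minimal, self-contained sketch using Python's stdlib glob on local temp files as hypothetical stand-ins for the S3 objects in this thread (Transformer hands the path to Spark, whose Hadoop-style glob rules are similar but not identical, so treat this only as an illustration of the general idea):

```python
# Illustrative only: local temp files standing in for the S3 layout
# described in this thread (input/userdata1.parquet, input/userdata2.parquet,
# input/dirone/userdata3.parquet). Not real S3 access.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
for rel in ("input/userdata1.parquet",
            "input/userdata2.parquet",
            "input/dirone/userdata3.parquet"):
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# "input/*" matches only the first level: the two top-level files
# (plus the dirone directory entry itself, which is not a file).
one_level = glob.glob(os.path.join(root, "input", "*"))

# "input/**/*.parquet" with recursive=True also descends into dirone.
recursive = glob.glob(os.path.join(root, "input", "**", "*.parquet"),
                      recursive=True)

print(len([p for p in one_level if os.path.isfile(p)]))  # 2
print(len(recursive))                                    # 3
```

The point of the sketch: a trailing * widens the match at one directory level only, which mirrors the behavior shivu2483 reports, where the two top-level parquet files are read but nothing under dirone is.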


shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim Thanks for looking into this issue. Unfortunately, the behavior is the same with the above suggestion: it still reads only the files in the first-level directory.


wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • Answer
  • October 20, 2021

@shivu2483: What's the Transformer version? Also, can you confirm that you have parquet files in dirone as well?



shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim Yes, dirone has the parquet files.

Transformer version is 4.1.0 (Scala 2.11), installed via the StreamSets DataOps Platform.



wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • October 20, 2021

@shivu2483: Which cluster is the pipeline configured to run on (EMR or Databricks)? Do you see any errors in the driver log?

The configuration that I shared works for me.


shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim The cluster is the local Spark that comes packaged with the StreamSets Transformer Docker image.

There are no errors in the logs. The pipeline reads just the file in the base directory and then keeps running without picking up any more parquet files. As the screenshot showed, the pipeline had been running for 4 hours and had read only 1,000 records, all from the .parquet file in the base directory. I started the pipeline with "Reset Origin and Start".



wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • October 20, 2021

@shivu2483: OK, I'm not sure how I can help further. If you have StreamSets support, please reach out to the support team.

I can confirm that the configuration I shared works well.


shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim No worries, thanks a lot for looking into the issue.

