
Hi Team,

I am trying to read files from AWS S3 using the Amazon S3 origin. It reads the files at the first level but does not read them recursively.

In the image below, for the bucket “s3a://sotero-transformer/input/”, it reads the files userdata1.parquet and userdata2.parquet but does not read the files under “dirone”.

What should the configuration be to read all the files under an S3 bucket recursively?


Hi @shivu2483:

Can you try using Bucket URL s3a://sotero-transformer/input/* and Object Name Pattern *?
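
For reference, Transformer pipelines execute as Spark jobs, so the same idea can be tried in a plain standalone Spark job. The sketch below is only illustrative, not what the origin itself runs: the bucket path comes from your post, Hadoop-style globs have no recursive ** wildcard (each directory level needs its own wildcard), and the recursiveFileLookup option requires Spark 3.0 or later.

import org.apache.spark.sql.SparkSession

object S3RecursiveReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-recursive-read-sketch")
      .getOrCreate()

    // Hadoop globs match one directory level per wildcard, so nested files
    // need an explicit pattern for each level of depth.
    val viaGlobs = spark.read.parquet(
      "s3a://sotero-transformer/input/*.parquet",
      "s3a://sotero-transformer/input/*/*.parquet")

    // On Spark 3.0+ the reader can instead recurse by itself.
    val viaRecursiveLookup = spark.read
      .option("recursiveFileLookup", "true")
      .parquet("s3a://sotero-transformer/input/")

    println(s"globs: ${viaGlobs.count()} rows, " +
      s"recursive: ${viaRecursiveLookup.count()} rows")
    spark.stop()
  }
}

If the glob version picks up the files under dirone but the origin does not, that points at the origin configuration rather than S3 access.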


@wilsonshamim Thanks for looking into this issue. Unfortunately, the behavior is the same with the above suggestion: it still reads only the first-level files.


@shivu2483: What's the Transformer version? Also, can you confirm that you have Parquet files in dirone as well?



@wilsonshamim Yes, dirone has Parquet files.

The Transformer version is 4.1.0 (Scala 2.11), installed via the StreamSets DataOps Platform.


@shivu2483: Which cluster is the pipeline configured to run on (EMR/Databricks)? Do you see any errors in the driver log?

The configuration that I shared does work for me.


@wilsonshamim The cluster is the local Spark that comes packaged with the StreamSets Transformer Docker image.

There are no errors in the logs. The pipeline just reads the files in the base directory and keeps running without picking up any more Parquet files. As you can see from the screenshot below, the pipeline has been running for 4 hours and has read only 1000 records, all from the .parquet file in the base directory. I did start the pipeline with “Reset Origin and Start”.
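
As a sanity check that the files under dirone are even visible to the s3a connector, something like the standalone listing job below could be run. This is only a minimal sketch: the bucket path comes from this thread, everything else is illustrative, and it uses the same Hadoop filesystem API that Spark's file sources rely on.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object ListS3InputSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("list-s3-input-sketch")
      .getOrCreate()

    // Bucket path taken from this thread; credentials/endpoint config omitted.
    val root = new Path("s3a://sotero-transformer/input/")
    val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // listFiles with recursive = true descends into dirone/ and any other
    // subdirectories, printing every object the s3a connector can see.
    val files = fs.listFiles(root, true)
    while (files.hasNext) {
      val f = files.next()
      println(s"${f.getPath} (${f.getLen} bytes)")
    }
    spark.stop()
  }
}

If that job prints the files under dirone, S3 access itself is fine and the problem sits in how the origin's bucket/pattern configuration is being expanded.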



@shivu2483: OK, I'm not sure how I can help further. If you have StreamSets support, please reach out to the support team.

I can confirm that the configuration I shared works well.


@wilsonshamim No worries, thanks a lot for looking into the issue.

