Solved

Data Transformer: AWS S3 Origin read all the files under S3 bucket recursively

  • October 18, 2021
  • 8 replies
  • 808 views

shivu2483
Fan

Hi Team,

I am trying to read files from AWS S3 using the Amazon S3 origin. It reads the files at the first level but doesn't read files recursively.

In the below image, for bucket “s3a://sotero-transformer/input/”, it reads the files userdata1.parquet and userdata2.parquet but doesn’t read the files under “dirone”.  

What configuration should I use to read all the files under the S3 bucket recursively?


Best answer by wilsonshamim

@shivu2483: What's the Transformer version? Also, can you confirm that you have parquet files in dirone as well?


8 replies

wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • October 19, 2021

Hi @shivu2483:

Can you try using Bucket URL s3a://sotero-transformer/input/* and Object Name Pattern *?
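For readers wondering why a pattern like input/* can still miss files under dirone: a single-level * glob does not descend into subdirectories, while a recursive ** pattern does. Below is a minimal, self-contained sketch using Python's stdlib glob on local temp files as hypothetical stand-ins for the S3 objects in this thread (Transformer hands the path to Spark, whose Hadoop-style glob rules are similar but not identical, so treat this only as an illustration of the general idea):

```python
# Illustrative only: local temp files standing in for the S3 layout
# described in this thread (input/userdata1.parquet, input/userdata2.parquet,
# input/dirone/userdata3.parquet). Not real S3 access.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
for rel in ("input/userdata1.parquet",
            "input/userdata2.parquet",
            "input/dirone/userdata3.parquet"):
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# "input/*" matches only the first level: the two top-level files
# (plus the dirone directory entry itself, which is not a file).
one_level = glob.glob(os.path.join(root, "input", "*"))

# "input/**/*.parquet" with recursive=True also descends into dirone.
recursive = glob.glob(os.path.join(root, "input", "**", "*.parquet"),
                      recursive=True)

print(len([p for p in one_level if os.path.isfile(p)]))  # 2
print(len(recursive))                                    # 3
```

The point of the sketch: a trailing * widens the match at one directory level only, which mirrors the behavior shivu2483 reports, where the two top-level parquet files are read but nothing under dirone is.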


shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim Thanks for looking into this issue. Unfortunately, the behavior is the same with the above suggestion: it still reads only the files in the first-level directory.


wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • Answer
  • October 20, 2021

@shivu2483: What's the Transformer version? Also, can you confirm that you have parquet files in dirone as well?



shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim Yes, dirone has the parquet files.

Transformer version is 4.1.0 (Scala 2.11), installed via the StreamSets DataOps Platform.



wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • October 20, 2021

@shivu2483: Which cluster is the pipeline configured to run on (EMR or Databricks)? Do you see any errors in the driver log?

The configuration that I shared works for me.


shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim The cluster is the local Spark that comes packaged with the StreamSets Transformer Docker image.

There are no errors in the logs. The pipeline reads just the file in the base directory and then keeps running without picking up any more parquet files. As the screenshot showed, the pipeline had been running for 4 hours and had read only 1,000 records, all from the .parquet file in the base directory. I started the pipeline with "Reset Origin and Start".



wilsonshamim
StreamSets Employee
  • StreamSets Employee
  • 25 replies
  • October 20, 2021

@shivu2483: OK, I'm not sure how I can help further. If you have StreamSets support, please reach out to the support team.

I can confirm that the configuration I shared works well.


shivu2483
Fan
  • Author
  • Fan
  • 4 replies
  • October 20, 2021

@wilsonshamim No worries, thanks a lot for looking into the issue.

