Unable to Read Data Using S3 Origin in Transformer

  • 6 September 2021
  • 3 replies

Hi Team,

I am facing an issue reading data through the S3 origin within Transformer. I am able to read data through the S3 origin in Data Collector.

I am trying to read data from the S3 origin and copy it to a different location using the S3 destination, with EMR as the compute engine. The job runs for several minutes on EMR and completes successfully, and there is no error in the logs (both the EMR and the StreamSets pipeline logs). I do get the warning below, but I am not sure whether it is causing the issue: java.nio.file.NoSuchFileException: /data/transformer/runInfo/testRun__9e731964-6f21-4956-99fa-82206f3451f5__149e11c1-f697-11eb-b9dc-fd846d33049d__56e36c1c-f8c6-11eb-9295-0fa62e75e081@149e11c1-f697-11eb-b9dc-fd846d33049d/run1630923519827/driver-topLevelError.log

I have verified the staging directory as well; all the required files are being populated there and are eventually read by spark-submit.

At the end, the Transformer pipeline finishes with the status START_ERROR: Job completed successfully.

This is a showstopper for now, as it seems to be a very basic issue.

I would appreciate any resolution or pointers to proceed further.


Best answer by Giuseppe Mura 7 September 2021, 17:15




Hi @ankit, are you able to preview the data? 
Also, are you able to run a very simple Dev Origin → Trash pipeline?


Hi, I was not able to preview; I was getting an UnknownHostException error. However, the actual issue has been resolved. I followed the steps below to resolve it:

  1. Opened the port between EMR/EC2 and the Docker image.
  2. Restarted the Transformer engine after verifying compatibility between the Transformer Scala version and the EMR Scala version.

These two steps resolved the issue. 
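The compatibility check in step 2 can be sketched as a quick shell comparison. The version strings below are hypothetical placeholders: on a real setup you would take the Transformer value from the build you downloaded, and the EMR value from running `spark-submit --version` on the EMR master node.

```shell
# Hedged sketch: compare the Scala binary version (major.minor) used by
# Transformer with the one EMR's Spark build reports. Only the binary
# version matters for compatibility (e.g. 2.11 vs 2.12).
transformer_scala="2.11.12"   # hypothetical: from your Transformer install
emr_scala="2.11.12"           # hypothetical: from spark-submit --version on EMR

# Strip the patch component, keeping major.minor.
transformer_bin="${transformer_scala%.*}"
emr_bin="${emr_scala%.*}"

if [ "$transformer_bin" = "$emr_bin" ]; then
  echo "Scala binary versions match: $transformer_bin"
else
  echo "Mismatch: Transformer $transformer_bin vs EMR $emr_bin" >&2
fi
```

If the binary versions differ, the Spark jobs Transformer submits will fail or behave unpredictably, so the versions should be aligned before starting the engine.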

However, I am still unable to see output in preview; it just shows me a blank screen, even though data has been copied from the origin path to the destination path.


Hi @Ankit, what you need to check is that the Spark cluster is able to “talk” back to your Transformer engine. In your pipeline, edit the Cluster Callback URL property to use your EC2 instance’s hostname.
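For illustration, a hypothetical value might look like the following (the hostname here is made up; replace it with your EC2 instance’s public DNS name, and note that Transformer’s default HTTP port is 19630, which must also be published from the Docker container, e.g. with `docker run -p 19630:19630 ...`):

```
Cluster Callback URL: http://ec2-3-90-12-34.compute-1.amazonaws.com:19630
```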

If you have a tarball install this is not really required, as Transformer automatically takes the hostname from the machine; but since you are running in Docker, Transformer ends up using your container name, which is not reachable from your EMR cluster.

You can also find the documentation for this here: