
Hi,

I’m using the Hadoop FS Standalone origin in the pipeline, with read order set to last-modified timestamp.
The files directory is: /user/msc/*    File name pattern is: *

Under msc there are multiple folders, and the origin reads all the files present in them; some functions are run on those files, and the files are then moved to other locations. The pipeline works fine, but sometimes I get an error like SPOOLDIR_01 - failed to process file, even though the file was read and processed.

I’m also getting an error like: Running error: SPOOLDIR_35 - spool directory runner failed, reason java.io.FileNotFoundException: file does not exist. After this the pipeline restarts itself.


Please help me out if anyone knows the reason.


Thanks,
Madhusudan

@msc “some functions are run on those files, and the files are then moved to other locations” - is this happening outside of the StreamSets pipeline? The issue could be due to files being moved by an external process while they are also queued by SDC for processing.


@Sanjeev In the pipeline itself; I have used a shell block that calls some scripts, and those scripts move the files out of that folder.


@msc Are you using the shell script to archive the files after processing? I’m asking because you can configure that from within the pipeline on the ‘Post Processing’ tab, and it would be a much cleaner solution.


@Sanjeev Yes, I’m using shell scripts in the shell block. Those scripts read the files and perform operations such as moving them, among others.


@msc If you are using the shell action to move/archive the files after the Hadoop FS origin reads them, then it’s better to use the archive options available on the ‘Post Processing’ tab for the Hadoop FS origin. That eliminates the possibility of moving a file before it is completely processed by the origin, which is quite possibly the issue you are running into.
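To illustrate the race I mean, here is a minimal local-filesystem sketch (the paths are hypothetical; HDFS fails the same way through its own client when a spooled file disappears between listing and reading):

```python
import os, shutil, tempfile

# Minimal sketch of the race, using the local filesystem for simplicity.
src = tempfile.mkdtemp()   # stands in for a folder under /user/msc
dst = tempfile.mkdtemp()   # stands in for the shell script's target dir

path = os.path.join(src, "data.txt")
with open(path, "w") as f:
    f.write("hello\n")

queued = os.listdir(src)                # 1. origin lists and queues the file
shutil.move(path, dst)                  # 2. external move happens meanwhile
try:
    open(os.path.join(src, queued[0]))  # 3. origin tries to read it
except FileNotFoundError as e:
    print("same failure mode as SPOOLDIR_35:", e)
```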


@Sanjeev Can you give more info on Post Processing, like what the use case is?


@msc Please refer to step #4 in our docs. Also, perhaps sharing your pipeline will help me understand what you are trying to accomplish.


@Sanjeev If I use Post Processing, the files will be moved or deleted, but I don’t need that to happen.
The pipeline searches HDFS for files; using the filename and path from the Hadoop origin as input to the shell, I’m moving the files from the shell script.


@msc I was suggesting the ‘Post Processing’ option only if you were using the shell script for that. If you don’t want to move/archive the files after processing, then that’s the default behavior. Again, it’s difficult to advise further without more details on the use case.
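If you do need to keep the shell-based move, one possible mitigation - just a sketch under assumptions, since the paths and the 10-minute grace period here are made up, and this only narrows the race rather than eliminating it the way ‘Post Processing’ does - is to skip files that were modified recently, so the origin has had a chance to finish reading them:

```python
import subprocess, time

GRACE_MS = 10 * 60 * 1000  # assumed 10-minute grace period

def safe_move(src_path: str, dest_dir: str) -> None:
    # `hdfs dfs -stat %Y` prints the modification time in ms since epoch
    out = subprocess.check_output(
        ["hdfs", "dfs", "-stat", "%Y", src_path], text=True
    )
    if int(time.time() * 1000) - int(out.strip()) < GRACE_MS:
        print(f"skipping {src_path}: modified too recently")
        return
    subprocess.check_call(["hdfs", "dfs", "-mv", src_path, dest_dir])

# hypothetical paths, for illustration only
safe_move("/user/msc/folder1/file.txt", "/user/msc/processed/")
```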

