import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Extract the text of the PDF at filePath; returns null if the file cannot be read
def readPdfText(String filePath) {
    PDDocument document = null
    try {
        document = PDDocument.load(new File(filePath))
        PDFTextStripper stripper = new PDFTextStripper()
        return stripper.getText(document)
    } catch (e) {
        sdc.log.error(e.toString(), e)
        return null
    } finally {
        // Close the document even if text extraction fails
        document?.close()
    }
}

records = sdc.records
for (record in records) {
    try {
        def filePath = record.value['/fileInfo/file']
        //def filePath = '/tmp/pdf-sample.pdf'
        def pdfText = readPdfText(filePath)
        record.value['pdfText'] = pdfText

        // Write a record to the processor output
        sdc.output.write(record)
    } catch (e) {
        // Write a record to the error pipeline
        sdc.log.error(e.toString(), e)
        sdc.error.write(record, e.toString())
    }
}
@Sanjeev
Thanks for the solution, but I’m getting this error when I run this script.
Can you please help me out?
@lex03 please make sure that you have installed the required jars for these libraries. You'll also need to add permissions.
The following jars should work. It would also be good to run the code outside of StreamSets to see if it works standalone.
Hi @Sanjeev
I have created the pipeline and added the exact same jars that you mentioned to the StreamSets external libraries. I also provided the required permissions to access the file. But the pipeline isn't working.
This is the error that I’m getting:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.text.PDFTextStripper
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
I also tried running this code outside StreamSets with the exact same jar files, and it works fine there.
Can you please suggest what the issue might be?
Thanks
@lex03 in my case the problem was due to incorrect permissions. Here's what it looks like for me:
// Groovy will parse files in a different context, so we need to grant it additional privileges
grant codebase "file:/groovy/script" {
  permission java.lang.RuntimePermission "getClassLoader";
  permission java.util.PropertyPermission "*", "read";
  permission java.io.FilePermission "/tmp/-", "read";
  permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/externalResources/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
  permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
};
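A note on the path syntax in the `FilePermission` entries: a trailing `/-` covers every file in the directory and, recursively, in its subdirectories, while a trailing `/*` covers only files directly inside the directory. The following is a minimal Java sketch of that rule using only the standard library. The class name `PolicyPathDemo`, the helper `covers`, and the path `/tmp/reports/2023/a.pdf` are made up for illustration; nothing here is StreamSets-specific.

```java
import java.io.FilePermission;

public class PolicyPathDemo {
    // Returns true if a "read" grant on grantedPath would cover a read of targetFile.
    public static boolean covers(String grantedPath, String targetFile) {
        return new FilePermission(grantedPath, "read")
                .implies(new FilePermission(targetFile, "read"));
    }

    public static void main(String[] args) {
        // "/-" matches files in the directory and in all nested subdirectories
        System.out.println(covers("/tmp/-", "/tmp/pdf-sample.pdf"));      // true
        System.out.println(covers("/tmp/-", "/tmp/reports/2023/a.pdf"));  // true
        // "/*" matches only files directly inside the directory
        System.out.println(covers("/tmp/*", "/tmp/reports/2023/a.pdf"));  // false
    }
}
```

So if the PDFs or jars sit in nested folders, a grant ending in `/*` would probably not be enough, which is one reason the entries above end in `/-`.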
You can also test by opening up the permissions to ALL:
grant codebase "file:/groovy/script" {
  permission java.security.AllPermission;
};
Once the correct permissions were added to the security policy file, the pipeline worked as expected.
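Some background on why a permissions problem can show up as `NoClassDefFoundError` even though the jar is installed: the JVM reports "Could not initialize class ..." when a class was found but its static initialization already failed on an earlier attempt. The Java sketch below reproduces that pattern with a made-up class (`Fragile` and `InitFailureDemo` are hypothetical names, and no PDFBox or StreamSets code is involved). Whether a denied permission during static initialization is what happened with `PDFTextStripper` here is an assumption, though it is consistent with the fix above.

```java
public class InitFailureDemo {
    // Hypothetical class whose static initializer always fails,
    // standing in for a library class that cannot finish initializing.
    static class Fragile {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("simulated failure during static init");
        }
    }

    // Tries to use Fragile and returns the simple name of whatever is thrown.
    public static String attempt() {
        try {
            return "ok: " + Fragile.VALUE;
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // First use: the static initializer runs and fails
        System.out.println(attempt());
        // Later uses: the JVM has marked the class as unusable
        System.out.println(attempt());
    }
}
```

Run on its own, this prints `ExceptionInInitializerError` followed by `NoClassDefFoundError`. In practice that means the first failure in the log, not the repeated "Could not initialize class" line, is the one most likely to name the root cause.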
@Sanjeev
Thank you for the fix.
It worked for me.
This is the permission that had to be set on my side.