Solved

Reading a pdf

  • 9 July 2023
  • 6 replies
  • 161 views

I have a PDF on my local system and I want to read the text from it.

I used Directory as the origin, selected Whole File as the data format, and tried to read the file with a Jython processor, but it failed.

Can someone help me create a pipeline to read the text from the PDF?


Best answer by Sanjeev 19 July 2023, 04:28



@lex03 This should be possible using the Apache PDFBox library (https://pdfbox.apache.org/) with the Groovy Evaluator.

A simple pipeline would look like below:

Example Groovy code:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Extract all text from the PDF at filePath; returns null if reading fails
def readPdfText(String filePath) {
    PDDocument document = null
    try {
        document = PDDocument.load(new File(filePath))
        PDFTextStripper stripper = new PDFTextStripper()
        return stripper.getText(document)
    } catch (e) {
        sdc.log.error(e.toString(), e)
        return null
    } finally {
        // Close the document even if text extraction throws
        if (document != null) {
            document.close()
        }
    }
}

records = sdc.records
for (record in records) {
    try {
        def filePath = record.value['/fileInfo/file']
        //def filePath = '/tmp/pdf-sample.pdf'
        def pdfText = readPdfText(filePath)
        record.value['pdfText'] = pdfText
        // Write a record to the processor output
        sdc.output.write(record)
    } catch (e) {
        // Write a record to the error pipeline
        sdc.log.error(e.toString(), e)
        sdc.error.write(record, e.toString())
    }
}

 

@Sanjeev 

Thanks for the solution, but I’m getting this error when I run this script.

Can you please help me out?


@lex03 Please make sure that you have installed the required jars for this library. You’ll also need to add permissions to the security policy file.

The following jars should work. It would also be good to run the code outside of StreamSets to see whether it works standalone.
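As a quick standalone check along those lines, a small Java program can report whether a class is visible on a given classpath before any PDF is read. This is only a sketch: `ClassCheck` and `check` are hypothetical names introduced here, and the default class name is taken from the Groovy imports above.

```java
public class ClassCheck {
    // Reports whether the named class can be found and initialized
    // on the current classpath. Hypothetical helper for illustration.
    public static String check(String className) {
        try {
            // Class.forName(String) both loads and initializes the class
            Class.forName(className);
            return "loaded";
        } catch (ClassNotFoundException e) {
            // No jar on the classpath provides this class
            return "missing";
        } catch (LinkageError e) {
            // The class was found, but linking or static initialization failed
            return "found but failed to link or initialize: " + e;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.pdfbox.text.PDFTextStripper";
        System.out.println(name + ": " + check(name));
    }
}
```

Run it with the same jars on the classpath that were added to StreamSets. A result of "missing" points at the jars, while a linkage error points at something that happens while the class is being initialized.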

 

Hi @Sanjeev 

I have created the pipeline and added exactly the jars you mentioned to the StreamSets external libraries. I also granted the permissions required to access the file, but the pipeline still isn't working.

This is the error that I’m getting:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.text.PDFTextStripper
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

I also tried running this code outside StreamSets with the exact same jar files, and it works fine there.

Can you suggest what the issue might be?

Thanks
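For context on the error above: a `NoClassDefFoundError` whose message starts with "Could not initialize class" generally means the JVM did find the class, but an earlier attempt to run its static initializer failed. That is a different situation from a missing jar. The sketch below reproduces the mechanism with a hypothetical `Fragile` class. It is a stand-in, not PDFBox code, and it does not show what actually failed in this pipeline.

```java
public class InitFailureDemo {
    // Hypothetical stand-in for a library class whose static initializer fails
    static class Fragile {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("simulated failure during static initialization");
        }

        static int value() {
            return VALUE;
        }
    }

    // Uses Fragile and returns the simple name of the Throwable that results
    public static String touch() {
        try {
            Fragile.value();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The first use runs the static initializer, which throws
        System.out.println("first use:  " + touch());
        // Later uses find the class already marked as failed
        System.out.println("second use: " + touch());
    }
}
```

The first use reports `ExceptionInInitializerError`, which carries the real cause. Every later use reports `NoClassDefFoundError`, so the earliest error in the log is usually the more informative one.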

 


@lex03 In my case the problem was due to incorrect permissions. Here’s what the policy looks like for me:

// Groovy will parse files in a different context, so we need to grant it additional privileges
grant codebase "file:/groovy/script" {
    permission java.lang.RuntimePermission "getClassLoader";
    permission java.util.PropertyPermission "*", "read";
    permission java.io.FilePermission "/tmp/-", "read";
    permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/externalResources/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
    permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
};

You can also test by opening up the permissions to all:

grant codebase "file:/groovy/script" {
    permission java.security.AllPermission;
};

Once the correct permissions were added to the security policy file, the pipeline worked as expected.

@Sanjeev 

Thank you for the fix.

It worked for me.  

This is the permission that had to be set on my side.

permission java.security.AllPermission;
