Solved

Reading a PDF


lex03
Discovered Fame
  • Discovered Fame
  • 10 replies

I have a PDF on my local system and I want to read the text from it.

I set up a Directory origin with the Whole File data format and used a Jython Evaluator processor to read the file, but it failed.

Can someone help me create a pipeline to read the text from the PDF?


6 replies

Sanjeev
StreamSets Employee
  • StreamSets Employee
  • 53 replies
  • July 11, 2023

@lex03 This should be possible using the Apache PDFBox library (https://pdfbox.apache.org/) with the Groovy Evaluator.

A simple pipeline would look like below:

Example Groovy code:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Extract all text from the PDF at filePath; returns null if extraction fails
def readPdfText(String filePath) {
    PDDocument document = null
    try {
        // PDDocument.load(File) is the PDFBox 2.x API
        document = PDDocument.load(new File(filePath))
        return new PDFTextStripper().getText(document)
    } catch (e) {
        sdc.log.error(e.toString(), e)
        return null
    } finally {
        // Always release the document, even when extraction throws
        document?.close()
    }
}

records = sdc.records
for (record in records) {
    try {
        def filePath = record.value['/fileInfo/file']
        //def filePath = '/tmp/pdf-sample.pdf'
        record.value['pdfText'] = readPdfText(filePath)
        // Write a record to the processor output
        sdc.output.write(record)
    } catch (e) {
        // Write a record to the error pipeline
        sdc.log.error(e.toString(), e)
        sdc.error.write(record, e.toString())
    }
}

 


lex03
Discovered Fame
  • Author
  • Discovered Fame
  • 10 replies
  • July 11, 2023

@Sanjeev 

Thanks for the solution, but I’m getting this error when I run this script.

Can you please help me out?


Sanjeev
StreamSets Employee
  • StreamSets Employee
  • 53 replies
  • July 11, 2023

@lex03 please make sure that you have installed the required JARs for this library. You'll also need to add permissions.

The following JARs should work. It would also be good to run the code outside of StreamSets to check that it works standalone.

 


lex03
Discovered Fame
  • Author
  • Discovered Fame
  • 10 replies
  • July 13, 2023

Hi @Sanjeev 

I have created the pipeline and added exactly the same JARs you mentioned to the StreamSets external libraries. I also provided the required permissions to access the file. But the pipeline isn't working.

This is the error that I’m getting:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.text.PDFTextStripper
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

I also tried running this code outside StreamSets with exactly the same JAR files, and it works fine there.

Can you please suggest what the issue might be?

Thanks

 


Sanjeev
StreamSets Employee
  • StreamSets Employee
  • 53 replies
  • Answer
  • July 19, 2023

@lex03 in my case the problem was due to incorrect permissions. Here’s what it looks like for me:

// Groovy will parse files in a different context, so we need to grant it additional privileges
grant codebase "file:/groovy/script" {
  permission java.lang.RuntimePermission "getClassLoader";
  permission java.util.PropertyPermission "*", "read";
  permission java.io.FilePermission "/tmp/-", "read";
  permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/externalResources/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
  permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
};

You can also test by opening up the permissions to ALL:

grant codebase "file:/groovy/script" {
  permission java.security.AllPermission;
};

Once the correct permissions were added to the security policy file, the pipeline worked as expected.
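If it is unclear which permission is missing, one option is to force the failing class to initialize and look at the deepest cause of the error, since a denied permission is usually named there. The snippet below is only a sketch: probeClass is a hypothetical helper name, not part of StreamSets or PDFBox, and it assumes the script is allowed to obtain its own class loader (the "getClassLoader" runtime permission shown above). On older JVMs the underlying cause is generally attached only the first time initialization is attempted in a given class loader; later attempts just report "Could not initialize class", so it is best tried right after a restart.

```groovy
// Hypothetical helper (not part of StreamSets or PDFBox): force a class to
// initialize and describe the underlying failure, if there is one.
def probeClass(String className) {
    try {
        // initialize=true runs the class's static initializers
        Class.forName(className, true, getClass().classLoader)
        return null // class loaded and initialized without error
    } catch (Throwable t) {
        // Walk down to the deepest cause, which is usually the most specific error
        Throwable root = t
        while (root.cause != null) {
            root = root.cause
        }
        return "${root.getClass().name}: ${root.message}".toString()
    }
}
```

Inside a Groovy Evaluator, the result of probeClass('org.apache.pdfbox.text.PDFTextStripper') could be written to the log with sdc.log.error(...). A null result means the class initialized cleanly.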


lex03
Discovered Fame
  • Author
  • Discovered Fame
  • 10 replies
  • July 19, 2023

@Sanjeev 

Thank you for the fix.

It worked for me.  

This is the permission that had to be set on my side.

permission java.security.AllPermission;

