Solved

Reading a pdf

1 year ago
July 9, 2023
6 replies
213 views

lex03
Discovered Fame
10 replies

I have a PDF on my local system and I want to read the text from it.

I used Directory as the origin, set the data format to Whole File, and added a Jython processor to read the file, but it failed.

Can someone help me create a pipeline that reads the text from the PDF?


Sanjeev
StreamSets Employee
53 replies
1 year ago
July 11, 2023

@lex03 This should be possible using the Apache PDFBox library (https://pdfbox.apache.org/) with the Groovy Evaluator.

A simple pipeline would look like the one below:

Example Groovy code:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Extract the text of the PDF at filePath; returns null if extraction fails.
// PDDocument.load(File) is the PDFBox 2.x API.
def readPdfText(String filePath) {
    PDDocument document = null
    try {
        document = PDDocument.load(new File(filePath))
        return new PDFTextStripper().getText(document)
    } catch (e) {
        sdc.log.error(e.toString(), e)
        return null
    } finally {
        // Always release the document, even when extraction throws
        document?.close()
    }
}

records = sdc.records
for (record in records) {
    try {
        // Path of the file, taken from the whole-file record
        def filePath = record.value['/fileInfo/file']
        //def filePath = '/tmp/pdf-sample.pdf'
        record.value['pdfText'] = readPdfText(filePath)
        // Write a record to the processor output
        sdc.output.write(record)
    } catch (e) {
        // Write a record to the error pipeline
        sdc.log.error(e.toString(), e)
        sdc.error.write(record, e.toString())
    }
}

lex03
Author
Discovered Fame
10 replies
1 year ago
July 11, 2023

@Sanjeev

Thanks for the solution, but I’m getting this error when I run this script.

Can you please help me out?

Sanjeev
StreamSets Employee
53 replies
1 year ago
July 11, 2023

@lex03 please make sure that you have installed the required jars for this library. You’ll also need to add permissions.

The following jars should work. It would also be good to run the code outside of StreamSets to see whether it works standalone.
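One way to tell a missing jar apart from a permissions problem is to check whether a class can be located at all before it is used. The sketch below is plain Java using only `Class.forName`; the `ClasspathProbe` class and its `isLoadable` method are hypothetical names, not part of StreamSets or PDFBox. It loads the class without initializing it, so a `false` result points to a jar the class loader cannot see rather than to a failure in the class's own startup code.

```java
// Hypothetical helper for illustration; not part of StreamSets or PDFBox.
class ClasspathProbe {

    // Returns true if the named class can be located by this class's loader.
    // The class is loaded but NOT initialized (second argument is false),
    // so static initializers in the target class do not run here.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class itself is not visible.
            // LinkageError: the class was found but could not be loaded or linked.
            return false;
        }
    }
}
```

For example, `ClasspathProbe.isLoadable("org.apache.pdfbox.text.PDFTextStripper")` should return true only where the PDFBox jar is visible to the loader doing the lookup, so the result can differ between a standalone run and a run inside Data Collector.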

lex03
Author
Discovered Fame
10 replies
1 year ago
July 13, 2023

Hi @Sanjeev

I created the pipeline and added exactly the jars you mentioned as StreamSets external libraries. I also granted the required permissions to access the file, but the pipeline still isn't working.

This is the error that I’m getting:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.text.PDFTextStripper
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

I also tried running this code outside StreamSets with exactly the same jar files, and it works fine there.

Can you please suggest what the issue might be?

Thanks

Sanjeev
StreamSets Employee
53 replies
Answer
1 year ago
July 19, 2023

@lex03 in my case the problem was due to incorrect permissions. Here’s what the relevant section of my security policy file looks like:

// Groovy will parse files in a different context, so we need to grant it additional privileges
grant codebase "file:/groovy/script" {
  permission java.lang.RuntimePermission "getClassLoader";
  permission java.util.PropertyPermission "*", "read";
  permission java.io.FilePermission "/tmp/-", "read";
  permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/externalResources/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
  permission java.io.FilePermission "/opt/streamsets-datacollector-5.0.0/streamsets-libs-extras/streamsets-datacollector-groovy_2_4-lib/lib/-", "read";
};
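The trailing `/-` in the `FilePermission` paths above means "everything under this directory, recursively", whereas `/*` would cover only the directory's direct children. A minimal Java sketch of that difference, using `java.io.FilePermission.implies`; the `PolicyPathDemo` class name and the sample file paths are made up for illustration:

```java
// Hypothetical demo class; the paths passed to it are illustrative only.
class PolicyPathDemo {

    // True if a "read" grant on grantedPath covers reading targetFile.
    static boolean coversRead(String grantedPath, String targetFile) {
        java.io.FilePermission granted = new java.io.FilePermission(grantedPath, "read");
        java.io.FilePermission wanted = new java.io.FilePermission(targetFile, "read");
        return granted.implies(wanted);
    }
}
```

With this helper, `coversRead("/tmp/-", "/tmp/reports/2023/a.pdf")` should hold while `coversRead("/tmp/*", "/tmp/reports/2023/a.pdf")` should not, which is why a PDF in a nested directory needs the `/-` form.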

You can also test by granting all permissions:

grant codebase "file:/groovy/script" {
  permission java.security.AllPermission;
};

Once the correct permissions were added to the security policy file, the pipeline worked as expected.
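A possible explanation for why a permissions problem can show up as `NoClassDefFoundError: Could not initialize class ...` rather than as an access error: when a class's static initialization throws, the JVM reports `ExceptionInInitializerError` the first time and `NoClassDefFoundError` on every later use of that class. Whether this is exactly what happened with `PDFTextStripper` in this thread is an assumption. The Java sketch below only reproduces the general JVM behaviour with a made-up class (`NeedsPrivilege`) that simulates a denied permission.

```java
// Hypothetical demo; NeedsPrivilege stands in for any class whose static
// initialization fails, for example because a permission was denied.
class InitFailureDemo {

    static class NeedsPrivilege {
        // Evaluated once, when the class is first initialized.
        static final String SETTING = load();

        private static String load() {
            // Simulated denial; a real security manager would throw a SecurityException subclass.
            throw new SecurityException("access denied (simulated)");
        }
    }

    // Touches the class and returns the name of whatever Throwable that triggers.
    static String attempt() {
        try {
            return "ok: " + NeedsPrivilege.SETTING;
        } catch (Throwable t) {
            return t.getClass().getName();
        }
    }
}
```

Calling `attempt()` twice should yield `java.lang.ExceptionInInitializerError` and then `java.lang.NoClassDefFoundError`, so the earliest error in the log is usually the more informative one.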

lex03
Author
Discovered Fame
10 replies
1 year ago
July 19, 2023

@Sanjeev

Thank you for the fix.

It worked for me.

This is the permission that had to be set on my side.

permission java.security.AllPermission;
