Skip to main content

How to format Excel files before writing to endpoint?

  • December 22, 2021
  • 0 replies
  • 528 views

AkshayJadhav
StreamSets Employee
Forum|alt.badge.img

Product: All SDC versions that support Data Format Excel

Scenario:

We have an excel sheet that is an input as Whole file to Directory / HDFS origin. We want to run some basic formatting on all the worksheets and eliminate some unwanted data before writing to the endpoint.

Goal:

Format Excel files before loading to endpoint

Data Format : Excel

i. Eliminate the first 2 rows (completely ignore) and process from the 3rd row
ii. Pick column values A,B,C and M for first row and D,E,F,I,J,K,L columns from row 2 onwards.

The above is a screenshot of one of the worksheets in the SampleExcel2.xlsx file attached to this article

What we want to do here is to skip rows 1,2 and columns A,B,C in all worksheets and write the remaining columns and rows to the endpoint

Solution:

Please refer to attachments to this KB which has a sample input sheet, screenshots of sample output, and 2 example pipeline JSON files to handle this.

 

There are 2 ways of handling this. One is using the groovy script and the other is using an EL evaluator. I have uploaded the sample pipeline for both methods.

 

+    We integrated a groovy script and saw it was working for a single sheet.
+    Here is a code change you need to make to your current groovy evaluator to make it work for

multiple sheets

 Groovy Evaluator ->> Groovy

 Init Script :

sdc.state['currentFile'] = ""
 sdc.state['worksheet'] = ""
 sdc.state['lineNum'] = 0

Script :

// Sample Groovy code
 records = sdc.records
 for (record in records) {
    try {

      // initialize, so we work correctly on the first file.
      // do this for the first record of the first file only.
      if(sdc.state['currentFile'].equals("")) {
        sdc.state['currentFile'] = record.attributes['file'];
      }
      if(sdc.state['worksheet'].equals("")) {
        sdc.state['worksheet'] = record.attributes['worksheet'];
      }

      if(sdc.state['currentFile'].equals(record.attributes['file'])) {
        if (sdc.state['worksheet'].equals(record.attributes['worksheet']))
        {
if(sdc.state['lineNum'] < 2) {  // number of records to skip.  <<<<<<<<<<<<< THIS WILL CHANGE BASED ON NUMBER OF LINES YOU WANT TO SKIP
          ++sdc.state['lineNum'];
            sdc.log.info("CASE 13303: currentFile {} worksheet {} lineNum {}  " ,sdc.state['currentFile'],
                        sdc.state['worksheet'], sdc.state['lineNum']);
          continue;  // skip this record

          }
        } else {
          sdc.state['worksheet'] = record.attributes['worksheet'];
          sdc.state['lineNum'] = 1;
            continue;  // skip this record
        }
      } else {
        sdc.log.info("CASE 13303: {} set to {}", sdc.state['currentFile'], record.attributes['file']);
        sdc.state['currentFile'] = record.attributes['file'];
        sdc.state['worksheet'] = record.attributes['worksheet'];
        sdc.state['lineNum'] = 0;
      }
      // Write a record to the processor output
      sdc.output.write(record);
    } catch (e) {
        // Write a record to the error pipeline
        sdc.log.error(e.toString(), e)
        sdc.error.write(record, e.toString())
    }
 }

 

Did this topic help you find an answer to your question?
This topic has been closed for comments