
How to handle multi delimited records?

  • December 20, 2021

AkshayJadhav
StreamSets Employee

Scenario:

Each data record uses a sequence of several characters (for example, |~|) as a single delimiter. This makes it difficult to parse those records directly by specifying the multi-character delimiter in the stage configuration.

 

Goal:

To parse records that use a multi-character delimiter with SDC stages, without having to cleanse the data before it flows through the pipeline.

 

Solution:

The following sample pipeline should help to achieve that end.

Directory Origin => Expression Evaluator => Data Parser => Local FS

In the Expression Evaluator stage, you'll need to apply the following Field Expression:

${str:replaceAll(record:value('/text'), "\\|\\~\\|", "^")}

The assumption here is that there is a character that never shows up in the incoming data, so it can be used as the alternate separator set in the Expression Evaluator. Here, ^ is used for the purpose of the example.


So, the above pipeline reads from a Directory origin using the Text data format, the Expression Evaluator replaces all occurrences of |~| with ^, and the Data Parser then parses the result as delimited data using ^ as the delimiter.
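
To make the two steps concrete outside of SDC, here is a minimal Python sketch of the same idea: replace the multi-character delimiter with a single placeholder, then split on that placeholder. It is only an illustration and not how Data Collector implements these stages. The function names and the sample record are hypothetical and not taken from the pipeline.

```python
import re

MULTI_DELIMITER = "|~|"  # multi-character delimiter from the example above
PLACEHOLDER = "^"        # single-character stand-in used in the example above


def replace_multi_delimiter(text, placeholder=PLACEHOLDER):
    """Mimic the Expression Evaluator step: swap every |~| for the placeholder."""
    # re.escape makes the pipes match literally instead of acting as regex "or".
    return re.sub(re.escape(MULTI_DELIMITER), placeholder, text)


def parse_delimited(text, delimiter=PLACEHOLDER):
    """Mimic the Data Parser step: split on the single-character delimiter."""
    return text.split(delimiter)


# Hypothetical sample record, invented for illustration.
line = "1001|~|Alice|~|NY"
fields = parse_delimited(replace_multi_delimiter(line))
print(fields)  # ['1001', 'Alice', 'NY']
```

As in the Field Expression, the pipe characters have to be escaped because | is a regular expression metacharacter.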

To reduce the chance of choosing a character that already appears in the data or is commonly used as a delimiter, one could use a rare Unicode character such as \u2603 as the replacement for the multi-character delimiter shown above, i.e., use \u2603 instead of ^.
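
The effect of choosing a rare placeholder can be sketched in Python as follows. The guard that rejects records already containing the placeholder is an extra precaution added for this illustration and is not part of the pipeline described above. The function name and sample values are hypothetical.

```python
MULTI_DELIMITER = "|~|"  # multi-character delimiter from the example above
SNOWMAN = "\u2603"       # rare Unicode character suggested above


def split_with_placeholder(text, placeholder=SNOWMAN):
    """Replace |~| with a rare placeholder, then split on that placeholder."""
    # Assumption for this sketch: fail loudly if the placeholder is already
    # present, since the record would otherwise be split in the wrong places.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the record")
    return text.replace(MULTI_DELIMITER, placeholder).split(placeholder)


# A caret inside a field no longer causes a wrong split, because ^ is not
# the placeholder anymore (hypothetical sample record).
print(split_with_placeholder("a|~|b^c|~|d"))  # ['a', 'b^c', 'd']
```

With ^ as the placeholder, the same record would have been split into four fields instead of three, which is the failure the rarer character is meant to avoid.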
