
Swayam's Micro-blogging: Convert CSV to JSON

  • 4 September 2021
  • 2 replies
  • 570 views

Dear Community Users,

I’ve been using StreamSets OSS for a few months now, exploring different ways to use it in my organization for various ETL and reverse ETL use cases. During this journey we have come up with many small and medium exercises, and I believe sharing them with the community may help others save some time. This is the very first item in my micro-blogging series, in which I’m sharing how we convert CSV records into JSON.

 

Problem Statement:

Convert records coming in a CSV file to JSON before publishing them to a target destination.

Input Data Format:

NAME|AGE|DEPARTMENT|ADDRESS
"John"|"23"|"HR"|"31, 2nd Cross, 4th Street, Wilson-543234"

Output Data Format:

{
"data": {
"NAME": "John",
"AGE": "23",
"DEPARTMENT": "HR",
"ADDRESS": "31, 2nd Cross, 4th Street, Wilson-543234"
}
}

 

  • Origin: A Dev Data Generator that produces records like the ones below. We have set the data format to “Delimited” with the custom delimiter set to pipe (|).
NAME|AGE|DEPARTMENT|ADDRESS
"John"|"23"|"HR"|"31, 2nd Cross, 4th Street, Wilson-543234"
"Amanda"|"26"|"HR"|"24, 1st Cross, Wilson Garden-543234"
  • Jython Evaluator: This is where we wrap every record inside a “/data” list-map.
for record in records:
  try:
    # Save the existing value, create a new list-map, and put the saved value in the map
    tmp = record.value
    record.value = sdcFunctions.createMap(True)
    record.value['data'] = tmp

    # Write record to processor output
    output.write(record)

  except Exception as e:
    # Send record to error
    error.write(record, str(e))
  • JSON Generator: Serializing the root element and storing the result under the “/data” field. (A standalone Python sketch of these three steps follows below.)
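
If you want to experiment with the same transformation outside of a pipeline, here is a minimal plain-Python sketch (not SDC script code; the sample data is copied from the example input above) that mimics the three stages:

import csv
import io
import json

# Sample input matching the Dev Data Generator output above
raw = '''NAME|AGE|DEPARTMENT|ADDRESS
"John"|"23"|"HR"|"31, 2nd Cross, 4th Street, Wilson-543234"
"Amanda"|"26"|"HR"|"24, 1st Cross, Wilson Garden-543234"'''

# Origin: parse the pipe-delimited input; quoted fields keep their commas
reader = csv.DictReader(io.StringIO(raw), delimiter='|')

for row in reader:
    # Jython Evaluator: wrap the whole record inside a "data" map
    wrapped = {'data': dict(row)}
    # JSON Generator: serialize the wrapped record to a JSON string
    print(json.dumps(wrapped))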

 

As you can see below, a preview of the pipeline demonstrates that the input CSV is now converted into JSON.

 

Conclusion:

This is simple, but it is worth noting that these little tricks can sometimes save developers a lot of time. The same problem statement can be addressed using StreamSets’ existing stage processors (without an external script evaluator), or by converting the Jython to Groovy, as Groovy is more performant than Jython. If you want to contribute additional tricks for the same problem statement, please comment and share them with the community.

Hi @swayam,

Thanks for taking the time to share this! Out of curiosity, was there a destination that you came across or wanted to use where storing the data in JSON format out of the box via the Data Format tab on the destination wasn’t an option?

Cheers,

Dash


Hi @Dash ,

While most destinations allow JSON as the “Data Format”, my intent in this post was to share two points:

  • In most ETL cases, we have predictable input where we know the field names in advance. If there is a need to wrap all elements inside a “/data” list-map, it can be achieved using the out-of-the-box processors. However, there are use cases where we don’t know the column names in advance (they may differ from one file to another). In such cases, an external evaluator (here, Jython) can be used to iterate over all columns in a record and wrap them inside the “/data” list-map (see the sketch after this list).
  • Once we have the record elements wrapped inside “/data”, if we need to send the records directly to a destination, we don’t need the “JSON Generator”, as one can always select JSON as the “Data Format” on that destination. However, I used it in my example because I want to extend this pipeline in a follow-up blog with another Jython processor that combines all the records in a batch into a single record.
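
To illustrate the first point with plain Python (the inputs here are made up for the example): the wrapping step never references a field name, so it works unchanged for files whose headers differ.

import csv
import io
import json

# Two hypothetical inputs whose headers differ from one another
files = [
    'NAME|AGE|DEPARTMENT\n"John"|"23"|"HR"',
    'PRODUCT|PRICE|IN_STOCK\n"Laptop"|"999"|"yes"',
]

for raw in files:
    for row in csv.DictReader(io.StringIO(raw), delimiter='|'):
        # No column names are hard-coded; the whole row is wrapped as-is
        print(json.dumps({'data': dict(row)}))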

Regards

Swayam

