Can I convert JSON to CSV (delimited) data?

  • 17 January 2022
  • 1 reply
  • 303 views

Userlevel 4
Badge

It's possible to convert a data format that supports complex fields such as JSON to a flat file format such as CSV, but it can often get tricky very fast.

Consider a JSON example of:

{ "first_name": "Mary", 
"last_name": "Jane"
}

Seems simple enough. This JSON actually resembles a flat-file format as there is no nesting of fields. We could flatten this into a CSV that looks like:

first_name,last_name
Mary,Jane

However, this gets a little more complicated if we had the following:

{
"name": {
"first": "Mary",
"last": "Jane"
}
}

Now we must take each level of nesting and "flatten" it by creating some kind of composite field name. This is still doable, with a recursive function such as the one below that can be used in the Jython Evaluator Processor.

def flatten_dict(d):
def items():
for key, value in d.items():
if isinstance(value, dict):
for subkey, subvalue in flatten_dict(value).items():
yield key + "." + subkey, subvalue
else:
yield key, value

return dict(items())

The final case to account for is the most difficult, where you may still have arrays in your final data, another complex field type. Arrays are harder to flatten because they don't have a natural key (field name) and you must generate one. It's possible to use the field name of the array itself plus a number.

The best solution really as you may have already figured out is to use a more complex data-friendly format such as Avro, which is also supported in StreamSets and by the vast majority of the Hadoop ecosystem.


1 reply

Hi Akshay,

Check out this thread where the same topic is discussed and a solution is found https://groups.google.com/a/streamsets.com/g/sdc-user/c/oMANqyP3KQA . The initial solution they propose is using a Field Flattener, but it gets a little more complicated after that!

Regards, Mikel.

Reply