
Avro schema mismatch

November 27, 2021

AkshayJadhav
StreamSets Employee

Product: StreamSets Data Collector

 

Issue:

While processing Avro data and writing it to the HDFS destination, the destination creates files with an incorrect Avro schema on HDFS.

 

Avro Schema as input:

{
  "type": "record",
  "name": "avroschemawebmonitoringalert",
  "namespace": "webmonitoringalert",
  "doc": "Avro schema for webmonitoringalert table",
  "fields": [
    { "name": "webmonitoringalertid", "type": "int" },
    { "name": "alertid", "type": "int" },
    { "name": "dpxalertid", "type": ["null", "long"], "default": null },
    { "name": "dpxuri", "type": ["null", "string"], "default": null },
    { "name": "url", "type": ["null", "string"], "default": null },
    { "name": "alertvalue", "type": ["null", "string"], "default": null },
    {
      "name": "createdt",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    },
    {
      "name": "updatedt",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    }
  ]
}

Now we have two scenarios:

  • Say there is a 4 MB Avro file in the origin. When the HDFS destination's Output Files tab --> Max File Size (MB) is left at 0 (the default), the Avro schema is created as expected.
  • But when Max File Size (MB) is set to 1 or 2 in the HDFS destination, the output is split into 4 or 2 files accordingly. In the 2 MB case it creates 2 files in the HDFS destination; the first file's Avro schema is correct, but the second file's schema is incorrect. In the 1 MB case it creates 4 files, and the last two files' schemas are incorrect. A sketch for comparing the split files' schemas follows this list.
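
To compare the schemas of the split files, one option is to read the writer schema embedded in each file's header and check it against the first file. Below is a minimal Python sketch, assuming the split files have been copied locally and the fastavro package is installed (the file names here are hypothetical):

from fastavro import reader

# Hypothetical names of the split output files copied from HDFS
files = ["sdc-part-0.avro", "sdc-part-1.avro"]

def writer_schema(path):
    # The writer schema is embedded in every Avro data file's header
    with open(path, "rb") as fh:
        return reader(fh).writer_schema

baseline = writer_schema(files[0])
for path in files[1:]:
    status = "matches" if writer_schema(path) == baseline else "MISMATCH"
    print(f"{path}: {status} vs {files[0]}")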

Expected Schema:

{
  "type": "record",
  "name": "avroschemawebmonitoringalert",
  "namespace": "webmonitoringalert",
  "doc": "Avro schema for webmonitoringalert table",
  "fields": [
    { "name": "webmonitoringalertid", "type": "int" },
    { "name": "alertid", "type": "int" },
    { "name": "dpxalertid", "type": ["null", "long"], "default": null },
    { "name": "dpxuri", "type": ["null", "string"], "default": null },
    { "name": "url", "type": ["null", "string"], "default": null },
    { "name": "alertvalue", "type": ["null", "string"], "default": null },
    {
      "name": "createdt",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    },
    {
      "name": "updatedt",
      "type": { "type": "long", "logicalType": "timestamp-millis" }
    }
  ]
}

But when a maximum file size is given and multiple Avro files are created, an extra field-level logicalType attribute is added to the schema of the later files, which is incorrect:

{
  "type": "record",
  "name": "avroschemawebmonitoringalert",
  "namespace": "webmonitoringalert",
  "doc": "Avro schema for webmonitoringalert table",
  "fields": [
    { "name": "webmonitoringalertid", "type": "int" },
    { "name": "alertid", "type": "int" },
    { "name": "dpxalertid", "type": ["null", "long"], "default": null },
    { "name": "dpxuri", "type": ["null", "string"], "default": null },
    { "name": "url", "type": ["null", "string"], "default": null },
    { "name": "alertvalue", "type": ["null", "string"], "default": null },
    {
      "name": "createdt",
      "type": { "type": "long", "logicalType": "timestamp-millis" },
      "logicalType": "timestamp-millis"
    },
    {
      "name": "updatedt",
      "type": { "type": "long", "logicalType": "timestamp-millis" },
      "logicalType": "timestamp-millis"
    }
  ]
}
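
The only difference from the expected schema is the stray field-level logicalType attribute on the createdt and updatedt fields. As an illustration, the following standard-library Python sketch flags record fields that carry such a stray attribute in a dumped schema file (the file name is a placeholder):

import json

# logicalType is valid inside the type definition, not at the field level
STRAY_ATTRS = {"logicalType"}

def stray_fields(schema):
    # Return names of record fields carrying a stray attribute
    return [f["name"] for f in schema.get("fields", [])
            if STRAY_ATTRS & set(f)]

with open("second-file.avsc") as fh:  # placeholder: schema dumped via getschema
    bad = stray_fields(json.load(fh))

print("Fields with stray logicalType:", bad or "none")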

Solution:

How to verify the Avro schema of the HDFS data?

Copy the HDFS data to the local filesystem and verify the schema using the commands below:

hdfs dfs -get /<location>/<filename>.avro
$JAVA_HOME/bin/java -jar /opt/cloudera/parcels/CDH/jars/avro-tools-1.8.2-cdh6.3.0.jar getschema <filename>.avro
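
If avro-tools is not at hand, the same check can be done with a few lines of Python; for example, a sketch using the fastavro package (the file name is a placeholder):

import json
from fastavro import reader

with open("<filename>.avro", "rb") as fh:
    # writer_schema is the schema stored in the Avro file header,
    # i.e. what avro-tools getschema prints
    print(json.dumps(reader(fh).writer_schema, indent=2))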

 

To summarize: when a maximum file size is configured in the HDFS destination so that the output is split into multiple data files, the destination creates the multiple files but writes an incorrect schema to some of them, which can cause problems for any downstream processing that relies on the Avro schema.

So as a workaround, set Max File Size (MB) to 0; in that case the destination creates only one output file, and the schema is written correctly.
