UTF-8 Special Characters Being Replaced by Question Marks (�) in Pipelines

  • 19 October 2022

Problem Description

When an SDC pipeline processes records whose strings contain special (non-ASCII) characters, those characters are replaced by question marks (? or �) in various parts of the pipeline.

Example

Input record:

{"productName": "My Product™"}

Output record:

{"productName": "My Product�"}

Root Cause

This problem indicates that your Java Runtime Environment (JRE) is using a non-UTF-8 character set that cannot represent these special characters, so each of them is replaced by a question mark or by � (U+FFFD REPLACEMENT CHARACTER).
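
The replacement behaviour can be reproduced outside SDC. The following is a minimal sketch, not SDC code: the class and method names are made up for illustration, and US-ASCII stands in for whatever non-UTF-8 charset your JRE happens to use. It shows both variants of the symptom: encoding to a charset that lacks the character yields ?, and decoding UTF-8 bytes with the wrong charset yields U+FFFD.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of SDC.
class ReplacementDemo {

    // Encode with a charset that cannot represent the character, then read it back.
    // String.getBytes(Charset) substitutes the charset's replacement byte, '?' for US-ASCII.
    static String encodeLossy(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    // Decode valid UTF-8 bytes as if they were US-ASCII.
    // Bytes that are invalid in US-ASCII are replaced with U+FFFD.
    static String decodeWrong(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u2122 is the trademark sign; the escape keeps this source file pure ASCII.
        String name = "My Product\u2122";
        // Booleans are printed so the result does not depend on the console encoding.
        System.out.println(encodeLossy(name).equals("My Product?"));
        System.out.println(decodeWrong(name).contains("\uFFFD"));
    }
}
```

Both lines print true on any JVM, because the charsets are named explicitly here. In a pipeline the same substitution happens implicitly wherever the JRE's default charset is used.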

By default, SDC tries to set your JAVA_OPTS so that the JVM uses UTF-8 encoding. However, JAVA_OPTS may also be set outside of SDC, overriding one or both of the JVM parameters that tell the JRE which encoding to use at runtime: file.encoding and sun.jnu.encoding.

The JRE also has a default character set, which it uses if these two parameters are not specified. This default can vary between JDKs and can also depend on the locale configured in your operating system.

Check what JRE character encoding is being used at SDC’s runtime

You can run this script to see what encoding is being set at SDC runtime:

# set the JAVA_OPTS and SDC_JAVA_OPTS env variables that are used when starting SDC
source $SDC_HOME/libexec/sdcd-env.sh
echo $SDC_JAVA_OPTS
source $SDC_HOME/libexec/sdc-env.sh
echo $SDC_JAVA_OPTS
export JAVA_OPTS="${SDC_JAVA_OPTS}"
echo $JAVA_OPTS

# run a test Java program that prints the encoding used by your JRE
# (the java launcher does not read JAVA_OPTS on its own, so it is passed explicitly)
pushd /tmp > /dev/null &&
echo 'import java.nio.charset.Charset;

class CharsetTest {

    public static void main(String[] args) {
        System.out.println("file.encoding:\t" + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding:\t" + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset:\t" + Charset.defaultCharset().displayName());
    }

}' > CharsetTest.java && javac CharsetTest.java && java $JAVA_OPTS CharsetTest && popd > /dev/null

This may output something like:

file.encoding: ANSI_X3.4-1968
sun.jnu.encoding: ANSI_X3.4-1968
default charset: US-ASCII

indicating that your JRE is using ASCII rather than UTF-8.
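
The first two lines and the third line of that output name the same character set: ANSI_X3.4-1968 is an alias that Java resolves to US-ASCII. A small sketch to confirm this (the class and method names are illustrative only):

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of SDC.
class AliasCheck {

    // Resolve a charset name or alias to the canonical name used by the JRE.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("ANSI_X3.4-1968"));
    }
}
```

This prints US-ASCII, so any of the three values shown above is enough to tell that the JRE is not running with UTF-8.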

Solution

To fix this issue, simply:

  1. Append the following parameters to SDC_JAVA_OPTS in your $SDC_HOME/libexec/sdc-env.sh and $SDC_HOME/libexec/sdcd-env.sh files to force UTF-8 encoding in your JVM:
    • -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
  2. Check that these parameters are being applied correctly by running the above script again. Your output should now look like this (if the default character set is still not UTF-8, it shouldn’t matter as long as the other two parameters are set to UTF-8):
    • file.encoding: UTF-8
      sun.jnu.encoding: UTF-8
      default charset: UTF-8
  3. Restart your SDC service for these changes to take effect.
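
As an additional check beyond printing the property values, you could ask the JVM whether its default charset can actually encode the character from the example record. This is only a sketch: the class and method names are hypothetical, and it needs to be started with the same JVM options as SDC to be meaningful.

```java
import java.nio.charset.Charset;

// Hypothetical check class, not part of SDC.
class EncodeCheck {

    // True if every character of the text can be represented in the given charset.
    static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // \u2122 is the trademark sign from the example record.
        String sample = "My Product\u2122";
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset:\t" + defaultCharset.name());
        System.out.println("can encode sample:\t" + canEncode(defaultCharset, sample));
    }
}
```

If the second line reports false, the JVM is still running with a charset that would replace the character.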

1 reply

Hi @john.mcavoy,

I have exactly the same problem, but we are running our SDC on the “standard” Docker image from StreamSets.

I was able to execute only the first lines of your script (no javac):

source $SDC_HOME/libexec/sdcd-env.sh
echo $SDC_JAVA_OPTS
-Xmx1024m -Xms1024m -server -XX:-OmitStackTraceInFastThrow -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Dlog4j2.formatMsgNoLookups=true
source $SDC_HOME/libexec/sdc-env.sh
echo $SDC_JAVA_OPTS
-Xmx1024m -Xms1024m -server -XX:-OmitStackTraceInFastThrow -Xmx1024m -Xms1024m -server -XX:-OmitStackTraceInFastThrow -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Dlog4j2.formatMsgNoLookups=true
export JAVA_OPTS="${SDC_JAVA_OPTS}"
echo $JAVA_OPTS
-Xmx1024m -Xms1024m -server -XX:-OmitStackTraceInFastThrow -Xmx1024m -Xms1024m -server -XX:-OmitStackTraceInFastThrow -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -Dlog4j2.formatMsgNoLookups=true

So, as you can see, the encoding is set to UTF-8. Do you have any other ideas?

Regards Sebastian
