Problem Description
When an SDC pipeline processes records whose strings contain non-ASCII characters (for example ™), those characters are replaced by question marks (? or �) in various parts of the pipeline.
Example
Input record:
{"productName": "My Product™"}
Output record:
{"productName": "My Product�"}
Root Cause
This problem indicates that your Java Runtime Environment is using a non-UTF-8 character set which does not support these special characters, so the special characters are being replaced by the � (U+FFFD) REPLACEMENT CHARACTER.
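The two symbols (? and �) come from two different steps: encoding a string into a charset that cannot represent a character, and decoding bytes with the wrong charset. As a minimal sketch (the ReplacementDemo class and its method names are illustrative and not part of SDC), the following Java program reproduces both effects using explicit charsets, so it behaves the same regardless of your JVM settings:

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Encode to US-ASCII and decode again: String.getBytes substitutes
    // the charset's replacement byte ('?') for unmappable characters.
    static String roundTripAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Encode as UTF-8 but decode as US-ASCII: every byte that is not
    // valid ASCII is decoded as U+FFFD REPLACEMENT CHARACTER.
    static String decodeUtf8AsAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u2122 is the trademark sign, written as an escape so the source
        // file itself does not depend on the compiler's source encoding.
        String input = "My Product\u2122";
        System.out.println(roundTripAscii(input));
        System.out.println(decodeUtf8AsAscii(input));
    }
}
```

What you see on the console also depends on the encoding of your terminal, so treat the printed output as indicative only; the returned strings are what matter.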
By default, SDC tries to set your JAVA_OPTS so that the JVM uses UTF-8. However, JAVA_OPTS may also be set outside of SDC, overriding one or both of the JVM parameters that tell the JRE which encoding to use at runtime: file.encoding and sun.jnu.encoding.
The JRE also has a default character set which it uses if these two parameters are not specified. This default can vary between JDKs and can also depend on the locale configured in your operating system.
Check what JRE character encoding is being used at SDC’s runtime
You can run this script to see what encoding is being set at SDC runtime:
# set your JAVA_OPTS and SDC_JAVA_OPTS env variables which are used when starting SDC
source $SDC_HOME/libexec/sdcd-env.sh
echo $SDC_JAVA_OPTS
source $SDC_HOME/libexec/sdc-env.sh
echo $SDC_JAVA_OPTS
export JAVA_OPTS="${SDC_JAVA_OPTS}"
echo $JAVA_OPTS
# compile and run a small Java program that prints the encoding used by your JRE;
# JAVA_OPTS is passed explicitly because the java launcher does not read it on its own
pushd /tmp > /dev/null &&
echo 'import java.nio.charset.Charset;
class CharsetTest {
    public static void main(String[] args) {
        System.out.println("file.encoding:\t" + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding:\t" + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset:\t" + Charset.defaultCharset().displayName());
    }
}' > CharsetTest.java && javac CharsetTest.java && java $JAVA_OPTS CharsetTest && popd > /dev/null
This may output something like:
file.encoding: ANSI_X3.4-1968
sun.jnu.encoding: ANSI_X3.4-1968
default charset: US-ASCII
indicating that your JRE is using ASCII (ANSI_X3.4-1968 is another name for US-ASCII) rather than UTF-8.
Solution
To fix this issue, simply:
- Append the following parameters to SDC_JAVA_OPTS in your $SDC_HOME/libexec/sdc-env.sh and $SDC_HOME/libexec/sdcd-env.sh files to force UTF-8 encoding in your JVM:
-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
- You can check that these parameters are being applied correctly by running the above script again. Your output should now look like this (if the default character set is still not UTF-8, it should not matter as long as the other two parameters are set to UTF-8):
file.encoding: UTF-8
sun.jnu.encoding: UTF-8
default charset: UTF-8
- Now you can restart your SDC service for these changes to be applied.
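As an optional extra check beyond reading the property values, a sketch along the following lines tests whether a sample string actually survives encoding and decoding with the JVM's default charset. The Utf8Check class and its helper methods are hypothetical and not shipped with SDC; the sample string is taken from the example record above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // True if every character of s can be represented in the given charset.
    static boolean canEncode(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    // True if s is unchanged after being encoded to bytes and decoded again.
    static boolean roundTrips(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        String sample = "My Product\u2122"; // trademark sign as an escape
        Charset def = Charset.defaultCharset();
        System.out.println("default charset:   " + def.name());
        System.out.println("can encode sample: " + canEncode(sample, def));
        System.out.println("round trip intact: " + roundTrips(sample, def));
    }
}
```

Compile it with javac Utf8Check.java and run it as java $JAVA_OPTS Utf8Check, so that it sees the same options as SDC. If both checks report true, the default charset can carry the characters from the example record.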