Overview
Before running StreamSets Data Collector (SDC) in a production environment, there are several configuration aspects an organization should review to ensure the safety and security of the Data Collector installation. This guide will walk through each of these areas and outline best practices and considerations.
These areas include:
- HTTPS Configuration
- User Identity Management
- Securing Credentials
- Limiting Scripting Stage Access
HTTPS Configuration
Enabling HTTPS requires the following:
- An SSL certificate and private key for the Data Collector’s fully qualified domain name (FQDN)
- The openssl command
Creating a Java KeyStore
openssl pkcs12 -export -in sdc_company_com.pem -inkey sdc_company_com.key -out sdc_company_com.p12 -name sdc_company_com.p12
keytool -delete -keystore $SDC_CONF/keystore.jks -alias sdc
keytool -importkeystore -srckeystore sdc_company_com.p12 -srcstoretype PKCS12 -destkeystore $SDC_CONF/keystore.jks -deststoretype JKS
keytool -storepasswd -keystore $SDC_CONF/keystore.jks
# The private key password must match the keystore password in Java
keytool -keypasswd -alias <your cert alias> -keystore $SDC_CONF/keystore.jks
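Before pointing Data Collector at the new keystore, it can be worth confirming that the import produced the expected entry. The following is a minimal sketch, not part of the original procedure: `keystore_has_alias` is a hypothetical helper name, and the password file path in the usage comment is an assumption. It relies on `keytool -list` exiting with a non-zero status when the requested alias is missing.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SDC): succeeds only if the keystore
# contains an entry under the given alias.
#   $1 = keystore path, $2 = alias, $3 = file holding the keystore password
keystore_has_alias() {
    # -storepass:file reads the password from a file so that it does not
    # appear on the command line.
    keytool -list -keystore "$1" -alias "$2" -storepass:file "$3" >/dev/null 2>&1
}

# Example usage (paths are assumptions; adjust to your installation):
#   keystore_has_alias "$SDC_CONF/keystore.jks" sdc_company_com.p12 \
#       "$SDC_CONF/keystore-password.txt" && echo "alias present"
```

The alias to check is whatever was passed to `-name` in the openssl step, since that becomes the entry name after the import.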
Configure Data Collector
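Once the keystore is in place, HTTPS is switched on in sdc.properties. As a hedged sketch: the relevant settings are typically `https.port`, `https.keystore.path`, and `https.keystore.password`, with `https.port=-1` meaning HTTPS is disabled. Treat these names and the example port below as assumptions to verify against the sdc.properties shipped with your version. The hypothetical helpers `https_port` and `https_enabled` report whether a given properties file has HTTPS enabled.

```shell
#!/bin/sh
# Assumed sdc.properties settings (verify against your SDC version):
#   https.port=18636
#   https.keystore.path=keystore.jks
#   https.keystore.password=${file("keystore-password.txt")}

# Print the value of the last uncommented https.port line in file $1, if any.
https_port() {
    sed -n 's/^[[:space:]]*https\.port[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$1" | tail -n 1
}

# Succeed only if https.port is set to a non-negative number.
https_enabled() {
    port=$(https_port "$1")
    case $port in
        ''|*[!0-9]*) return 1 ;;  # unset, negative (such as -1), or not a number
        *) return 0 ;;
    esac
}

# Example usage:
#   https_enabled "$SDC_CONF/sdc.properties" || echo "HTTPS is not enabled"
```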
User Identity Management
Out of the box, Data Collector ships with a file-based authentication scheme. While convenient for getting started, it is not adequate for most organizations: passwords are stored as unsalted MD5 hashes in a text file, so anyone who obtains the file can attack them cheaply with precomputed tables.
Most organizations will want to enable LDAP, SAML, or DPM integration for authentication and authorization to Data Collector. The Data Collector User Guide section on User Authentication provides step-by-step instructions for configuring SDC to use these modes of authentication.
Identity Provider Specific Notes
LDAP
It is highly recommended to connect over LDAPS (typically port 636) or to enable StartTLS so that credentials are not transferred in the clear.
SAML
SAML Configuration Guide for DPM: https://cloud.streamsets.com/docs/index.html#UserGuide/OrganizationSecurity/OrgSecurity_title.html
DPM
DPM authentication is already configured securely when using the cloud-hosted version at https://cloud.streamsets.com
Enable Access Control Lists (ACL) for Fine Grained Authorization
Securing Credentials
Because most Data Collector pipelines require access to external systems such as databases, web APIs, and cloud-hosted resources, credentials are often a necessity when defining a pipeline. There are currently three ways to provide a pipeline with the necessary credentials:
- Directly in the pipeline configuration
- Using resource files to externalize them from the pipeline configuration
- Using an external secret management system such as HashiCorp Vault
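As an illustration of the second option, a credential can be written to a file in the resources directory with owner-only permissions and then referenced from the pipeline. This is a sketch under assumptions: `write_restricted_resource` is a hypothetical helper, the directory and file name are placeholders, and the `runtime:loadResource` expression in the comment should be checked against the Data Collector documentation for your version.

```shell
#!/bin/sh
# Hypothetical helper: read a secret from stdin and store it as an
# owner-only (mode 600) file.
#   $1 = resources directory, $2 = file name
write_restricted_resource() {
    target="$1/$2"
    # Read one line from stdin; keep it even if there is no trailing newline.
    IFS= read -r secret || [ -n "$secret" ] || return 1
    (
        umask 077                          # new files get no group/other access
        printf '%s' "$secret" > "$target"  # no trailing newline in the file
    ) || return 1
    chmod 600 "$target"                    # tighten a pre-existing file as well
}

# Example usage (directory and file name are assumptions):
#   printf '%s\n' "$DB_PASSWORD" | write_restricted_resource "$SDC_RESOURCES" db-password.txt
# The pipeline would then reference it with an expression such as:
#   ${runtime:loadResource("db-password.txt", true)}
```

Reading the secret from stdin rather than from an argument keeps it out of the process list and the shell history.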
Limiting Scripting Stage Access
The scripting stages available in SDC are powerful tools that can help achieve complex tasks when there is no suitable stage available out of the box. These currently fall into two types of stages:
- Scripting Processors: allow writing scripts to perform record transformations and other tasks within the pipeline.
- Script Executor: allows forking a shell process to execute a script or program outside of SDC based on event records.
Scripting Processors
The Java Security Manager is enabled by default in Data Collector. This feature of Java provides the ability to define access control list (ACL) based security policies for code locations. These code locations may be a path on a filesystem or a sandbox such as a specific class loader.
JavaScript
The JavaScript engine in Java does not provide the ability to configure the security manager policy for user-provided scripts unless they are located in the filesystem. By default, it runs with very restrictive (safe) permissions. This processor does not need any further configuration to be used in a safe manner.
Groovy
The Groovy scripting processor is the most flexible and configurable scripting processor currently available. It provides a scope for defining specific security manager policies, allowing an administrator to limit or fully restrict filesystem or network access, for example.
The following default block in the sdc-security.policy file can be augmented to suit your needs:
grant codebase "file:/groovy/script" {
  permission java.lang.RuntimePermission "getClassLoader";
};
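For instance, to let Groovy scripts read lookup files from one directory while granting nothing else, a `java.io.FilePermission` entry can be added to the block. The sketch below prints such a block; `groovy_read_grant` and the `/data/lookup` path are hypothetical, and the output is meant to be reviewed and merged into sdc-security.policy by hand rather than applied blindly.

```shell
#!/bin/sh
# Hypothetical helper: print a Groovy grant block that adds read-only access
# to one directory tree ($1) on top of the default getClassLoader permission.
groovy_read_grant() {
    cat <<EOF
grant codebase "file:/groovy/script" {
  permission java.lang.RuntimePermission "getClassLoader";
  permission java.io.FilePermission "$1/-", "read";
};
EOF
}

# Example: groovy_read_grant /data/lookup
```

In policy file syntax, a path ending in `/-` covers all files under that directory recursively, so the grant stays limited to the one tree.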
Jython
The Jython processor does not currently provide a way to configure a security policy and has very lax permissions out of the box. Installing the Jython stage library is not recommended in a production environment where you may have malicious users. If access to the environment is limited to trusted administrators, this may be acceptable. Consult your IT security team to decide whether this option is right for you.
Script Executor
The Script Executor stage allows forking a shell process to execute a script. By default it ships with impersonation disabled and executes processes as the same user that runs SDC. This makes for easy testing and development, but is not recommended for production use.
Enabling impersonation in sdc.properties runs each specified script with sudo -u <logged in user>, so that only SDC users with local accounts on the system can run scripts, and only with the permissions granted to their user account. It also prohibits running any scripts as the SDC service user for safety reasons.
Filesystem Security Model
Key directories and their common locations:
- Configuration Directory
  - /etc/sdc (RPM, Docker Default)
  - $SDC_HOME/etc (Tarball Default)
  - Dynamically generated on each start in /var/run/cloudera-scm-agent/process (Cloudera Manager Installation)
- Data Directory
  - /var/lib/sdc/data (Cloudera Manager Default)
  - /data (Docker Default)
  - $SDC_HOME/data (Tarball Default)
- Resources Directory
  - /var/lib/sdc/resources (Cloudera Manager Default)
  - /resources (Docker Default)
  - $SDC_HOME/resources (Tarball Default)
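Because the configuration and resources directories can hold keystores and externalized credentials, it is reasonable to check periodically that nothing in them is accessible to other local users. The following is a minimal sketch using only POSIX `find`; `find_loose_files` is a hypothetical helper, and which directories to pass depends on your installation type.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under the given directories that
# grant any permission bit to group or others.
find_loose_files() {
    find "$@" -type f \( -perm -040 -o -perm -020 -o -perm -010 \
        -o -perm -004 -o -perm -002 -o -perm -001 \) -print
}

# Example usage (paths are taken from the list above; adjust as needed):
#   find_loose_files /etc/sdc /var/lib/sdc/resources
```

An empty result means every file found is restricted to its owner. Some installations intentionally grant group read access to the SDC service group, in which case the group bits can be dropped from the expression.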