
Data Collector Hardening Guide


Drew Kreiger

Overview

Before running StreamSets Data Collector (SDC) in a production environment, there are several configuration aspects an organization should review to ensure the safety and security of the Data Collector installation. This guide will walk through each of these areas and outline best practices and considerations.

These areas include:

  • HTTPS Configuration
  • User Identity Management
  • Securing Credentials
  • Limiting Scripting Stage Access

HTTPS Configuration

In order to ensure secured communications with the Data Collector UI and REST API, it is necessary to enable HTTPS.
 
Prerequisites:
  • SSL certificate and private key for the Data Collector’s fully qualified domain name (FQDN)
  • openssl command
 
If you already have a Java KeyStore (JKS) file containing the SSL certificate and private key, you may skip ahead to the Configure Data Collector section.

Creating a Java KeyStore

In most cases, certificates are issued in PEM format, which cannot be imported together with a private key directly into a Java KeyStore. It must first be converted to PKCS12 (p12) format using the openssl command.
 
Given a certificate sdc_company_com.pem and private key sdc_company_com.key, the following command converts them into a p12 key store. You may be prompted to create a password; since this is a temporary file that will be erased, any non-blank password will do. Using a blank password may prevent a proper import/export into the Java key store.
openssl pkcs12 -export -in sdc_company_com.pem -inkey sdc_company_com.key -out sdc_company_com.p12 -name sdc_company_com.p12
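For reference, the conversion can be exercised end-to-end with a throwaway self-signed certificate standing in for a real issued certificate. The /tmp paths, the CN, and the changeit password below are illustrative only; the -passout flag supplies the export password non-interactively:

```shell
# Generate a throwaway self-signed cert/key pair (stand-in for a real issued cert)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/sdc_company_com.key -out /tmp/sdc_company_com.pem \
  -days 1 -subj "/CN=sdc.company.com"

# Bundle cert and key into a PKCS12 store; -passout avoids the interactive prompt
openssl pkcs12 -export -in /tmp/sdc_company_com.pem -inkey /tmp/sdc_company_com.key \
  -out /tmp/sdc_company_com.p12 -name sdc -passout pass:changeit

# Sanity-check that the resulting p12 store opens with the given password
openssl pkcs12 -in /tmp/sdc_company_com.p12 -passin pass:changeit -noout && echo P12_OK
```

With a real issued certificate you would of course skip the first command and start from your existing .pem and .key files.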
 
Next, delete the self-signed certificate from the default Data Collector key store file located in $SDC_CONF/keystore.jks with the following command:
keytool -delete -keystore $SDC_CONF/keystore.jks -alias sdc
The default password is password and is located in the keystore-password.txt file.
 
Now, you are ready to import the p12 key store entries into the JKS key store.
keytool -importkeystore -srckeystore sdc_company_com.p12 -srcstoretype PKCS12 -destkeystore $SDC_CONF/keystore.jks -deststoretype JKS
 
Note: You can change the password of the key store file with the following command:
keytool -storepasswd -keystore $SDC_CONF/keystore.jks
# The private key password must match the keystore password in Java
keytool -keypasswd -alias <your cert alias> -keystore $SDC_CONF/keystore.jks

Configure Data Collector

The final step is to configure sdc.properties with the appropriate settings. Change https.port to the port you wish to use for HTTPS connections, for example 18631. Any insecure requests made on the http.port will be redirected to the secure port automatically.
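The relevant entries in sdc.properties look roughly like the following. These property names are taken from a stock sdc.properties file; verify them against your SDC version, and note that the keystore path is resolved relative to $SDC_CONF:

```properties
# Requests arriving on http.port are redirected to the secure port
http.port=18630
https.port=18631
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
```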

User Identity Management

Out of the box, Data Collector ships with a file-based authentication scheme. While convenient for getting started, this does not offer adequate security for most organizations. The passwords are stored as un-salted MD5 hashes in a text file, which offers little security.

Most organizations will want to enable LDAP, SAML, or DPM integrations for authentication and authorization to Data Collector. The Data Collector User Guide section on User Authentication provides step by step instructions for configuring SDC to use these modes of authentication.

Identity Provider Specific Notes

LDAP

It is highly recommended to enable a secure LDAP connection, either LDAPS (typically port 636) or StartTLS, to ensure that credentials are not transferred in the clear.

SAML

SAML Configuration Guide for DPM: https://cloud.streamsets.com/docs/index.html#UserGuide/OrganizationSecurity/OrgSecurity_title.html

DPM

DPM authentication is already configured in a secure manner when enabled using the cloud-hosted version at https://cloud.streamsets.com

 

Enable Access Control Lists (ACL) for Fine Grained Authorization

Enabling ACLs allows you to set fine-grained permission controls on pipelines, which is useful when some pipelines should be kept out of view of certain subsets of users.
 
For further information on ACLs see the SDC documentation chapter on Pipeline Permissions here: https://streamsets.com/documentation/datacollector/latest/help/index.html#Configuration/RolesandPermissions.html#concept_i1p_hzd_yy
 

Securing Credentials

Because most Data Collector pipelines require access to external systems such as databases, web APIs, and cloud-hosted resources, credentials are often a necessity when defining a pipeline. There are currently three ways to provide a pipeline with the necessary credentials.

  • Directly in the pipeline configuration
  • Using resource files to externalize them from the pipeline configuration
  • An external secret management system such as Hashicorp Vault
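As an illustration, here is how each approach surfaces in a stage's configuration field. The store ID jks, the group all, and the secret names below are hypothetical; credential:get() and runtime:loadResource() are the expression-language functions described in the SDC documentation, and the exact secret-name syntax varies by credential store:

```text
# 1. Directly in the pipeline configuration (least secure; stored in the pipeline JSON)
password = my-plaintext-password

# 2. Externalized into a file under the resources directory
#    (second argument true requires the file to be owner-readable only)
password = ${runtime:loadResource("db-password.txt", true)}

# 3. Retrieved at runtime from a configured credential store
password = ${credential:get("jks", "all", "dbPassword")}
```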

Limiting Scripting Stage Access

The scripting stages available in SDC are powerful tools that can help achieve complex tasks when there is no suitable stage available out of the box. These currently fall into two types of stages:

  • Scripting Processors — Allows writing scripts to perform record transformations and other tasks within the pipeline
  • Script Executor — Allows forking a shell process to execute a script or program outside of SDC based on event records.

Scripting Processors

The Java Security Manager is enabled by default in Data Collector. This feature of Java provides the ability to define access control list (ACL) based security policies for code locations. These code locations may be a path on a filesystem or a sandbox such as a specific class loader.

JavaScript

The JavaScript engine in Java does not provide the ability to configure the security manager policy for user-provided scripts unless they are located in the filesystem. By default, it runs with very restrictive (safe) permissions. This processor does not need any further configuration to be used in a safe manner.

Groovy

The Groovy scripting processor is the most flexible and configurable scripting processor currently available. It provides a scope for defining specific security manager policies, allowing an administrator, for example, to limit or fully restrict filesystem or network access.

The following default block in the sdc-security.policy file can be augmented to suit your needs:

grant codebase "file:/groovy/script" {
  permission java.lang.RuntimePermission "getClassLoader";
};
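For example, to let Groovy scripts read (but not write) files under a single directory tree while everything else stays denied, the grant block could be extended as follows. The /data/lookup path is an assumption for illustration; the "-" wildcard in a FilePermission matches all files recursively:

```
grant codebase "file:/groovy/script" {
  permission java.lang.RuntimePermission "getClassLoader";
  // Allow read-only access to one directory tree (hypothetical path)
  permission java.io.FilePermission "/data/lookup/-", "read";
};
```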

Jython

The Jython processor does not currently provide a way to configure a security policy and has very lax permissions out of the box. It is not recommended to install the Jython stage library in a production environment where you may have malicious users. If access to the environment is limited to trusted administrators only, this may be acceptable. Consult your IT security team to decide whether this option is right for you.

Script Executor

The Script Executor stage allows forking a shell process to execute a script. By default it ships with impersonation disabled and will execute processes as the same user that is running the SDC. This makes for easy testing and development, but is not recommended for production use.

Enabling impersonation in sdc.properties will run each specified script with sudo -u <logged in user>, so that only SDC users with local accounts on the system can run scripts, and only with the permissions granted to their user account. It also prohibits running any scripts as the SDC service user for safety reasons.
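A minimal sketch of the relevant sdc.properties entry is shown below. The property key is the one used by recent SDC releases for the Shell executor, but confirm the exact key against the comments in your version's sdc.properties; note also that the host's sudoers configuration must permit the SDC service user to run commands as the target users:

```properties
# Run Shell executor scripts as the logged-in SDC user instead of the service user
stage.conf_com.streamsets.pipeline.stage.executor.shell.impersonation_mode=current_user
```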

Filesystem Security Model

StreamSets Data Collector stores some data on local disk. This data includes application configuration, pipelines and their state, and logs. Each location is configurable and accessible via the Settings > SDC Directories menu item in the SDC UI.
 

Key directories and their common locations:

  • Configuration Directory
    • /etc/sdc (RPM, Docker Default)
    • $SDC_HOME/etc (Tarball Default)
    • Dynamically generated on each start in /var/run/cloudera-scm-agent/process (Cloudera Manager Installation)
  • Data Directory
    • /var/lib/sdc/data (Cloudera Manager Default)
    • /data (Docker Default)
    • $SDC_HOME/data (Tarball Default)
  • Resources Directory
    • /var/lib/sdc/resources (Cloudera Manager Default)
    • /resources (Docker Default)
    • $SDC_HOME/resources (Tarball Default)
 
Configuration Directory
The configuration directory contains the files that control all configuration parameters, as well as the policy file for the Java Security Manager. Configuration files are typically owned by the service user that runs SDC (often the user ‘sdc’) and default to file mode 644. Sensitive information is by default retrieved from external files, such as email-password.txt, referenced from the configuration files; these files are expected to be at most owner read-write (600). In the case of Cloudera Manager installations, config files are not stored in a permanent location; they are generated on the fly at each service start and managed by Cloudera Manager itself. When Kerberos is enabled for SDC, the keytab is stored in the configuration directory.
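The mode expectations above can be sketched as follows. The commands operate on a throwaway copy under /tmp so they are safe to run anywhere; substitute your real $SDC_CONF, file names, and service user in practice:

```shell
# Throwaway stand-in for $SDC_CONF, safe to create and delete
SDC_CONF_DEMO=/tmp/sdc-conf-demo
mkdir -p "$SDC_CONF_DEMO"

printf 'https.port=18631\n' > "$SDC_CONF_DEMO/sdc.properties"
printf 'hunter2\n'          > "$SDC_CONF_DEMO/email-password.txt"

# Ordinary config files: owner read-write, world-readable (644)
chmod 644 "$SDC_CONF_DEMO/sdc.properties"
# Secret files referenced from config: owner read-write only (600)
chmod 600 "$SDC_CONF_DEMO/email-password.txt"

stat -c '%a %n' "$SDC_CONF_DEMO"/*   # GNU stat; on BSD/macOS use: stat -f '%Lp %N'
```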
 
Data Directory
The data directory is where pipelines are stored as well as the pipeline state and pipeline state history data. This location should not be writable by other users except for the SDC service user.
 
Resources Directory
The resources directory contains resource files provided by system administrators for users to access in their pipelines. For example, the resources directory often contains Hadoop client configuration files so that stages like the HDFS Destination may function. This location is not generally writable by users, but the security manager policy may be modified to permit certain access by user scripts if necessary.
 
-Adam Kunicki
October 28, 2020 13:21