StreamSets Data Collector and Transformer 4.0 Overview & FAQ

  • 26 November 2021

What’s Happening?

Overview

Our next data plane engine releases, expected in the May/June time frame, move up a major version, from 3.x to 4.x.

  • Data Collector 4.0 is expected to release on May 25, 2021. The final feature release of Data Collector 3.x was 3.22.2, which was released on May 4, 2021.
  • Transformer 4.0 is expected to release in mid-June 2021. The final feature release of Transformer 3.x was 3.18, which was released on March 4, 2021.
  • Updated Enterprise Stage Libraries for Snowflake and Databricks will be released in June 2021.

What is new in the Data Collector and Transformer 4.0 releases?

The key features/changes in Data Collector 4.0 are:

  • Additional connectors supported for use with Connection Catalog, including SQL Server and Oracle

The key features in Transformer 4.0 are:

  • Support for Databricks 7.0+ (on JDK 11)
  • Support for EMR 6.1+ (on JDK 11)
  • Redshift branded origin
  • Transformer Job Failover for Databricks
  • Support bootstrap actions in EMR
  • Additional connectors supported for use with Connection Catalog, including SQL Server, Oracle and Postgres

What else is changing in the Data Collector and Transformer 4.0 releases?

There are five other changes that are coinciding with the 4.0 releases.

  1. Data Collector will no longer be open source software starting with 4.0. It will be closed source. However, we will continue to offer free options for community users.
  2. Three origins that were deprecated several years ago in Data Collector 3.x will be removed in 4.0:
    • HTTP to Kafka
    • SDC RPC to Kafka
    • UDP to Kafka
  3. There are several features that are being deprecated as of Data Collector 4.0. We will continue to support these features for the lifespan of Data Collector 4.x, but they will be removed in Data Collector 5.x. 
    • Local user interface
    • Certain Data Collector stages (see full list below)
    • Cluster mode execution
  4. We are deprecating the local user interface in Transformer 4.0.
  5. We are updating the support policy for all our products, which will be effective as of May 25, 2021.

Are the enterprise stage libraries for Data Collector and Transformer going to change version numbers to 4.x?

No, they will remain on their current versioning schemes (1.x) and increment as minor releases.

 

Can I continue to use existing 3.x versions of Data Collector and Transformer? If so, for how long? Is there a hard EOL/EOSL?

Customers can continue using 3.x versions of Data Collector and Transformer. The last versions of each, Data Collector 3.22.2 and Transformer 3.18, will be fully supported until June 30, 2023.

However, we encourage all our customers to upgrade their Data Collector and Transformer engines to 4.x as soon as practical, as new features and enhancements will only be made on the 4.x versions. Data Collector 4.x and Transformer 4.x are backward compatible with Control Hub 3.x.

 

What are the benefits?

Why are the products going up a major release version? What are the benefits of 4.0 Data Collector and Transformer?

We are releasing the 4.0 versions of Data Collector and Transformer in late May and mid-June, respectively. This is a major release (hence the jump from 3.x to 4.0) because:

  • Data Collector: We are shifting our focus increasingly to the cloud. To support that, we are focusing on expanding cloud support and will no longer have Data Collector available as open source, which necessitates a major release version. While Data Collector 4.0 itself contains only minor new features, we are planning many ongoing improvements to Data Collector on the 4.x version line, as well as the enterprise stage libraries. For example, we are releasing improved Databricks and Snowflake enterprise stage libraries in June.
  • Transformer: There are significant new features in Transformer, with improved support for AWS and Databricks, which we are excited to deliver in our 4.0 release.

Why are you making these changes? 

As our customers increasingly adopt cloud platforms and deployment models, we are focused on ensuring we provide the best data engineering platform for the cloud. There are several changes and investments we are making to ensure we offer the best experience for data engineers building a pipeline to cloud platforms:

  • We are shifting the focus of how we support our community to improving our cloud service, rather than on open source. In order to focus on the cloud, we are making Data Collector closed source, which necessitates a major release version.
  • There are significant new features in Transformer, with improved support for AWS and Databricks, which we are excited to deliver in our 4.0 release. 
  • There are new improvements in the enterprise stage libraries for Databricks and Snowflake.
  • We are releasing significant new functionality over the course of 2021 to help our customers deploy and operate more easily in hybrid and multi-cloud environments. There are some framework-level changes that have been implemented in Data Collector and Transformer 4.0 which will enable easier deployment in cloud environments in future releases.

 

How are upgrades going to work?

What does this mean for upgrades? 

The upgrade process for Data Collector and Transformer from version 3.x to 4.x will work exactly like it has from a lower version of 3.x to a higher version of 3.x. Despite the jump in major version numbers, the upgrade process should be no more complex than the minor release upgrade process for the vast majority of our customers.

There is a small impact on customers with Customer Managed Control Hub (previously referred to as “On Prem SCH”). For Control Hub versions 3.18.x and earlier, the organization’s administrator needs to navigate in the Control Hub UI to Administration -> Data Collectors -> Component version range and change the value 3.99.999 to 4.99.999.
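To see why the maximum version needs to change, here is a minimal sketch of a component version range check. It is not Control Hub’s actual implementation; the function names `parse_version` and `within_max` are hypothetical, and the sketch assumes versions are compared numerically part by part.

```python
def parse_version(text):
    """Turn a dotted version such as '4.0.0' into a tuple of integers.

    Tuples compare element by element, so (4, 0, 0) > (3, 99, 999).
    """
    return tuple(int(part) for part in text.split("."))


def within_max(engine_version, max_version):
    """Return True if the engine version does not exceed the configured maximum.

    Hypothetical helper: assumes a simple numeric, component-wise comparison.
    """
    return parse_version(engine_version) <= parse_version(max_version)


# With the old maximum, a 4.x engine falls outside the allowed range.
print(within_max("4.0.0", "3.99.999"))
# After raising the maximum, both 3.x and 4.x engines are accepted.
print(within_max("4.0.0", "4.99.999"))
print(within_max("3.22.2", "4.99.999"))
```

Under these assumptions, any 4.x engine is rejected until the upper bound is raised to 4.99.999, while existing 3.x engines remain in range.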

For Control Hub versions 3.19.x (November 2020), 3.20.x (December 2020), and 3.21.0 (April 2021): Customers on any of these Control Hub versions will require an updated jar from StreamSets, as the maximum Data Collector version can no longer be configured from the user interface. Also, connections can’t be created on any of these versions when Data Collector 4.x is used as the authoring Data Collector. Please open a support ticket and we will provide guidance on the solution.

 

We generally don’t upgrade to x.0 releases as they are often unstable. Should we wait until 4.1?

4.0 is an important but incremental improvement on the 3.x versions of Data Collector and Transformer. There haven’t been major engine-level or architectural changes, so we are confident that upgrading engines to 4.0 should take about the same amount of work as upgrading to a minor release of 3.x. We encourage customers to upgrade sooner rather than later, particularly if they need the many new Transformer features in 4.0. That being said, if customers strongly prefer to wait until the 4.1 or 4.2 releases, that is fine.

 

What’s being deprecated?

What do we mean by deprecation?

"Deprecation" means that we intend to drop and remove support for a feature or capability at some point in the future. The features will still be available to existing customers and fully supported until we remove them. That includes fixing critical bugs or regressions that arise in those features. However, we will no longer enhance or expand the functionality of those features. Moreover, we will actively discourage the use of deprecated features by net new customers.

 

What exactly is being deprecated in Data Collector 4.0, and what are the implications for customers who use these features?

Local user interface:

Data Collector 3.x and earlier versions have a local user interface (UI) that allows users to design, manage and monitor pipelines specific to that Data Collector instance. The local user interface for each Data Collector instance will be deprecated as of 4.0 and removed in 5.0. Existing Data Collector 3.x users who have upgraded to Data Collector 4.x can continue using the local Data Collector UI for the lifespan of 4.x. We recommend that existing users of the local Data Collector UI switch to the Control Hub UI as soon as possible.

Data Collector stages:

The following connectors will be deprecated as of Data Collector 4.0 and removed in 5.0. If no version is stated, then all versions will be deprecated; where a version number is stated, only listed versions will be deprecated.

 

  • Aerospike
  • Apache Kudu
    • This is standalone Kudu not attached to a CDH cluster - Kudu on CDH will continue to be supported.
  • Azure Data Lake Storage Gen 1 (and Legacy connector)
  • Cloudera <= 6.3  - all stages
  • Flume
  • Greenplum GPSS Producer
  • HDP - all stages
  • Hive Streaming
  • Kafka Consumer (Legacy - not the Multitopic Consumer)
  • Kinetica
  • MapR <= 6.0
  • MemSQL FastLoader
  • NiFi HTTP Server
  • Omniture
  • SDC RPC
  • Spark Evaluator
  • Value Replacer (note this was deprecated in 3.x and will remain deprecated in 4.x, but not be removed until 5.0)
  • Teradata Consumer

Data Collector Feature: Cluster Mode Execution

Cluster mode execution of all types, including cluster batch, streaming, EMR, and Mesos, will be deprecated as of Data Collector 4.0 and removed in Data Collector 5.0.  

 

What exactly is being deprecated in Transformer?

Local user interface:

Transformer 3.x has a local user interface (UI) that allows users to design, manage and monitor pipelines specific to that Transformer instance. The local user interface for each Transformer instance will be deprecated as of 4.0 and removed in 5.0. Existing Transformer 3.x users who have upgraded to Transformer 4.x can continue to use the local Transformer UI for the lifespan of 4.x. We recommend that existing users of the local Transformer UI switch to the Control Hub UI as soon as possible.

 

Why are you deprecating the local UI for Data Collector and Transformer? We’ve been using it and switching to a new UI will cost us time & resources.

A significant number of users are indeed very accustomed to and like the Data Collector and Transformer UIs. However, we have been steadily investing and improving the Control Hub user interface, and our customers who use our data plane engines with Control Hub have seen numerous benefits and had more successful implementations. The Control Hub also has many new features such as automated deployment to cloud service platforms or support for Connection Catalog and Pipeline Fragments, which most customers find of extremely high value. 

We believe our customers will be more successful if they use the Control Hub user interface going forward, and we are thus making these changes. It will take a bit of time for users to adjust to the new UI, but we are confident that they will quickly see the benefits of switching. And as we are supporting 3.x Data Collector and Transformer for a long time, they can decide the right timing for them to make the switch.

 

We use cluster mode. What should we do?

Customers will be able to continue using cluster mode pipelines, standalone or with Control Hub 3.x, throughout the lifespan of Data Collector 4.x. Because cluster mode is a deprecated feature, we do not intend to add any new functionality to cluster mode pipelines or to stages that read from cluster mode pipelines. We will fix any documented bugs or vulnerabilities.

In the long term, we strongly recommend customers using cluster mode to implement Transformer if they haven’t already, as that is the product that is specifically designed for horizontal scaling. Once Transformer is in place, customers can convert existing Data Collector cluster mode pipelines to Transformer pipelines. The StreamSets customer success team can help guide customers through this migration process. 

 

How long can we continue using the deprecated features?

As of the 4.0 releases of Data Collector and Transformer, the deprecated features will continue to be available to existing customers in their existing 3.x deployments as well as 4.x Data Collector and Transformer versions. Customers can continue using the 3.x versions of Data Collector and Transformer. The last versions of each, Data Collector 3.22.2 and Transformer 3.18, will be fully supported until June 30, 2023. The last version of 4.x, which will be the last to include these deprecated features, will be supported for 24 months after its release date.

The features will be fully removed and no longer available as of the 5.x releases of Data Collector and Transformer. 

 

What is changing in how users can get access to the Data Collector and Transformer 4.0 engines? Where do I download the latest 4.0 versions?

You should continue to access the downloads on downloads.streamsets.com, which is a password-protected site. The download link and password are available via the Zendesk Support Portal in the knowledge base article “StreamSets Data Collector and Transformer Binaries Download”.

 

What are the changes related to making Data Collector closed source? What does it mean to become closed source?

Currently, our source code is publicly available on GitHub under the Apache Software License (ASLv2). Anyone may access the source code and, if they choose to do so, compile it on their own and use it.

 

As of Data Collector 4.0, we will no longer make our source code publicly available, and we will not be accepting community contributions to the code base. Our development team will no longer use the SDC Jira project, which is public, to track future 4.x Data Collector enhancements, although the SDC Jira will remain open. All 3.x and earlier versions of Data Collector will remain on GitHub, but 4.x will not be available there.

 

I’m concerned about this change. Who can I talk to?

Please contact your StreamSets account executive if you have questions.

 

What’s happening with the support policy?

What are the key changes in support policy?

There are two main changes in the support policy. These should not be of any material impact to any customers.

  1. We are adding in a statement around customers operating in good faith.
  2. We are simplifying and updating our back support and end of life policy to simply say that we will provide support for the current release of our software and any versions of the software released within the prior 24 months. 
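As a rough illustration of the 24-month rule, the sketch below checks whether a release date still falls inside the support window on a given day. The helper names `add_months` and `is_supported` are hypothetical, and treating the boundary day as still supported is an assumption, not something stated in the policy. The current release is supported regardless of its age, so this sketch only covers the 24-month window for older releases.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month
    (for example, Feb 29 plus 12 months becomes Feb 28).
    """
    index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def is_supported(release_date, today, window_months=24):
    """True if `today` is within `window_months` of the release date.

    Assumption: the final day of the window counts as supported.
    """
    return today <= add_months(release_date, window_months)


# Illustrative date: the expected Data Collector 4.0 release day from this post.
release = date(2021, 5, 25)
print(add_months(release, 24))
print(is_supported(release, date(2023, 5, 1)))
print(is_supported(release, date(2023, 6, 1)))
```

The explicit end-of-support dates published for specific versions (such as June 30, 2023 for the last 3.x releases) take precedence over any date computed this way.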

When is the new support policy going into effect?

May 25, 2021. From that point on, it will be effective for all existing and future releases of our software.

 

If you have any additional queries or concerns, please submit a request via our Zendesk support portal or contact us at support@streamsets.com.


1 reply

I was looking at bringing StreamSets to my next organization. I have used it extensively in previous projects. It is sad to see the abandonment of the open source community with the release of 4.0.
