
StreamSets has added support for running Transformer for Spark on a new type of cluster: Amazon EMR (Elastic MapReduce) Serverless. You can choose it or any of the other supported cluster types.

What is Amazon EMR Serverless?

Amazon EMR Serverless is a deployment option for Amazon EMR that lets users run big data processing workloads without provisioning or managing any compute resources. That means you don't need to know much about starting, stopping, and managing clusters to get started. With Amazon EMR Serverless, users can focus on their data processing tasks instead of managing clusters or paying for idle resources.

Amazon EMR Serverless is a flexible option for processing big data workloads: users pay only for the processing time their jobs require, because the clusters automatically spin down when not in use. It also scales up or down as needed to meet workload demands.

The Benefits of Transformer for Spark + EMR Serverless

By choosing an Amazon EMR Serverless cluster to run Transformer for Spark pipelines, data teams can benefit from ease of use and scalability. For example, with EMR Serverless you don't have to provision and manage your own Spark clusters, which gives users access to the scalability and power of Spark without having to be experts in the underlying hardware. In addition, Amazon EMR Serverless can scale up or down as needed to meet the demands of a workload. This can be especially important for data integration workloads, which can vary widely in data volume and processing requirements. As a cloud-native platform, StreamSets has scalability built in and can adjust to clusters of almost any scale.

Finally, users of Transformer for Spark and EMR Serverless can use both technologies together to automatically spin down clusters when they are idle or when their associated pipelines finish. That means data teams aren't on the hook for potentially pricey idle time when no actual processing is being done. Scheduling pipeline jobs in StreamSets lets data teams carefully plan, orchestrate, and monitor jobs to maximize these potential cost savings.
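To make the idle-time point concrete, here is a rough back-of-the-envelope sketch in Python. The hourly rate, job duration, and function names are all hypothetical and chosen only for illustration; they are not actual AWS or StreamSets pricing, and real billing is more granular than a single flat hourly rate.

```python
# Hypothetical figures for illustration only; not actual AWS or StreamSets pricing.
HOURLY_RATE = 2.00   # assumed cost per hour of running compute
HOURS_IN_DAY = 24


def always_on_cost(days: int, hourly_rate: float = HOURLY_RATE) -> float:
    """Cost of a cluster that runs around the clock, whether busy or idle."""
    return days * HOURS_IN_DAY * hourly_rate


def pay_per_use_cost(job_hours_per_day: float, days: int,
                     hourly_rate: float = HOURLY_RATE) -> float:
    """Cost when compute is billed only while pipeline jobs are running."""
    return days * job_hours_per_day * hourly_rate


def idle_savings(job_hours_per_day: float, days: int,
                 hourly_rate: float = HOURLY_RATE) -> float:
    """Amount that would otherwise be spent on idle time."""
    return always_on_cost(days, hourly_rate) - pay_per_use_cost(
        job_hours_per_day, days, hourly_rate
    )


if __name__ == "__main__":
    # A scheduled pipeline that runs 3 hours per day for 30 days.
    print(f"Always on:    ${always_on_cost(30):,.2f}")
    print(f"Pay per use:  ${pay_per_use_cost(3, 30):,.2f}")
    print(f"Idle savings: ${idle_savings(3, 30):,.2f}")
```

Under these assumed numbers, most of the always-on bill comes from hours when no pipeline is running, which is the cost that automatic spin-down is meant to avoid.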

StreamSets is continually improving by adding new features, supported platforms, origins and destinations. Please join us in the StreamSets community for updates, use cases and more.
