Question:
We are looking to setup a new server in AWS to deploy containerized SDC instances (kubernetes or docker). Most of our pipelines are connecting to RDBMS or CSV files at this time, doing some processing and possible look-ups, and writing to HIVE / HDFS. We expect to have around 6-8 data collectors running on a single server.
What is a recommended server size in AWS to best support a deployment like this? SCH and MySQL are running on a separate server.
Answer:
The capacity planning of nodes is going to be a function of how much data will pass through the pipelines, the # of pipelines and the complexity of the executors and processors used in those pipelines.
You could scale up or scale out. Our recommendation would be to scale horizontally with smaller instances rather than big beefy instances.
- Remember SDC is a memory based architecture and do little with the local disk if at all - except logging of course. So, we don't really care about high-speed storage.
- The simplest route would be to go with a few 8 or 16 core instances with 16-32 GB of RAM and scale out horizontally.
- If this is existing on an on-prem infrastructure and you are migrating to the cloud, it would be good to look at your current on-prem memory, CPU, bandwidth usage of the pipelines and see if making the requisite changes on memory, CPU and bandwidth on those pipelines will help.
- Using our Kubernetes Provisioning agents through control hub will make your StreamSets Data Collector management a breeze.
Also, please refer the following sections from Capacity Planning and Performance Tuning Primer for more details.
Workload
Evaluating your workload is the first step for initially tuning a Data Collector instance. After
making some estimates on the workload we can estimate (guess) the amount of CPU,
memory and network capacity we should start with for tuning.
Performance tuning is an iterative process, but initial capacity planning sets the direction.
Running a heavy workload on a Data Collector requires a significant amount of memory,
CPU power and network throughput for best performance, but Data Collector’s initial "out
of the box" configuration is only appropriate for running a few, simple pipelines.
To initially provision memory and CPU, we need to examine (guess) at the workload. Three
related factors are at play here -
- the number of pipelines
- the complexity of each pipeline, and
- the "busyness" of each pipeline.
Number of pipelines seems fairly obvious.
Pipeline complexity is based on the number of and type of stages. Stages such as XML Parser, JSON Parser, Field Pivoter and the Language processors (Groovy, Jython, and JavaScript) often need a significant amount of memory.
The next factor, pipeline "busyness", is sometimes difficult to gauge and will often vary
significantly between development and production environments.
Consider the amount of incoming data - for example,
- Will there always be new files waiting for Directory Origin to process? This should be considered 100% busy.
- Will the data arrival rate into Kafka queues cause them to often contain unprocessed records? 100% busy.
- Will JDBC Origin be running in incremental mode which may, depending on data arrival
 speed, and the configuration of the Origin (specifically the value of “Query Interval”) leave
 significant idle time between processing batches of records? Perhaps this pipeline is 50%
 busy.
CPU
Get a good idea of the number of pipelines, figure out the %busy for each pipeline. Sum
the %busy values and we can make an initial guess at the CPU requirements.
For example, if the %busy values of all the pipelines add up to 800%, we need to provision
at least 8 cpus. Generally you don’t want to run the machine at 100% CPU used, and Data
Collector itself needs additional CPU resources to accommodate calculating metrics,
running the UI, garbage collection, overhead and other bookkeeping activities as well as
having available headroom in case of spikes in the incoming data’s arrival rate. Therefore, it
might be better to include some headroom and provision 12 or 16 CPUs.
Also, you’ll probably want to provision additional resources if it turns out that your %busy
estimates were low, or you feel there may be additional requirements and additional
pipelines to be added later.
Memory/Heap Space
The starting point for heap space provisioning, also another guess, is to provide between
300mb and 500mb per pipeline based on pipeline complexity.
Origins create lots of objects in the heap, typically, destinations free them. Processors such
as Field Type Converter and Stream Selector need less space in the heap, but still have a
footprint in the overall heap use. If you're unsure, just assign 500mb per pipeline to start
with. For simple pipelines, those with just an origin and destination, you can estimate
300mb per pipeline. Also, this number seems to be in line with Data Collector’s defaults.
For example, if you have 10 pipelines, 3 very complicated and 7 less so, you might provision
a 4g heap to start with. (3 * 500 + 7 * 300)
Heap size is simple to change and generally we set -Xmx and -Xms to the same value. You
can analyze the gc.log with a service such as gceasy.io. This service generates some nice
graphs, which include heap used before and after gc, and information about gc duration.
Gceasy.io also does some analytics on the gc.log, and in some egregious cases provides
recommendations.
Also, just to throw out another “rule of thumb”, if your heap is more than 10g, you might
want to try using G1GC, instead of CMS which is the default shipping with Data Collector.
Here is an interesting article about how G1GC works:
https://www.dynatrace.com/news/blog/understanding-g1-garbage-collector-java-9/
Network
In some shops, and when using virtual machines, your network provisioning options may
be limited. This will inform our other decisions regarding machine sizing.
IO
Generally IO throughput is not an issue for Data Collector unless you’re running many
pipelines, and they are all using directory origin or another local file-based Origin or
destination.
Selecting the correct size machine.
This is basically trying to determine if you should provision one “giant” machine or
provision more, smaller machines. There are several advantages to more, smaller machines. When considering Failover, High Availability, Disaster Recovery and Business Continuity, it’s generally better to have more smaller machines. In the Failover case, SCH has more options to select a Data Collector machine when several machines are available. Similarly for High Availability for SCH or InfluxDB, you generally need 2 or 4 machines to support HA configurations.
For DR, and Business Continuity you may need to deploy enough machines to support Failover and HA configurations in each data center so you can maintain long term operations with one data center down.
In addition, utilizing more smaller machines makes it less expensive to deploy additional
resources.
Mufeed Usman
