
Which ports need to be open for communication between Control Hub (SCH) and Data Collector (SDC)?

  • February 19, 2022

AkshayJadhav
StreamSets Employee

Overview:

 

You're probably aware that Control Hub is the command-and-control center for data pipeline jobs, while Data Collector is the worker agent that runs these jobs on specific hosts. Typically we think of jobs as running production pipelines in the background. However, authoring a pipeline (for example, running a preview or snapshot) also executes jobs on a designated SDC, and that communication happens directly between the user's web browser and the SDC, even though it appears in the Control Hub UI. So there are three communication paths to consider:

Browser <-> Control Hub

If you are using the StreamSets managed Control Hub instance at https://cloud.streamsets.com/, then the web configuration is handled for you. All communication happens on the default HTTPS port 443.

If you have an on-premises Control Hub, you can choose any port to run on, as specified in the dpm.properties file. We recommend using HTTPS, so 443 is still a good choice. This port will need to be open for inbound/outbound communication with the end users' network (e.g., the corporate VPN).

 

Data Collector -> Control Hub

All communication between the Data Collector and Control Hub is initiated by the Data Collector, so connections only ever flow one way. This was designed specifically so that customers don't have to create firewall rules that allow incoming traffic to their Data Collectors. Instead, Control Hub puts jobs into a message queue, and the Data Collector regularly polls the queue via a REST API.
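The polling model described above can be illustrated with a toy in-memory sketch. This is not the StreamSets API: the class and method names (`ControlHubQueue`, `DataCollectorAgent`, `fetch_pending`, `poll_once`) are hypothetical, and the real exchange happens over a REST API. The point is only that every call originates on the Data Collector side.

```python
from collections import deque


class ControlHubQueue:
    """Toy stand-in for the job queue held on the Control Hub side (hypothetical)."""

    def __init__(self):
        self._pending = deque()

    def submit(self, job):
        # Control Hub never contacts the Data Collector; it only enqueues work.
        self._pending.append(job)

    def fetch_pending(self):
        # Called by the Data Collector; returns and clears the queued jobs.
        jobs = list(self._pending)
        self._pending.clear()
        return jobs


class DataCollectorAgent:
    """Toy stand-in for a Data Collector that polls for work (hypothetical)."""

    def __init__(self, hub):
        self.hub = hub
        self.executed = []

    def poll_once(self):
        # The Data Collector initiates the request, so only outbound access is needed.
        jobs = self.hub.fetch_pending()
        self.executed.extend(jobs)
        return len(jobs)
```

In a real deployment the agent would repeat the poll at a regular interval; the sketch shows a single poll to keep the direction of the call clear.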

So whichever port you choose for Control Hub (say 443) must also be open on the Control Hub host for inbound access from the Data Collector (DC) hosts. The DC hosts themselves only need outbound access on that port, not inbound.
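One way to confirm that a DC host has the outbound access it needs is to attempt a plain TCP connection to the Control Hub port from that host. The following is a minimal sketch using only the Python standard library. The function name `can_connect` is made up for illustration, and the host and port in the usage comment are examples to replace with your own Control Hub address.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened from this machine."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage, run on the DC host (address is an assumption):
# can_connect("cloud.streamsets.com", 443)
```

A `True` result only shows that the TCP port is reachable. It does not prove that a proxy or TLS inspection device along the path will allow the actual HTTPS traffic.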

At this point the Data Collector should be able to successfully execute jobs.

 

Browser <-> Data Collector

To author pipelines, the user needs to access a Data Collector instance, either directly via the built-in DC UI or via the Control Hub pipeline designer. By default the Data Collector runs on port 18630, so you would need both inbound and outbound rules for that port targeting the end user's network (e.g., the corporate VPN).

Note that if you are running Control Hub over HTTPS (as is the case for our Cloud instance), then the Data Collectors will also need to run HTTPS in order to be accessible for authoring. This can be configured in the sdc.properties file. SSL certificates should be signed by a certificate authority. If your DC uses a self-signed certificate, then the end user will need to manually trust the certificate in their web browser before the DC can be used for authoring.
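To get a rough idea of whether a DC's certificate will be accepted without a manual trust step, you can attempt a verified TLS handshake against it. The sketch below uses Python's standard `ssl` module. The function name `certificate_is_trusted` is hypothetical, and the default port of 18630 assumes the DC serves HTTPS on its default port. Note that Python checks against the system CA store, which may differ from the store a given browser uses.

```python
import socket
import ssl


def certificate_is_trusted(host: str, port: int = 18630, timeout: float = 5.0) -> bool:
    """Return True if the TLS certificate at host:port verifies against the system CAs.

    Returns False when verification fails (e.g. a self-signed certificate or a
    hostname mismatch). Connection problems are raised as OSError.
    """
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

A `False` result suggests end users would see a browser warning and need to trust the certificate manually before using that DC for authoring.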

 

Summary

For this example, let's assume that Control Hub has been deployed on port 443, and Data Collector will use the default of 18630:

Service          | Inbound                          | Outbound
Control Hub      | 443 from the end user network    | 443 to the end user network
                 | 443 from the DC network          | 443 to the DC network
Authoring DC     | 18630 from the end user network  | 18630 to the end user network
                 |                                  | 443 to the Control Hub
Non-authoring DC | N/A                              | 443 to the Control Hub
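
If you want to keep these rules in a form that can be checked or fed into firewall tooling, the summary above can be written as data. This is only a sketch under the same example assumptions (Control Hub on 443, Data Collector on 18630). The `Rule` type and `rules_for` helper are hypothetical names, not part of any StreamSets tooling.

```python
from dataclasses import dataclass

CONTROL_HUB_PORT = 443        # example port chosen for Control Hub
DATA_COLLECTOR_PORT = 18630   # Data Collector default port


@dataclass(frozen=True)
class Rule:
    service: str     # which component the rule applies to
    direction: str   # "inbound" or "outbound"
    port: int
    peer: str        # the network or service on the other side


RULES = [
    Rule("Control Hub", "inbound", CONTROL_HUB_PORT, "end user network"),
    Rule("Control Hub", "inbound", CONTROL_HUB_PORT, "DC network"),
    Rule("Control Hub", "outbound", CONTROL_HUB_PORT, "end user network"),
    Rule("Control Hub", "outbound", CONTROL_HUB_PORT, "DC network"),
    Rule("Authoring DC", "inbound", DATA_COLLECTOR_PORT, "end user network"),
    Rule("Authoring DC", "outbound", DATA_COLLECTOR_PORT, "end user network"),
    Rule("Authoring DC", "outbound", CONTROL_HUB_PORT, "Control Hub"),
    Rule("Non-authoring DC", "outbound", CONTROL_HUB_PORT, "Control Hub"),
]


def rules_for(service: str) -> list:
    """Return every rule in the summary that applies to the given service."""
    return [rule for rule in RULES if rule.service == service]
```

For example, `rules_for("Non-authoring DC")` returns a single outbound rule, which matches the point that a DC used only for running jobs needs no inbound access.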