Problem statement:
If you are doing DR testing, or one of your Kafka brokers is unavailable, you might see the following error message in the job log.
Error getting metadata for topic. error: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
Explanation:
Suppose the bootstrap list contains three brokers and the first one goes down, during DR or for some other reason. If the client's request to the first broker times out, the pipeline does not retry against the next available broker; instead, it fails with the “timeout” error above. Ideally, the client should try every broker in the list before marking the request as failed.
server1:9092,server2:9092,server3:9092
This can happen with Kafka client library versions 2.6 and below. See KIP-601 for more details.
Solution:
- Upgrade the Kafka client library to 2.7 or above and tune the socket timeouts accordingly. This version introduces the two configurations below, which let the client control the socket connection timeout.
socket.connection.setup.timeout.max.ms
socket.connection.setup.timeout.ms
- Decrease the TCP SYN retry value in the file /proc/sys/net/ipv4/tcp_syn_retries to 3 (default: 6).
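
The two items above can be sketched in Java as follows. This is a minimal illustration, not a recommended configuration: the class name KafkaSocketTimeoutConfig, its method names, and the timeout values 3000 and 10000 are assumptions chosen for the example, and only the two socket.connection.setup.* keys and bootstrap.servers come from the Kafka client itself. The second method estimates how long a connection attempt to an unreachable host can block at the OS level, assuming Linux's usual 1-second initial SYN timeout that doubles on each retry.

```java
import java.util.Properties;

public class KafkaSocketTimeoutConfig {

    // Builds client properties with the socket setup timeouts from KIP-601.
    // The timeout values are illustrative assumptions; tune them for your network.
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Initial time the client waits for a socket connection to be established.
        props.setProperty("socket.connection.setup.timeout.ms", "3000");
        // Upper bound the setup timeout may grow to after repeated connection failures.
        props.setProperty("socket.connection.setup.timeout.max.ms", "10000");
        return props;
    }

    // Approximate worst-case seconds before a TCP connect gives up,
    // assuming a 1 s initial SYN timeout that doubles on every retry.
    public static long synTimeoutSeconds(int tcpSynRetries) {
        long total = 0;
        long wait = 1;
        // One wait after the initial SYN plus one after each retry.
        for (int attempt = 0; attempt <= tcpSynRetries; attempt++) {
            total += wait;
            wait *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        Properties props = build("server1:9092,server2:9092,server3:9092");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
        System.out.println("tcp_syn_retries=6: ~" + synTimeoutSeconds(6) + " s");
        System.out.println("tcp_syn_retries=3: ~" + synTimeoutSeconds(3) + " s");
    }
}
```

The resulting Properties object can be passed to a consumer, producer, or admin client constructor. Under the stated assumption, the default of 6 SYN retries lets one connection attempt to a dead broker block for roughly 127 seconds, while a value of 3 reduces that to roughly 15 seconds.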