Got a Question?
Can't find what you're looking for? Ask it here!
- 298 Topics
- 651 Replies
I have a use case: my org is using Data Collector 3.21.x to ingest non-sensitive data into an S3 bucket. We now have a requirement to bring sensitive data through Data Collector into an S3 bucket (specifically for sensitive data). My questions:
- I do not want all users who have access to StreamSets Data Collector to view the pipeline that brings in the sensitive data. Can I do that with the open source version (3.21.x) of Data Collector? If yes, please advise how.
- Would it be possible to restrict which users can view a given set of pipelines in StreamSets Data Collector (open source version)? Example: I want ONLY members of the Finance team to view Finance pipelines in Data Collector.
Note: I know I can spin up a separate instance of Data Collector just to handle sensitive data, but I do not want to go down that path. Thanks, Srini
I'm running SDC in a Docker container and I am trying to make an HTTP call with the HTTP Client using a restricted header, like below:
    curl -X GET -H "Host: my-host" http://my-service/my-endpoint
However, by default the header is ignored with the following warning:
    HttpUrlConnector - Attempt to send restricted header(s) while the [sun.net.http.allowRestrictedHeaders] system property not set. Header(s) will possibly be ignored
Where do I set this property? Adding it to the SDC config did not help.
When writing to the Amazon S3 destination, why do these “WARN” messages show up in the logs? Do they have any impact on the running pipeline?
    No content length specified for stream data. Stream contents will be buffered in memory and could result in out of memory errors.
When writing to the Kafka Producer destination, it doesn’t always handle timeout exceptions (as shown below) and the pipeline does not honor the On Record Error » Send to Error setting on the Kafka Producer destination. How can this be resolved?
    Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
        at org.apache.kafka.clients.producer.KafkaProducer$FutureFailure.<init>(KafkaProducer.java:1186)
        at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:880)
        at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:803)
        at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:690)
        at com.streamsets.pipeline.kafka.impl.BaseKafkaProducer09.enqueueMessage(BaseKafkaProducer09.java:64)
        at com.streamsets.pipeline.stage.destination.kafka.KafkaTarget.writeOneMessagePerRecord(KafkaTarget.java:242)
        ... 30 more
    Caused by: org.apache.kafka.common.errors.TimeoutExc
Team, we are loading data from Hadoop Avro files into a SQL Server DB. Whenever the StreamSets process is running, SQL DB performance degrades because of the mapper execution. We checked with our DBA, who suggested decreasing the number of insert connections. We need to reduce the number of mappers, and keep it stable, for this data load. Is there any way to limit mapper creation or the insert operations? Appreciate your help!
Hi Team, I am trying to read files from AWS S3 using the Amazon S3 origin. It reads the files at the first level but doesn’t read files recursively. In the below image, for bucket “s3a://sotero-transformer/input/”, it reads the files userdata1.parquet and userdata2.parquet but doesn’t read the files under “dirone”. What configuration is needed to read all the files under the S3 bucket recursively?
We want to create an integration test suite and run it in a self-contained world of headless containers. The high-level goal is:
- docker-compose to start all containers: streamsets, kafka
- configure streamsets
- send data to a kafka topic
- StreamSets processes the data in the topic and sends it to other topics
- the application under test processes the data (fails / passes) - the test
- tear down
Question: How do I configure a StreamSets pipeline without using the UI?
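One possible way to script the "configure streamsets" step is to export a working pipeline as JSON once and import it over the Data Collector REST API when the containers start. The sketch below only builds the HTTP request and does not send it. The endpoint path `/rest/v1/pipeline/{pipelineId}/import`, the `overwrite` query parameter, the `X-Requested-By` header, basic authentication, the `admin`/`admin` credentials, and the pipeline ID `demo-pipeline` are all assumptions to verify against the REST API documentation of your own Data Collector instance; `build_import_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_import_request(base_url, pipeline_id, exported_pipeline, user, password):
    """Build (but do not send) a POST request that uploads an exported pipeline.

    The endpoint and headers are assumptions about the SDC REST API;
    confirm them against your instance before relying on this.
    """
    quoted_id = urllib.parse.quote(pipeline_id, safe="")
    url = f"{base_url}/rest/v1/pipeline/{quoted_id}/import?overwrite=true"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(exported_pipeline).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Requested-By": "sdc",  # assumed to be required for write calls
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder standing in for the JSON exported from a working pipeline.
export = {"placeholder": "exported pipeline JSON goes here"}
req = build_import_request(
    "http://localhost:18630", "demo-pipeline", export, "admin", "admin"
)
# In the test harness, send it with urllib.request.urlopen(req) once SDC is up.
```

Keeping the exported JSON in the test repository means the pipeline definition is versioned alongside the docker-compose file and the tests that depend on it.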
Trying to read XML in the following format with “:” throws a "Can't parse XML element names containing colon ':'" error.
    <sh:root>
      <sh:book> </sh:book>
      <sh:genre> </sh:genre>
      <sh:id> </sh:id>
      <sh:book> </sh:book>
      <sh:genre> </sh:genre>
      <sh:id> </sh:id>
      <sh:book> </sh:book>
      <sh:genre> </sh:genre>
      <sh:id> </sh:id>
    </sh:root>
What’s the best way to read such XML? (Note that changing “:” to “_” in the XML works.)
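The workaround mentioned in the post (changing ":" to "_") can be applied as a preprocessing step before the XML reaches the parser. This is a minimal sketch of that idea in plain Python, not a Data Collector feature: `flatten_prefixes` is a hypothetical helper, and the element values in `sample` are invented for illustration. It rewrites element names only, so attribute names such as `xmlns:sh` are left untouched, and a regex like this could also alter tag-like text inside CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Matches the start of an opening or closing tag whose name carries a prefix,
# for example "<sh:book" or "</sh:book".
_PREFIXED_TAG = re.compile(r"(</?)([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)")


def flatten_prefixes(xml_text: str) -> str:
    """Rewrite <prefix:name> as <prefix_name> so element names contain no colon."""
    return _PREFIXED_TAG.sub(r"\1\2_\3", xml_text)


# Invented sample values; the structure follows the XML in the post.
sample = (
    "<sh:root>"
    "<sh:book>Dune</sh:book><sh:genre>SF</sh:genre><sh:id>1</sh:id>"
    "</sh:root>"
)

flat = flatten_prefixes(sample)
root = ET.fromstring(flat)            # parses because no prefixes remain
tags = [child.tag for child in root]  # child element names after rewriting
```

If the real documents declare the `sh` namespace, a namespace-aware parser is usually the cleaner option; the rewrite above is only for input where the prefix is undeclared or the downstream reader rejects colons.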
I'm trying to enable Kerberos for my SDC RPM installation, but when I start SDC I get the following exception:
    java.lang.RuntimeException: Could not get Kerberos credentials: javax.security.auth.login.LoginEx
    Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
        at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:897)
        at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:760)
        at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
How do I move forward?
I have a StreamSets Data Collector running in Docker, and when I run a pipeline with Kafka Consumer I am seeing these error messages:
    The configuration = was supplied but isn't a known config
    The configuration schema.registry.url = was supplied but isn't a known config
How do I get past this error?
I have a few JDBC-based stages in my pipeline (Origin, JDBC Lookup, etc.) and when I try to replace the existing JDBC-based origin with another (for example, Oracle CDC with MySQL Binary Log), the validation fails on the lookup processors only, with a “Failed to get driver instance with multiple JDBC connections” error, even though I haven’t changed anything on those processors. Here’s the stack trace…
    java.lang.RuntimeException: Failed to get driver instance for jdbcUrl=jdbc:oracle:thin:@connection_URL
        at com.zaxxer.hikari.util.DriverDataSource.<init>(DriverDataSource.java:112)
        at com.zaxxer.hikari.pool.PoolBase.initializeDataSource(PoolBase.java:336)
        at com.zaxxer.hikari.pool.PoolBase.<init>(PoolBase.java:109)
        at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:108)
        at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:81)
        at com.streamsets.pipeline.lib.jdbc.JdbcUtil.createDataSourceForRead(JdbcUtil.java:875)
        at com.streams
Unable to write object to Amazon S3: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I am trying to write to the Amazon S3 destination with its Authentication Method set to AWS Keys, but when I run the pipeline I get an “Unable to write object to Amazon S3: The request signature we calculated does not match the signature you provided. Check your key and signing method.” error. Here’s the entire stack trace…
    Caused by: com.streamsets.pipeline.api.StageException: S3_21 - Unable to write object to Amazon S3, reason : com.amazonaws.services.s3.model.AmazonS3Exception: The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 12345678915ABCDE; S3 Extended Request ID: xyzxyzxyzxyzxyzxyz=; Proxy: null), S3 Extended Request ID: xyzxyzxyzxyzxyzxyz=
        at com.streamsets.pipeline.stage.destination.s3.AmazonS3Target.write(AmazonS3Target.java:182)
        at com.streamsets.pipeline.api.base.configurablestage.DTarget.write(DTarget.java:34)
        at com.streamsets.datacoll
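For context on what "signature does not match" means: with AWS Signature Version 4, the client derives a signing key from the secret access key, and the service repeats the same derivation on its side. The sketch below shows only that key derivation, written in plain Python rather than taken from Data Collector or the AWS SDK, to illustrate that a single stray character in the configured secret (for example a trailing space picked up when pasting) yields a different key and therefore a different signature. The secret, date, and region values are made up, and a mistyped secret is only one possible cause of this error.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    """HMAC-SHA256 of msg under key, returned as raw bytes."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the Signature Version 4 signing key by chaining four HMAC steps."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Made-up inputs for illustration only.
secret = "EXAMPLE-SECRET-KEY"
good_key = sigv4_signing_key(secret, "20210801", "us-west-2", "s3")
bad_key = sigv4_signing_key(secret + " ", "20210801", "us-west-2", "s3")

# The two keys differ, so any request signed with bad_key is rejected.
keys_match = hmac.compare_digest(good_key, bad_key)
```

Re-entering the access key ID and secret access key by hand, or checking them for leading and trailing whitespace, is a cheap first check before looking at clock skew or endpoint settings.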
I want to run Data Collector on my machine. I have gone through some Git README.md files and some blogs, and they all say to download the tarball, but I cannot find the URL to download it from. Can someone help with this, or suggest another way to get the tarball?
Hi, I'm currently using SDC 3.21 and I'm hitting the error that is also mentioned in this thread - https://issues.streamsets.com/plugins/servlet/mobile#issue/SDC-12129 Any suggestions on how to resolve this issue permanently? At present the only workaround I have is to restart StreamSets. I did that in my development (local) environment, but that's not an option in Production. Regards, Swayam
Is there a prebuilt processor/component that captures the number of records processed through the stages, along with other logging events? We have a requirement to capture the number of records processed and other logging events, and possibly store them in log files or MySQL.
While trying to ingest XML data from S3 into Snowflake, I am facing the error below:
    S3_SPOOLDIR_01 - Failed to process object 'UBO/GSRL_Sample_XML.xml' at position '0': com.streamsets.pipeline.stage.origin.s3.BadSpoolObjectException: com.streamsets.pipeline.api.service.dataformats.DataParserException: XML_PARSER_02 - XML object exceeded maximum length: readerId 'com.dnb.asc.stream-sets.us-west-2.poc/UBO/GSRL_Sample_XML.xml', offset '0', maximum length '2147483647'
The size of the XML file is 4 MB. The properties used for the Amazon S3 component have been attached. I also increased Max Record Length to its maximum.
S3 properties:
- Max Record Length: 2147483647
- Data Format: XML
Can you please advise? Is there a size-related constraint? We have successfully loaded smaller files from S3 to Snowflake.
Dear StreamSets, we have a requirement to transform complex XML data into JSON using XSLT. This needs to be done in Data Collector. The incoming file will contain millions of records; for each record, we need to apply the XSLT and write the output to an S3 location. I could not find anything about XSLT support in the Data Collector documentation. Could you please help me with this query?
Note 1: We also have a similar use case to transform JSON data to XML. Does StreamSets support the use of FreeMarker in a Data Collector pipeline?
Note 2: Both XSLT and FreeMarker use external Java functions to support the transformation.
Note 3: Both XSLT and FreeMarker templates are compiled once per run for better performance.
Regards, Varadha
Hi, I have an XML document as shown below:
    <events>
      <event>
        <type>online</type>
        <event_date>1-Jan-21</event_date>
        <feedback_status>Closed</feedback_status>
      </event>
      <event>
        <type>online</type>
        <event_date>1-Jan-20</event_date>
        <feedback_status>Closed</feedback_status>
      </event>
      <event>
        <type>online</type>
        <event_date>1-Aug-21</event_date>
        <feedback_status>Open</feedback_status>
      </event>
      <event>
        <type>offline</type>
        <event_date>1-Mar-21</event_date>
        <feedback_status>Closed</feedback_status>
      </event>
      <event>
        <type>offline</type>
        <event_date>1-Feb-20</event_date>
        <feedback_status>Closed</feedback_status>
      </event>
    </