Recently active topics
Process a list of file URLs stored in a file
Hi guys, I have a requirement where I have thousands of JSON file URLs in a file (all files are of the same format). I need to process the data for every file and load the data into a destination. Example:

    local_dir/file_list.txt:
    https://example1.json
    https://example2.json
    https://example3.json
    https://example4.json
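For illustration, the pattern outside of a dedicated stage looks roughly like this; a minimal Python sketch, assuming the requests library is available (the file path and load_to_destination are hypothetical placeholders):

    import requests

    def load_to_destination(record):
        # hypothetical loader; replace with the actual write to your target
        print(record)

    # Read the list of URLs, one per line, skipping blanks.
    with open("local_dir/file_list.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()            # fail fast on bad URLs
        load_to_destination(resp.json())   # every file shares the same format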
How to convert a timestamp and write it into MySQL
I am using the Field Type Converter to convert the column into a timestamp and write it into MySQL, but it only writes the 2023 records. How can I get the 1970 records written? Can this be fixed with another processor? This is my configuration. The 1970 data is not writable in MySQL; only the 2023 data gets written.
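One thing worth ruling out (a hedged aside, not a confirmed diagnosis): if the source column holds Unix epoch numbers, the seconds-versus-milliseconds interpretation alone decides whether a value lands in 1970 or 2023, and MySQL's TIMESTAMP type cannot store values at or before the 1970-01-01 epoch, whereas DATETIME can. A quick Python check with a sample value:

    from datetime import datetime, timezone

    raw = 1672531200  # hypothetical sample value from the source column

    # Treated as seconds since the epoch: lands in 2023.
    print(datetime.fromtimestamp(raw, tz=timezone.utc))         # 2023-01-01 00:00:00+00:00

    # The same digits treated as milliseconds: lands in 1970.
    print(datetime.fromtimestamp(raw / 1000, tz=timezone.utc))  # 1970-01-20 08:35:31.200000+00:00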
Reading an Avro file from an S3-like bucket and trying to convert it into Parquet
I am reading an Avro file from a custom cloud store similar to S3 and trying to convert it into a Parquet file using the whole file evaluator, but it gives an error:

    Record1-Error Record1 CONVERT_01 - Failed to validate record is a whole file data format : java.lang.IllegalArgumentException: Record does not contain the mandatory fields /fileRef, /fileInfo, /fileInfo/size for Whole File Format.
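The error itself says the records reaching the converter lack /fileRef and /fileInfo; those fields are only present when the origin reads with the Whole File data format, so the origin's data format setting is the first thing to check. For comparison, a standalone conversion sketch in Python, assuming the fastavro and pyarrow packages (file paths are placeholders):

    import fastavro
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Read all Avro records into memory (fine for modestly sized files).
    with open("input.avro", "rb") as f:
        records = list(fastavro.reader(f))

    # Build an Arrow table from the list of dicts and write it as Parquet.
    table = pa.Table.from_pylist(records)
    pq.write_table(table, "output.parquet")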
JDBC Multitable Consumer error
Hi, I am new to StreamSets; I started using it today. I have a task to back up some data to Hive and Hadoop using the JDBC Multitable Consumer. The problem is that a table has a column name with '-' instead of '_', for example call-outcomme_id where the correct one should be call_outcomme_id. I used the Field Remover to keep that column with 'Keep Listed Fields', but an error occurs when I start the pipeline; when I select 'Remove Listed Fields' instead, the column I targeted is deleted and the pipeline works. Thanks in advance.
Error: Java heap space
I have 98 columns/fields and 250,000 rows, and every time I run the pipeline it errors out. What I'm doing now is reducing the settings:

    Max Batch Size (Records) = 10
    Max Clob Size (Characters) = 10
    Max Blob Size (Bytes) = 10
    Fetch Size = 10

because with the default of 1000 the error occurs. How can I handle this problem?
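For scale, a rough back-of-envelope calculation shows why the default batch of 1000 records can be heavy for wide rows; shrinking the batch is one lever, and raising the engine heap via SDC_JAVA_OPTS (-Xmx) is the other. The 100-bytes-per-field figure below is an assumption, not a measurement:

    rows_per_batch = 1000          # default Max Batch Size (Records)
    columns = 98
    avg_bytes_per_field = 100      # assumed average; CLOB/BLOB fields can be far larger

    batch_bytes = rows_per_batch * columns * avg_bytes_per_field
    print(f"~{batch_bytes / 2**20:.1f} MiB per batch")  # ~9.3 MiB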
Subscription created to raise a ServiceNow incident for a failed job is not triggered
I have created a subscription as suggested in the related topic, but the subscription is not triggered. I am able to create an incident at the pipeline level when the pipeline fails, using the notification tab and providing webhook details for the incident table. I also want some information about the job failure in the incident table, e.g. a short description like "job_name 'abc' failed" along with the error details. Is this possible using a subscription? If yes, please help me with the same.
How to troubleshoot connection timeout related issues in Azure Synapse stage
Issue: In the Azure Synapse stage, some errors initially look like a problem with loading/writing data to Synapse but can actually be caused by a connection timeout:

    AZURE_STORAGE_07 - Could not get a Stage File Writer instance to write the records: 'AZURE_STORAGE_11 - Could not load file to Azure Storage.
    AZURE_STORAGE_02 - Azure Synapse load request failed: 'AZURE_DATA_WAREHOUSE_09 - Could not merge staged file
    com.streamsets.pipeline.api.StageException: AZURE_DATA_WAREHOUSE_00 - Could not perform SQL operation

To determine the root cause, please turn on debug logging by adding the following to log4j to investigate further:

    logger.l5.name = com.streamsets.pipeline.stage.common.synapse
    logger.l5.level = DEBUG
    logger.l6.name = com.streamsets.pipeline.stage.destination.datawarehouse
    logger.l6.level = TRACE

The debug log may give more hints and show whether the issue is related to a connection timeout, for example:

    Connection is not available, request timed out after 30000ms. Channel
JDBC Lookup Stage Performance Troubleshooting
How to troubleshoot JDBC Lookup performance issues: when facing performance issues in the JDBC Lookup stage, there are a variety of different variables that may cause or contribute to the problem. Below are some initial techniques you can add to your tool belt to help diagnose the issue. (Note: while this article targets JDBC Lookup specifically, the same principles from steps 1 and 2 apply to the JDBC Query Consumer as well.)

Step 1: Enabling DEBUG logging. Enabling DEBUG can often provide much of the information you need to determine the next action for the investigation. It can add context to error messages you have already observed, and it may also surface errors that were previously not visible. Additionally, it can provide information that, while not an obvious error, offers crucial insight into the pipeline's operation, such as statistical data. Debug can be configured in Data Collector via the
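As a sketch of what such a log4j entry can look like, following the same pattern as the Synapse article above (the logger key and package name here are assumptions, not confirmed values):

    logger.l7.name = com.streamsets.pipeline.lib.jdbc
    logger.l7.level = DEBUG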
Connection management - is it possible to switch environments seamlessly?
Compared with Informatica and SSIS, I am puzzled by connection management in StreamSets. If I build a pipeline that reads from MSSQL and create a connection for that server and database, I can use it as the source. However, at the beginning I only want to connect to a DEV instance, and the connection is dedicated to it. After I finish testing and want to move to user acceptance testing (UAT), I can't use this connection. It feels as if my pipeline would have to be modified to use a connection dedicated to UAT. But this isn't right! The pipeline shouldn't need to be changed! In Informatica or SSIS, when I need to switch from DEV to UAT, my "pipeline" doesn't need to be changed; there is always a mechanism that lets the connection be switched from DEV to UAT seamlessly. I imagine StreamSets also has a way to enable seamless switching of environments with regard to connections, but I can't find it. I would appreciate it very much if someone could share
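For what it's worth, one mechanism commonly used for this in Data Collector is runtime parameters: the stage references a parameter instead of a hard-coded connection string, and each environment supplies its own value when the job starts. A rough sketch, with a hypothetical parameter name and placeholder URLs:

    # pipeline parameter, referenced in the stage configuration as ${JDBC_CONNECTION_STRING}
    JDBC_CONNECTION_STRING = jdbc:sqlserver://dev-sql01:1433;databaseName=sales   # DEV job
    JDBC_CONNECTION_STRING = jdbc:sqlserver://uat-sql01:1433;databaseName=sales   # UAT job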
Incorrect job metrics in orchestration pipeline
I want to evaluate the job metrics field of the orchestrator tasks JSON that the Start Jobs origin outputs in an orchestration pipeline. The started job contains a pipeline that moves 5 records from a CSV file in a directory to a Snowflake table. I need to confirm from the job metrics field that the pipeline's output count equals its input count. But when I preview the orchestration pipeline, both the input and output record counts show 0. Is there any additional configuration required that I have missed? I have attached the preview screenshot along with the job instance pipeline.
Error HTTP_42 - Failing stage. Connection timeout as per configuration
Issue: Error HTTP_42 - Failing stage. Connection timeout as per configuration, in the HTTP Client processor/destination.

Solution: We recommend increasing the "Max Request Timeout" configuration in the HTTP stage (the default is 60 seconds). If increasing "Max Request Timeout" doesn't resolve the error, please perform network testing, such as generating a tcpdump, to investigate your networking environment, and open a ticket with Support for further troubleshooting.
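As part of that network testing, a simple TCP-level probe from the engine host can separate a slow endpoint from a blocked one; a minimal Python sketch (host and port are placeholders):

    import socket
    import time

    start = time.monotonic()
    try:
        # Attempt a plain TCP connection to the target endpoint.
        socket.create_connection(("api.example.com", 443), timeout=10).close()
        print(f"TCP connect OK in {time.monotonic() - start:.2f}s")
    except OSError as exc:
        print(f"connect failed after {time.monotonic() - start:.2f}s: {exc}")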
Hi Team, I have a few queries on architecture and feature support:

When using StreamSets Cloud (SaaS), can I deploy the control plane in our network? Or does the control plane reside within StreamSets' boundaries, whereas the processing takes place in the client's AWS account?
For the latter, a follow-up question: will the client need to install an agent to communicate with the control plane, or does the control plane require direct access, using some kind of cross-account role, to spin up, manage, and spin down resources like EMR?
Does any data (in Preview or Debug mode) go back to the control plane or StreamSets cloud infrastructure?
Is CDC supported for MongoDB, DynamoDB, PostgreSQL, and AuroraDB?
With Kafka, does it support Kerberos-based authentication and authorization?
Can I replay the data at any point in the pipeline?
Does it offer connectivity to on-premises databases over the TCPS protocol?
Does it offer push-based processing for sources like Oracle, SQL Server, and Snowflake?
Finally, does StreamSets support
Write to an already existing text file in Local FS
I have a text file in my "/tmp" directory named "data.txt". Every time I run the pipeline, I want to write data to that same file. I was trying to use Local FS, but I couldn't find any way to append data to the same file. Can anyone help me out with this?
Connecting to OPC UA from StreamSets
Hi, I was trying to connect the OPC UA Client origin to various free OPC UA servers, such as Ignition, Prosys OPC UA Simulation Server, and Integration Objects OPC UA Server Simulator, but was unable to connect; the StreamSets OPC UA client refused to connect to them. However, when I tried connecting the Integration Objects OPC client to its own server, it connected fine. Can someone guide me through the steps to connect the OPC UA Client, and also mention any suitable OPC UA servers?
How to modify a fragment icon?
By default, the pipeline canvas displays each fragment in the pipeline as a single stage with a puzzle-piece icon. You can modify the icon to represent the fragment's processing logic: select one of a set of predefined icons, or upload an image to use as the icon. Uploaded images cannot exceed 100 KB in size.

While viewing a fragment in edit mode, click the General tab in the properties panel, then click the Edit icon next to the Icon property. In the Icons dialog box, select one of the predefined icons or upload a custom image to use as the icon. Note: you can upload a single image to the dialog box; uploading another image replaces the existing uploaded image. After you publish the fragment, all pipelines that use that fragment version display the fragment as a single stage with the modified icon. More information about fragments can be found here.
DataOps Platform - REST API Origin - How to invoke pipeline?
Hi all, I created a microservice pipeline with the REST API origin on the DataOps platform, following this video: https://www.youtube.com/watch?v=wIZWMV1bMl4. Now I am unable to invoke the pipeline using third-party tools like Postman. I have chosen the default settings in SDC, with headers X-SDC-APPLICATION-ID: sdc_microservice and Content-Type: application/json, yet I get "Error: connect ETIMEDOUT xx.xx.xx.xx:8000". However, when I use the same URL in a curl command on the host where the SDC container is running, I see a 200 OK response:

    curl -i -X GET http://xx.xx.xx.xx:8000/rest/v1/user --header "X-SDC-APPLICATION-ID:sdc_microservice"

There is no connectivity issue. So, can anyone tell me why I am unable to use Postman, or whether I am missing anything here? Below are images showing the same:
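The same request the working curl command makes can be replayed from the machine where Postman runs, to confirm whether port 8000 is reachable from there at all; a minimal Python sketch, assuming the requests library (the IP placeholder is kept from the post):

    import requests

    resp = requests.get(
        "http://xx.xx.xx.xx:8000/rest/v1/user",
        headers={"X-SDC-APPLICATION-ID": "sdc_microservice"},
        timeout=10,
    )
    print(resp.status_code, resp.text)

If this also times out, the problem is network reachability (a firewall or an unexposed container port) rather than the pipeline itself.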
Facing an error in the on-prem setup of SCH
I was working through the on-prem setup of SCH and have completed the steps up to configuring Control Hub, the relational databases, the system Data Collector, etc. Now, when I create the required tables for each database using the "sudo dev/01-initdb.sh" command, the tables are created for the first database, "Security", but it throws an error on the next database, "PipelineStore". The error starts from the red arrow marked in the screenshot. Can someone help me with this?
Understanding and Resolving the NoClassDefFoundError: Common Causes and Solutions
The NoClassDefFoundError is a runtime error in Java that occurs when the Java Virtual Machine (JVM) or a ClassLoader instance attempts to load the definition of a class that cannot be found. The class definition existed at compile time but is not available at runtime. The load is attempted as part of a normal method call or when creating an instance of the class with the new expression, and no definition of the class can be found; therefore, the error can occur during the linking or loading of the unavailable class.

Common causes of a class definition being unavailable at runtime, and solutions to try:

Missing JAR file: first, identify the JAR file that contains the class for which you are receiving the NoClassDefFoundError. Once identified, verify that the JAR file exists in the installation directory structure of the StreamSets engine. Alternatively, check whether it is located in other paths referenced by the data plane engine (e.g., the USER_LIBRARIES_DIR variable, whi
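A small helper for the "identify the JAR" step: the sketch below scans a directory of JARs for the class file, assuming Python is available (the directory and class name are placeholders supplied on the command line):

    import sys
    import zipfile
    from pathlib import Path

    def find_class(jar_dir: str, class_name: str) -> None:
        # A class com.example.Foo lives in a JAR as com/example/Foo.class.
        entry = class_name.replace(".", "/") + ".class"
        for jar in Path(jar_dir).rglob("*.jar"):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if entry in zf.namelist():
                        print(f"{class_name} found in {jar}")
            except zipfile.BadZipFile:
                pass  # skip corrupt or non-zip files

    # e.g. python find_class.py /path/to/engine/libs com.example.Foo
    find_class(sys.argv[1], sys.argv[2])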
StreamSets has added support for running Transformer for Spark on a new cluster type: Amazon EMR (Elastic MapReduce) Serverless. Choose it or any of the other available supported cluster types.

What is Amazon EMR Serverless? Amazon EMR Serverless is a feature of Amazon EMR that lets users run big data processing workloads without having to provision or manage any compute resources, which means you don't have to know much about starting, stopping, and managing clusters to get started. With Amazon EMR Serverless, users can focus on their data processing tasks without worrying about managing clusters or paying for idle resources. It is a flexible option for processing big data workloads, where users pay only for the processing time their jobs require: clusters automatically spin down when not in use, and the service easily scales up or down as needed to meet workload demands.

The benefits of Transformer for Spark + EMR Serverless: By choosing an Amazon EMR Serve
Details on SDC Metrics
Question: How are the SDC metrics collected and what do they represent?

Answer: The graphs on the SDC Metrics page in StreamSets Data Collector, and on the Metrics tab of the Execution Engine page in Control Hub, are system-level metrics collected from standard Java core libraries. They should roughly correspond to metrics reported by other system-level tools, such as top and uptime at the command line, as well as external monitoring tools that show system-level metrics. Note that the numbers reported by different tools won't match each other exactly, due to differences in reporting intervals and other factors, but there should be a correlation.