Recently active
Hi, we are processing S3 files with a batch size of 1000, and we plan to store the output in S3 as well. Since the input file has 10,000 records, we are seeing 10 output files in S3. Per the client's requirement, we need to produce a single file. Is there a way to create a single S3 output file from StreamSets?
My pipeline is configured to pick up data from a JDBC Multitable Consumer origin and write it to a Hadoop FS destination. My requirement is to rename the output file at the destination as TableName_TimeStamp. I am able to get the timestamp using an Expression Evaluator. How do I get the table name from the event passed from Hadoop FS so it can be used in the HDFS File Metadata executor?
In the setup deployment step of the StreamSets tutorial, when I run the update-nodes.sh script in the Strigo environment, I get the following errors every time:
Error response from daemon: endpoint with name wonderful_volhard already exists in network streamsets-core
Error response from daemon: endpoint with name wonderful_volhard already exists in network streamsets-integrations
Error response from daemon: endpoint with name wonderful_volhard already exists in network streamsets-cooked
Also, the tutorial says I should see 3 engines in Control Hub, but I only see 2.
Hi, I have a product table named ‘product’ in MySQL as follows:

product_id | Product | FieldName
1 | Milk | milk
2 | Water | water
3 | Coffee | coffee

Then I have a source, fully de-normalized table named ‘raw_transaction’ as follows:

transaction_Id | Date | customer | milk | water | coffee
1 | 1/1/2021 | John | 1
2 | 1/1/2021 | Mary | 1 | 1
3 | 1/1/2021 | Anna | 1

Can you give me a hint on how I can create a pipeline in StreamSets that uses the product table as metadata to build a dynamic query and populate a ‘FactCustomerProduct’ table as follows:

For each product in products
INSERT INTO FactCustomerProduct (product_id, date_id, customer_id, transaction_id, quantity)
SELECT p.product_id, r.date_id, customer_id, r.transaction_id, r.<fieldName>
FROM ‘raw_transaction’ r [...] WHERE r.<fiel
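A minimal sketch of the dynamic-query idea, assuming the product metadata has already been read into a list of (product_id, field_name) pairs. The table and column names simply mirror the ones in the question, the JDBC connection is left out, and the WHERE predicate is a guess since the original query is truncated:

```python
# Sketch only: build one INSERT ... SELECT per product row, using the
# product table as metadata. How the statements are executed (JDBC Query
# executor, a scripting processor, etc.) depends on your pipeline design.
products = [
    (1, "milk"),
    (2, "water"),
    (3, "coffee"),
]

statements = []
for product_id, field_name in products:
    statements.append(
        "INSERT INTO FactCustomerProduct "
        "(product_id, date_id, customer_id, transaction_id, quantity) "
        f"SELECT {product_id}, r.date_id, r.customer_id, r.transaction_Id, r.{field_name} "
        # The filter below is an assumption; the original WHERE clause was truncated.
        f"FROM raw_transaction r WHERE r.{field_name} IS NOT NULL"
    )

for stmt in statements:
    print(stmt)
```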
The configuration is set up according to the linked Oracle CDC documentation (https://docs.streamsets.com/portal/datacollector/3.17.x/help/datacollector/UserGuide/Origins/OracleCDC.html#concept_rs5_hjj_tw), but it runs into the error below:
JDBC_52 - Error starting LogMiner
Caused by: com.streamsets.pipeline.api.StageException: JDBC_603 - Error while retrieving LogMiner metadata: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
How does JDBC destination handle fields without matching columns? I am encountering a situation where it appears that the fields without matching columns are ignored, but this is not defined in the documentation and seems counter to what I would expect (an exception complaining about a column not existing in the table). Please provide a deeper description of what is happening here.
Is there a maximum size for a pipeline title?
A directory was created in IIS (Windows) and published via FTPS. When trying to use this directory in StreamSets with the SFTP/FTP/FTPS component, it returns the following error: "REMOTE_11 - Unable to connect to remote host 'ftps://ftps.hostname.net:921/PLV' with given credentials. Please verify if the host is reachable, and the credentials and other configuration are valid. The logs may have more details. Message: Could not list the contents of "ftps://ftps.hostname.net:921/PLV" because it is not a folder. : conf.remoteConfig.remoteAddress" The credentials are OK, since the directory can be opened using the LFTP client on Linux. Is there some StreamSets configuration missing to fix this problem?
Hi, how do we extract all the pipeline names authored and jobs committed by a user from Control Hub? Regards, Anirban
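A rough sketch of how this might look with the StreamSets Python SDK. The credential arguments and the attribute names used for filtering (committer, creator) vary by SDK version and are assumptions here, so check the SDK reference for your release:

```python
# Sketch only: list pipelines and jobs associated with one user via the
# StreamSets Python SDK. Credential arguments and the attribute names
# used for filtering are assumptions; adjust to your SDK version.
from streamsets.sdk import ControlHub

sch = ControlHub(credential_id='<credential_id>', token='<token>')  # assumed auth style

user = 'anirban@example.com'  # hypothetical user id

# Pipelines whose last commit was made by the user (attribute name assumed).
user_pipelines = [p.name for p in sch.pipelines if getattr(p, 'committer', None) == user]

# Jobs created by the user (attribute name assumed).
user_jobs = [j.job_name for j in sch.jobs if getattr(j, 'creator', None) == user]

print(user_pipelines)
print(user_jobs)
```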
I'm using Oracle as a source and S3 as the destination. I am ingesting records from the source and adding the table name as a column through an Expression Evaluator. I want to use this table name to create a folder in S3 dynamically before dropping those records into the S3 bucket, so the folder name should be created dynamically by StreamSets from the name of the table. What should be the approach for this? For example, if I'm fetching records from table abc, I need to create a folder called “abc” and drop all the records inside that folder.
Hi, could you please provide a snippet showing how to establish a connection to MongoDB from a Groovy script?
When the JDBC Query executor is used to empty table data, the pipeline does not stop after the task starts. Note: a Pipeline Finisher has been attached to the JavaScript component. SQL query: delete from depart_passenger_info
Can you please help me with the points below. How can we find out how many times StreamSets has retried failed records, and what their data is? What value should we give in the Base Backoff Interval field, and what other settings do we have to configure? I ask because the incoming data to StreamSets does not match the processed record count, and the difference between the two keeps increasing. Can you please suggest something on this.
Hi, I have tried to copy all the files from one folder to another within the same S3 bucket using a StreamSets job, but only 1 or 2 files are copied into the destination folder compared to the source folder (for example, if there are 7 files in the source folder, I see only 1 or 2 copied to the destination). Can anyone help me with this issue? Thanks, Murali
In Control Hub I can see which values are available for the Action property of the Field Remover, but how do I find them through the SDK?
field_remover = pipeline_builder_14.add_stage('Field Remover')
For field_remover.action, how do I know through the SDK which values are allowed? Thanks, Ashok.
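One way to poke at this from a Python session is plain introspection. The snippet below only uses built-in Python tools and reuses the pipeline_builder_14 object from the question; the idea that the generated docstring lists the permitted enum values (e.g. matching the Control Hub drop-down) is an assumption to verify against your SDK version:

```python
# Sketch only: inspect a stage object from the SDK with built-in Python
# introspection to see its configurable attributes and documentation.
# Assumes pipeline_builder_14 from the question already exists.
field_remover = pipeline_builder_14.add_stage('Field Remover')

# List the attributes the SDK generated for this stage.
print([a for a in dir(field_remover) if not a.startswith('_')])

# The generated class documentation often describes each property and,
# for enum-style properties such as 'action', the permitted values
# (assumed to match the Control Hub drop-down).
help(type(field_remover))
```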
Lookups (into a Delta table) give extremely bad performance (sometimes the pipeline stays in the pre-execution stage forever) when used in Transformer with an origin of 1000 records, although it works reasonably well in streaming mode, which I guess is due to the smaller number of incoming records.
Hi team, how do we add new stages to an existing pipeline using the StreamSets Python SDK?
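A hedged sketch of one approach with the Platform Python SDK. Whether a fetched pipeline object supports add_stage() directly depends on the SDK release, so treat the method and argument names here as assumptions to check against the SDK documentation:

```python
# Sketch only: fetch an existing pipeline, add a stage, and publish the
# change back to Control Hub. Method availability varies by SDK version.
from streamsets.sdk import ControlHub

sch = ControlHub(credential_id='<credential_id>', token='<token>')  # assumed auth style

pipeline = sch.pipelines.get(name='My Existing Pipeline')  # hypothetical pipeline name

# Newer SDK releases allow editing a fetched pipeline in place (assumed).
expression_evaluator = pipeline.add_stage('Expression Evaluator')

# Re-wire the flow as needed (e.g. origin >> new stage >> destination),
# then publish a new pipeline version.
sch.publish_pipeline(pipeline, commit_message='Added Expression Evaluator via SDK')
```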
I want to extract multiple fields from JSON/XML using the XML Parser, etc. I am able to extract them with Groovy, but I want to achieve it as follows: read a file from S3 using the XML data format, then in step 2 extract multiple fields from the XML:
<body><head>1</head><m>3</m><tail>2</tail></body>
In step 2 I want to have 2 values in my output without using any Groovy; I want to achieve this using the XML Parser or Field Mapper, etc. As of today I can only extract one value, e.g. /body/head, but I want to extract both /body/head and /body/tail.
I’m looking for a tool to help prevent missing job version updates when moving code from one environment to another (development / UAT / production). If we update a pipeline and job in our development environment but move ancillary code to our test environment a week later, I’ve seen that it is easy for our team to miss the test environment job update, resulting in wasted testing time. It would be helpful if we could see at a glance the differences in job versions before doing a deployment, to help validate that we have the correct list of jobs to update as part of that deployment. I see the REST API and could put together a script to compare versions, but I was wondering if there is any way we could visually see that in Control Hub, or even a way to build a pipeline or report that could be run to give this information. Thanks in advance! -Spyder
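Short of something visual in Control Hub, a small SDK script along these lines could flag mismatches. The two-organization setup, the credential arguments, and the pipeline_commit_label attribute are all assumptions to verify for your environment and SDK version:

```python
# Sketch only: compare job pipeline versions between two environments
# (e.g. dev and test organizations) and report jobs whose versions differ.
# Credentials and the 'pipeline_commit_label' attribute are assumptions.
from streamsets.sdk import ControlHub

dev = ControlHub(credential_id='<dev_credential_id>', token='<dev_token>')
test = ControlHub(credential_id='<test_credential_id>', token='<test_token>')

dev_versions = {j.job_name: getattr(j, 'pipeline_commit_label', None) for j in dev.jobs}
test_versions = {j.job_name: getattr(j, 'pipeline_commit_label', None) for j in test.jobs}

# Print jobs whose pipeline version in test lags behind (or is missing from) dev.
for name, dev_version in sorted(dev_versions.items()):
    test_version = test_versions.get(name)
    if test_version != dev_version:
        print(f'{name}: dev={dev_version} test={test_version}')
```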
A pipeline’s origin is an S3 bucket. Error records are configured to “Send Response to Origin.” What exactly happens to the error records in this instance?
I am trying to create a Transformer pipeline using the Python SDK but am unable to connect to the Transformer engine; I am getting two ids and URLs from the sch.transformers command. Please help me.
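If sch.transformers returns two engine entries, the pipeline builder needs to be pointed at the one you actually want. A rough sketch follows; the keyword arguments to get_pipeline_builder differ between SDK versions and are assumptions here:

```python
# Sketch only: pick one registered Transformer engine explicitly and use
# it to seed a pipeline builder. Keyword argument names vary across SDK
# versions and are assumptions.
from streamsets.sdk import ControlHub

sch = ControlHub(credential_id='<credential_id>', token='<token>')  # assumed auth style

# Inspect the registered Transformer engines to find the reachable one.
for transformer in sch.transformers:
    print(transformer.id, transformer.url)

# Choose the engine whose URL the SDK host can actually reach (placeholder id).
engine = sch.transformers.get(id='<transformer_engine_id>')

builder = sch.get_pipeline_builder(engine_type='transformer', engine_id=engine.id)
```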
Hi team, to fix the Log4j vulnerability, I upgraded StreamSets from streamsets/datacollector:3.18.1 to streamsets/datacollector:4.2.0. After that, I am not able to create a new pipeline or import one; the user interface asks me to connect to Control Hub or enter an activation code, which was not the case in version 3.18.1.
Hi, I have tried to copy all the files from one folder to another within the same S3 bucket using a StreamSets job, but I am seeing more files in the destination folder than in the source folder (for example, if there are 7 files in the source folder, I see more than 7 in the destination, like 8, 10, or 12). This issue only occurs the first time each day; if I run the same job again later that day, the count matches between source and destination. Can anyone help me with this issue? Thanks, Murali
I would like to learn how to use the python SDK. How do I go about getting an activation key for use with a personal account?