
I am currently having trouble seeing which expressions I can use in which fields.

Concrete example:
I used the Google Cloud Storage destination and wanted to make use of some
values from the records in the name of the file it creates, so I wanted to set
Object Name Prefix or Object Name Suffix to:
sdc-order-${record:value('/ORDER_ID')}-${record:value('/ORDER_TYPE')}

 

Doing that results in this error:
“CREATION_005 - Could not resolve implicit EL expression 'sdc-order-${record:value('/ORDER_ID')}-${record:value('/ORDER_TYPE')}': com.streamsets.pipeline.api.el.ELEvalException: CTRCMN_0100 - Error evaluating expression sdc-order-${record:value('/ORDER_ID')}-${record:value('/ORDER_TYPE')}: javax.servlet.jsp.el.ELException: No function is mapped to the name "record:value"”

I had to find out that this does not work.
I could only use:

  • Plain Text
  • Pipeline Parameters ( ${pipeline_parameter} ), see the sketch below
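
For example, what did work for me (just a sketch; ORDER_PREFIX is a name I made up, defined as a key/value pair on the pipeline's Parameters tab):

  Parameters tab:       ORDER_PREFIX = sdc-order
  Object Name Prefix:   ${ORDER_PREFIX}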

So how can I see what works in which area? It costs quite some time to figure out what works and what does not.

 

Many thanks for your help.

Hi @robert.bernhard

The quickest way of finding out which functions a specific configuration field accepts is to hit Ctrl-Enter in the field itself; it will show a pop-up with a list of the functions / parameters (if you have any set in the pipeline) that you can use, as you can see below.

In the case of the GCS destination, if you start typing ‘r’ you will see that the record:value() function is not actually listed as available.

I hope this helps!


Hi @robert.bernhard 

Also, logically, don't you think it's a bad idea to base the file name on a record value? Imagine 1000 records need to be processed: if each record has a different value for that field, it's going to create 1000 different files, which you may not want and which doesn't seem like a practical use case.

Regards

Swayam

 


@robert.bernhard , the key thing to remember is that GCS is NOT a normal filesystem: you can’t write individual records to an object, you have to write the entire object in one go; hence, when writing to GCS (same with AWS’s S3), you can’t really direct individual records to different objects. And you should actually consider using a larger batch size too, to optimise the write.


Another technique could be to write files to a locally mounted drive and then, once they've reached a predefined size / number of records, move them to the GCS data lake.


@Giuseppe Mura that is a good hint and it helps.
Still, I think it would be good if all functions were supported in such fields, and I must say the drop-down is a mix of fields and functions. So you might think you can write ${/ORDER_ID}, as it is offered there,
but that is not possible, as you can only use it inside record:value('/ORDER_ID').
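
To illustrate the two notations with my ORDER_ID field (just a sketch of how I understand it):

  /ORDER_ID                      -> the field path, as offered in the drop-down
  ${record:value('/ORDER_ID')}   -> how that field has to be referenced inside an EL expression, in stages that support record functions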



Do you see any way to make use of a field from the record or header there?

@swayam it depends on what you want to do. 🙂 Just because you pick a column from the record does not mean that you get 1000 files. Just imagine you have 2 types of orders, ONLINE and STORE, and you want to split them into 2 different files. Also, by default StreamSets uses a GUID inside the file name. Another use case: you want to create one file per table received from CDC, and then you would like to use e.g. record:attribute('oracle.cdc.table') in the folder name or file name.
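
Something along these lines, for example (just a sketch; it assumes a destination whose directory template accepts record functions, e.g. the Local FS destination, and the oracle.cdc.table header attribute set by the Oracle CDC origin):

  Directory Template: /data/cdc/${record:attribute('oracle.cdc.table')}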
 


@Giuseppe Mura I am aware of what can be done with GCS. I did not intend to append data to files.
When it is driven by the data you want to handle, you normally just create files and read them in sorted by timestamp, so any data coming in would end up in new files.

To be honest, I wondered why the partition field exists in the GCS destination at all.
GCS has only virtual directories, as internally everything is ultimately just an object.
It merely displays a directory-like structure.

So:
partition: directory
global name prefix: /
would be the same as:
partition: /
global name prefix: directory/
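
At least that is how I understand it: roughly speaking, both combinations should produce object keys under the same 'directory/' pseudo-folder, e.g. (with <prefix> and <id> standing for the configured object name prefix and the generated unique id):

  directory/<prefix>-<id>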


Hi @robert.bernhard 

If you have a need to split files based on order type ONLINE or OFFLINE, then the best thing is to use a stream selector to route traffic to 2 different branches based on the order type. By doing that, at least the developer is aware of the fact that 2 different files are going to be created. However, if SDC allowed the developer to use the record value as a dynamic parameter in the file name, then imagine what happens when there is data drift (and the value of order type can be anything, as it comes from a source platform where data sanity may not be guaranteed): SDC would have no control over it.
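
As a sketch (assuming the order type sits in a field called /ORDER_TYPE), the Stream Selector conditions could look like:

  Condition 1: ${record:value('/ORDER_TYPE') == 'ONLINE'}
  Condition 2: ${record:value('/ORDER_TYPE') == 'OFFLINE'}
  Default stream: anything else (e.g. route to error or discard)

Each condition's output stream then feeds its own GCS destination with a fixed Object Name Prefix.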

I hope your use case can be addressed using the stream selector in a more elegant way.


@swayam Thanks for your suggestions. We are just evaluating the product to see what is possible and what is not. That is how I noticed that functions cannot be used in all places, and the hint from @Giuseppe Mura helped me see where what works. I can see that there are many ways to solve things in StreamSets.

Still, I feel we are drifting away from the initial problem and question: I see a need, for some use cases, for better control over the file name. Sometimes you even integrate with a system that has a fixed way of integrating via files, and you need to adapt to that. If you have no control over the order type, then you have another problem anyway, and in that case, as you already mentioned, you could use a stream selector to route those cases out via the default stream and report them as errors, so the last stage that writes the files is not a problem.

