Welcome to the StreamSets Community
Hello StreamSetters! I am excited to announce this quarter's StreamSets Community Champions! Drum roll please... A huge congratulations and thank you to @swayam & @Srinivasan Sankar for all their efforts and contributions to our community. Swayam has been at the forefront of creating community articles and submitting example pipelines, all while constantly helping others. Srinivasan is also one of our most engaged members, with 13 posts created, 35 likes given to other members, and 11 likes received on his topics. Lastly, a major kudos to both for helping others find the correct solutions!

@swayam & @Srinivasan Sankar, please look out for a Community Platform DM; I will be reaching out for your shipping addresses. I will be sending each of you a StreamSets Community Swag pack and an Academy certification coupon code that covers 100% of the cost to take the exam. Lastly, this badge will be added to your profiles!

What's New
- DataOps Summit CFP is now OPEN! Submit today.
- Snowpark Day:
In this issue we have many new and exciting community updates, engaging programs, and a new contest where you can win a $250 gift card! Check out others' submissions. Before we get into it, I would like to introduce everyone to Brenna, the new Developer Advocate on the StreamSets team and in our community. She will be creating engaging content around the StreamSets product line through blog posts, tutorials, and sample code, and presenting at meetups! Now let's get into it, Data Rock Stars!

What's New
- Hot new ebook: Data Engineers' Handbook for Snowflake
- Win $250 by participating in the "Share Your Pipelines" contest!
- New Slack channel: #community-newsletter-share-your-findings
- Over 400 knowledge base articles, previously not seen by our external community, have been added to the community platform
- DataOps Summit CFP opens 2/22. Stay tuned!

Industry News / Helpful Reads
- Stories about Data Engineering on Medium
- The vast majority of data engineers are burnt out. Those working in healthcare are no exception.
Here's a Kafka Stream Processor pipeline that reads events from a "raw" topic, performs streaming transforms, and publishes the transformed events to a "refined" topic that multiple downstream clients subscribe to:

[Image: Kafka Stream Processor Pipeline]

Here is the Stream Processor pipeline's placement within a StreamSets topology, which allows visualization and monitoring of the end-to-end data flow, with last-mile pipelines moving the refined events into Delta Lake, Snowflake, Elasticsearch, S3, and ADLS:

[Image: Kafka Stream Processor with Multiple Consumers]

Users can extend the topology using Transformer for Snowflake to perform push-down ETL on Snowflake using Snowpark, and Transformer on Databricks Spark to perform ETL on Delta Lake.

[Image: Push-down ETL on Snowflake and Databricks Delta Lake]
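For readers who want to see the raw-to-refined pattern outside the pipeline canvas, here is a minimal Python sketch of the same consume-transform-produce loop using the confluent_kafka client. The broker address, topic names, and the transform itself are illustrative assumptions, not the pipeline's actual configuration.

```python
# Minimal sketch of the raw -> refined pattern using confluent_kafka.
# Broker address, group id, and the transform body are assumptions.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "stream-processor",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw"])

def transform(event: dict) -> dict:
    # Placeholder for the streaming transforms the pipeline performs.
    event["processed"] = True
    return event

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        refined = transform(json.loads(msg.value()))
        # Publish to the "refined" topic that the downstream last-mile
        # pipelines (Delta Lake, Snowflake, Elasticsearch, S3, ADLS) read.
        producer.produce("refined", json.dumps(refined).encode("utf-8"))
        producer.poll(0)
finally:
    producer.flush()
    consumer.close()
```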
Hello everyone, welcome to the new StreamSets Community Platform. This platform brings together all of our users, customers, partners, and technical employees for peer-to-peer questions, conversations, and knowledge base articles. If you have platform feedback, suggestions, or ideas, please leave a reply within this thread. We are continuing to learn and grow. We hope to see you in the community platform. :)
The time has come to announce our first ever quarterly Community Champion! Just a reminder of what each Champion will receive as a thank you for their efforts:
- One free StreamSets Academy certification exam
- Their very own Champion badge
- A personal highlight piece
- Custom SWAG!

Drum roll please…! Our first ever Community Champion, by unanimous deliberation, is Dash Desai! Congratulations to @Dash. Dash has been a key figure in the StreamSets community for over 4 years and spearheaded the Demos with Dash snack video series. Dash is also the co-host of the Sources and Destinations podcast.

What's New
Our community has GROWN! We now have over 300 members, over 220 knowledge base articles, and over 380 published topics on community.streamsets.com! Share by publishing your very own StreamSets articles HERE; showcase your usage, architecture, and know-how of the StreamSets DataOps Platform. Watch out! They just might end up as a Knowledge Base article as well. Shar…
Scenario: Processing data in JSON format using the Encrypt/Decrypt processor and loading it to a destination. Steps to implement the pipeline without receiving the "Invalid ciphertext type" error:

1. The first pipeline reads data in JSON format, sends it to the Encrypt/Decrypt processor with the action set to Encrypt, and loads it to a Kafka Producer with the data format set to JSON. A few columns are encrypted using Amazon KMS and fed to the Kafka Producer.
2. The second pipeline reads data from a Kafka Consumer with the data format set to JSON. A Field Type Converter is used to convert columns to Byte Array before feeding data into the Encrypt/Decrypt processor, because the Encrypt/Decrypt processor only accepts columns with the Byte Array data type. Columns that are converted to Byte Array are sent one by one into a Base64 Field Decoder. Please note that we can pass only one field per Base64 Field Decoder. From the Base64 Field Decoders, the columns are sent to the Encrypt/Decrypt processor with the action set to Decrypt, and finally the data is written to a file.
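To see why the Base64 decode is needed before decryption, here is a small Python round-trip sketch. JSON cannot carry raw bytes, so encrypted fields travel through Kafka as Base64 strings and must be decoded back to a byte array before decryption. The cryptography library's Fernet stands in for Amazon KMS here purely for illustration; the pipeline itself uses KMS-backed keys.

```python
# Round trip: encrypt -> JSON (Base64 string) -> Base64 decode -> decrypt.
# Fernet is a stand-in for Amazon KMS (an assumption for this sketch).
import base64
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()
f = Fernet(key)

# Pipeline 1: encrypt a column, then write the record as JSON to Kafka.
ciphertext = f.encrypt(b"alice@example.com")  # raw ciphertext bytes
record_json = json.dumps({"email": base64.b64encode(ciphertext).decode()})

# Pipeline 2: read the JSON record back. The "email" field is now a
# Base64 *string*, not bytes -- passing it straight to decrypt is what
# raises "Invalid ciphertext type". Decode it to a byte array first.
incoming = json.loads(record_json)
ciphertext_bytes = base64.b64decode(incoming["email"])
print(f.decrypt(ciphertext_bytes))  # b'alice@example.com'
```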
Ordnance Survey (OS) produces a suite of products for the UK market. One of these is AddressBase Premium (ABP), a dataset that describes buildings, businesses, residential properties, and other items you'd find on a map in detail, such as lat/long data, how the Royal Mail refers to them, and how local authorities/councils describe and classify those items. Sounds good, right? Well, it's not such a nice dataset to deal with: it is shipped in 5 km batches of data in a single CSV file, with 10 different schema patterns within it. Nightmare? Nope! You know the secret of StreamSets! StreamSets doesn't care about schema on read; that is the key to unlocking this. In the ABP files, the first column holds a record identifier, which tells us which schema and rules to use for processing. This pipeline makes a normally difficult issue clear and transparent. You can watch a record come in from the raw file, read it with no header, look at the first column, and route it to a given lane. That l…
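The core idea translates directly to plain Python: read the CSV with no header and dispatch each row on its first column. A minimal sketch follows; the record identifiers shown and the filename are illustrative examples, not the pipeline's exact routing table.

```python
# Sketch of "route on the first column": read the ABP CSV headerless and
# dispatch each row to a schema-specific handler. The identifiers used
# here are examples only; ABP defines one per schema pattern.
import csv

def handle_blpu(row):              # one handler per schema "lane"
    print("BLPU record:", row[:4])

def handle_delivery_point(row):
    print("Delivery point record:", row[:4])

HANDLERS = {
    "21": handle_blpu,
    "28": handle_delivery_point,
    # ...one entry per record identifier present in the file
}

with open("abp_5km_batch.csv", newline="") as fh:  # hypothetical filename
    for row in csv.reader(fh):
        handler = HANDLERS.get(row[0])
        if handler:
            handler(row)           # rows with other identifiers fall through
```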
Hello everyone, welcome to the StreamSets Community. I am excited to help empower members as they continue their journey to learn, share, and grow their knowledge to succeed each day. I want to introduce myself, and I hope to read more about you too! I am Drew Kreiger. I recently started as the Senior Community Manager here at StreamSets in April of 2021. Over the past 5 years, I worked with communities at Talend and at the company now called Redis, where I managed community meetups, education programs, hackathons, forums, and many other great programs with fantastic community members. I enjoy working with community members and helping users overcome issues and challenges. I also enjoy collaborating with users on community content and seeing the impact a blog, podcast, KB article, or event has made within the community. A fun fact about me: during college, I studied to become a sommelier. Cheers!
This pipeline is designed to orchestrate an initial bulk load followed by a change data capture (CDC) workload from an on-prem Oracle database to Snowflake Data Cloud. The pipeline takes into consideration the completion status of the initial bulk load job before proceeding to the next (CDC) step, and also sends success/failure notifications. The pipeline also takes advantage of the platform's event framework to automatically stop the pipeline when there's no more data to be ingested.
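The control flow is easier to see in miniature. Below is a Python sketch of the orchestration logic only; start_job, job_status, and notify are hypothetical placeholders standing in for the platform's orchestration stages, not the actual Control Hub API.

```python
# Control-flow sketch only. The three helpers below are hypothetical
# stand-ins for orchestration stages, not real platform calls.
import time

def start_job(name):
    print(f"starting job: {name}")
    return name                      # stand-in job id

_poll_count = 0
def job_status(job_id):
    # Pretends the bulk load finishes after a couple of polls.
    global _poll_count
    _poll_count += 1
    return "RUNNING" if _poll_count < 3 else "SUCCESS"

def notify(message):
    print(f"NOTIFY: {message}")      # stand-in for email/Slack notification

bulk_id = start_job("oracle-bulk-load-to-snowflake")
while job_status(bulk_id) == "RUNNING":   # gate the CDC step on the
    time.sleep(1)                         # bulk load's completion status
if job_status(bulk_id) == "SUCCESS":
    notify("Bulk load complete; starting CDC")
    start_job("oracle-cdc-to-snowflake") # stops on a "no more data" event
else:
    notify("Bulk load failed; CDC step skipped")
```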
Hi everyone, thank you to those who completed the survey. It has been really helpful for understanding where we are doing an awesome job and where we can improve. We wanted to share the responses and the pulse of the community today, which you'll find in the representative responses below. We will be conducting these surveys on a quarterly basis throughout the year. You can find them on the right side of each community page, labeled "Feedback". (REPORT)

Short Answer Question #1: Thank you for your honest feedback. How can we make the community more helpful for you?

Responses:
- "With more documentation of how we implement for streaming and Transformer" — Here are some helpful docs and Academy courses.
- "Not able to open ask.streamsets.com" — As of Dec 8, 2021, ask.streamsets.com has been sunset. Going forward, community.streamsets.com is the go-to community forum and knowledge base managed and hosted by StreamSets. We have migrated the top ask.streamsets.com content to our new platform and ungat…
We have launched a new program called Meetsy! It is a program for bringing us closer together to connect and learn from each other through automated 1:1 meetings. StreamSets technical employees will be joining this program, so it is a great way to ask questions and learn from StreamSets experts! Sign up here! *Make sure to fill out the questionnaire, as the tool uses it to align members with similar goals and wants from the program. You may also want to join our StreamSetters community Slack for easier notifications. I hope to see you as part of the Meetsy program! Please ask your questions within this thread.
I wanted to have a way to send chess game information from lichess.org (a popular chess server) to Elasticsearch and Snowflake to let me visualize statistics (e.g., how often does World Champion GM Magnus Carlsen lose to GM Daniel Naroditsky?) as well as to generate reports in Snowflake (e.g., when Magnus does lose, which openings does he tend to play?). I ended up accomplishing this with two pipelines. The first is a batch pipeline that ingests data using the lichess REST API and pushes it to a Kafka topic. It uses an endpoint that allows for pulling down all games by username, so I've parameterized the username portion and can then use Control Hub jobs to pick which specific users I want to study. The second pipeline consumes game data from my Kafka topic, does some basic cleanup of the data (adds a field for the name of the winner rather than the color of the winning pieces, and converts timestamps from long to datetime) as well as some basic enrichment (it adds a field that calculates…
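For reference, here is a rough Python sketch of what the first (ingest) pipeline does: stream a user's games from the lichess games-by-username endpoint as NDJSON and produce each game to a Kafka topic. The username, broker address, topic name, and query parameters are illustrative.

```python
# Sketch of the batch ingest: lichess REST API -> Kafka, one JSON
# document per game. Username, topic, and broker are assumptions.
import requests
from confluent_kafka import Producer

USERNAME = "DrNykterstein"  # parameterized in the real pipeline
producer = Producer({"bootstrap.servers": "localhost:9092"})

resp = requests.get(
    f"https://lichess.org/api/games/user/{USERNAME}",
    headers={"Accept": "application/x-ndjson"},  # newline-delimited JSON
    params={"max": 100},                         # cap for the example
    stream=True,
)
for line in resp.iter_lines():
    if line:                                     # one game per line
        producer.produce("lichess-games", line)
producer.flush()
```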
This pipeline is designed to capture inserts and updates (SCD Type II) being uploaded to a bucket on Amazon S3 for a slowly changing dimension table -- Customers. The pipeline creates new records with version set to 1 for new customers and with version set to (current version + 1) for existing customers. The customer records are then stored in Snowflake Data Cloud.
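The versioning rule is simple enough to show in a few lines of Python. In this sketch an in-memory dict stands in for the lookup against the existing Customers dimension in Snowflake; field names are assumptions.

```python
# SCD Type II versioning rule: new customers get version 1, existing
# customers get current_version + 1. The dict stands in for a lookup
# against the Customers dimension table (names are illustrative).
current_versions = {"C001": 2}   # customer_id -> latest known version

def next_version(customer_id):
    return current_versions.get(customer_id, 0) + 1

for record in [{"customer_id": "C001"}, {"customer_id": "C002"}]:
    record["version"] = next_version(record["customer_id"])
    current_versions[record["customer_id"]] = record["version"]
    print(record)   # C001 -> version 3 (update), C002 -> version 1 (new)
```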
Data comes from a monitor device, with test results for different elements. Each element has 3 values: its ID, its value, and any error messages. The customer wants a flattened list of just the monitor device and timestamp, with the result of each element on a separate line. Using the pipeline above we can achieve that:

1. We import the data, ignoring the header line (hence labeling the columns with numbers).
2. First we label the monitor "parent" fields using a Field Renamer.
3. We build an empty map so we can correctly parse the records, using an Expression Evaluator.
4. We use a field mapping processor to map the groups of 3 fields (checking that we only remap columns that are numeric).
5. Now that we have groups, we split the groups into records using a field pivot processor.
6. This leaves a single group per record, but still as a group, so let's tidy that up with a Field Flattener so all the records are at the same level.
7. Finally we use a Field Renamer to label the 3 fields we have produced.
8. We ship that off to our secure storage facility.
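The reshape itself, stripped of the processor plumbing, looks like this in Python. Column positions and field names are assumptions based on the description above.

```python
# Reshape sketch: keep the parent fields (device, timestamp) and emit
# one output record per (element_id, value, error) triple.
raw = ["monitor-7", "2022-03-01T10:00:00",
       "EL1", "42.0", "",        # element triples: id, value, error
       "EL2", "17.5", "range"]   # (illustrative data)

device, timestamp, *elements = raw
for i in range(0, len(elements), 3):
    element_id, value, error = elements[i:i + 3]
    print({"device": device, "timestamp": timestamp,
           "element_id": element_id, "value": value, "error": error})
```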
Join us and listen to the latest episode of our Sources and Destinations podcast from our hosts, @iamontheinet and Sean Anderson. S&D is a podcast about data engineering and data science, covering common design patterns and best practices. Listen wherever you get your podcasts. https://linktr.ee/sourcesanddestinations
We want to make sure your questions are seen and answered. If you ask your question the right way, we can answer it quickly and precisely. Here are some tips:

1. Before you ask, search first! Make sure to search for your question first. The search icon/bar can always be found: at the top of the home page, when creating a new topic, and within a topic post near your profile photo (top right corner).
2. Don't hesitate, ASK! This is a judge-free community. We are all here to support others and build our StreamSets knowledge.
3. Keep your data yours 🧑 We cannot stress this enough. Do not share any person's or your own personal information (email, phone number, address, banking info, etc.) in a screenshot, post, message, or anywhere on the StreamSets Community Platform.
4. Provide all information. Please be concise with your topic/question title. In your topic/question description, please be sure to provide as much detail as possible: platform, version, screenshots, categories…
We built this pipeline to compare the images received from Facebook and Twitter to determine whether two different profiles belong to the same customer or to different customers. We have a business case where we pull social media comments from different users on Facebook and Twitter. Apart from carrying out sentiment analysis, we also need to know if the same customer is being vocal on multiple platforms. Since customers normally don't share their PII data, when we pull comments along with profile images from Facebook and Twitter, we store the images and use image matching as one of the criteria in the probability score that two comments received from different platforms belong to the same person. Before running the above pipeline, we have pre-processed data with a probability score, where the name, city, age, and sentiments are already matched to come up with the best set of results to compare, available in a database. In the above pipeline, the origin is a MySQL…
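As one illustration of image matching, perceptual hashing scores how similar two images are; the sketch below uses the Pillow and imagehash libraries. This is an example approach, not necessarily the comparison method used in the pipeline above, and the filenames and threshold are assumptions.

```python
# One way to score whether two profile images match: perceptual hashing.
# Illustrative only; filenames and the distance threshold are assumptions.
from PIL import Image
import imagehash

hash_fb = imagehash.phash(Image.open("facebook_profile.jpg"))
hash_tw = imagehash.phash(Image.open("twitter_profile.jpg"))

distance = hash_fb - hash_tw   # Hamming distance between the two hashes
print("likely same person" if distance <= 8 else "likely different people")
```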
StreamSetters! We did not have anyone share Industry News / Helpful Reads for our 3rd edition. It's as easy as adding links below. Anything is game. No wrong answers! Here are a few ideas to get you going. Please share your links within this thread:

- Industry blogs and articles
- Events others in the community should join
- Exciting personal milestones
- StreamSets use cases and how others could benefit from your findings
- Other forms of media you think the community will benefit from :)

Let's make this a collaborative newsletter, everyone! -Drew Kreiger
Environment: SDC is integrated with CDH/CDP. Kerberos is enabled. Start-event/end-event scripts are used to get the Kerberos credentials.

Issue: The start-event/end-event scripts try to kinit and fail with the following error message:

ERROR stderr: kinit: Pre-authentication failed: Permission denied while getting initial credentials

Resolution: By default, Kerberos-enabled CDH/CDP clusters set the environment variable KRB5CCNAME. Add the following line at the beginning of the script to unset the variable temporarily:

unset KRB5CCNAME
Read 911 Event Data and Load to Snowflake
This pipeline uses the HTTP Client to read real-time 911 event data from the Seattle Fire Department. The pipeline then flattens a nested JSON object. It also uses the latitude and longitude of the incident location to look up the Census block information from the Census Bureau's geocoding API. Fun fact: while building this pipeline, an event came through for a dumpster fire happening on my block!
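Here is a small Python sketch of the lat/long-to-Census-block lookup against the Census Bureau's public geocoder. The coordinates are illustrative (downtown Seattle), and the benchmark/vintage values follow the geocoder's commonly documented defaults; treat the exact parameters as assumptions to verify against the API docs.

```python
# Sketch: look up the Census block for an incident's lat/long via the
# Census Bureau geocoding API. Coordinates are illustrative.
import requests

lat, lon = 47.6062, -122.3321
resp = requests.get(
    "https://geocoding.geo.census.gov/geocoder/geographies/coordinates",
    params={"x": lon, "y": lat,                 # note: x=longitude, y=latitude
            "benchmark": "Public_AR_Current",
            "vintage": "Current_Current",
            "format": "json"},
)
blocks = resp.json()["result"]["geographies"]["Census Blocks"]
print(blocks[0]["GEOID"])   # Census block id for the incident location
```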
This pipeline is designed to handle (embrace!) data drift while ingesting web logs from AWS S3, then transforming and enriching them before storing the curated data in Snowflake Data Cloud. A data drift alert is triggered if/when the data being ingested is missing a key field, IP Address, which is crucial for downstream analytics.
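The drift gate reduces to a simple presence check per record, sketched below in Python; the field name and the alert action are assumptions.

```python
# Sketch of the drift gate: records missing the key "ip_address" field
# take the alert path instead of flowing on to Snowflake.
records = [
    {"ip_address": "10.0.0.1", "path": "/index.html"},
    {"path": "/checkout"},     # drifted record: key field missing
]

for record in records:
    if "ip_address" not in record:
        print("DATA DRIFT ALERT: missing ip_address ->", record)
    else:
        print("ok, forward to Snowflake ->", record)
```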
Hi everyone, you might be asking, what is the difference between a question and a conversation? That is a great question. A conversation topic is used when you want to share something and involve the community in a discussion. A question topic is used when you need a solution to your question or problem from your community peers. I hope this helps. Let's go create questions and conversations!
Problem Statement
Very often we find duplicate records being sent by customers while exporting data from their DWH to us. To ensure data quality, we need to filter out the duplicate records before they enter our platform for processing, computation, and analysis.

Solution
We addressed this using 2 separate pipelines.

Pipeline 1: Files arriving in S3 get picked up by this pipeline, where for every record we calculate the MD5 hash and make an entry in Redis with the KEY as the MD5 hash value of the record and the VALUE as the "Filename-RowNumber" of the current record being processed.

Pipeline 2: The same file processed by Pipeline 1 gets picked up again by Pipeline 2, which for every record calculates the MD5 hash and checks its value in Redis to determine which row numbers we need to process and which ones we should drop as duplicate records.

Pipeline 1: Calculating the hash and inserting into Redis. As you may see above, the CSV file picked from the S3 origin gets converted to JSON…
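The hash-and-check logic can be sketched in a few lines with hashlib and redis-py. This collapses the two pipelines into one loop for brevity: SET with NX succeeds only for the first occurrence of a hash, which mirrors Pipeline 1's insert, while a failed SET plays Pipeline 2's duplicate check. Host, filename, and sample rows are assumptions.

```python
# Dedup sketch: the record's MD5 is the Redis key, "filename-rownumber"
# the value. SET ... NX only succeeds for the first occurrence.
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed Redis location

rows = ["alice,seattle,34", "bob,austin,29", "alice,seattle,34"]  # last is a dup
for i, row in enumerate(rows, start=1):
    digest = hashlib.md5(row.encode("utf-8")).hexdigest()
    if r.set(digest, f"input.csv-{i}", nx=True):
        print(f"row {i}: first occurrence, keep")
    else:
        print(f"row {i}: duplicate of {r.get(digest).decode()}, drop")
```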
In this pipeline we solve a fun data problem with the help of StreamSets Transformer. The Transformer execution engine runs data pipelines on Apache Spark, so we can run this pipeline on any Spark cluster type.

Problem Statement: Find the average number of friends for each age and sort the results in ascending order. We are given a fake friends dataset from a social networking platform, stored in CSV file format in Google Cloud Storage:

id,Name,Age,Number of Friends
0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
3,Deanna,40,465
4,Quark,68,21
5,Weyoun,59,318
6,Gowron,37,220
7,Will,54,307
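Since Transformer runs on Spark, the equivalent aggregation in PySpark is a few lines. The sketch below reads the CSV shown above; the local path is illustrative (the real file lives in Google Cloud Storage), and sorting by age ascending is the interpretation assumed here.

```python
# PySpark sketch of the aggregation: average number of friends per age,
# sorted ascending by age. Path is illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avg-friends-by-age").getOrCreate()

df = (spark.read.csv("fakefriends.csv", header=True, inferSchema=True)
           .toDF("id", "name", "age", "num_friends"))   # normalize headers

result = (df.groupBy("age")
            .agg(F.avg("num_friends").alias("avg_friends"))
            .orderBy("age"))
result.show()
```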