Welcome

This pipeline populates and updates a Slowly Changing Dimension for the NYC Citibike stations using Transformer for Snowflake. It uses StreamSets’ Slowly Changing Dimension stage, which handles all of the logic for Type 1 and Type 2 dimensions. The SCD stage produces the correct output along with the metadata that the Snowflake Destination uses to perform the merge.

Build a Slowly Changing Dimension with Transformer for Snowflake
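
For readers new to the two dimension types named in this post, here is a minimal Python sketch of the difference. It is not how the StreamSets SCD stage or the Snowflake merge is implemented. `StationRow`, its fields, and the station values are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class StationRow:
    # Hypothetical dimension row; the real Citibike dimension will differ.
    station_id: int
    name: str
    valid_from: str
    valid_to: Optional[str] = None
    is_current: bool = True


def apply_type1(rows: List[StationRow], station_id: int, new_name: str) -> List[StationRow]:
    """Type 1: overwrite the attribute in place, so no history is kept."""
    return [replace(r, name=new_name) if r.station_id == station_id else r for r in rows]


def apply_type2(rows: List[StationRow], station_id: int, new_name: str,
                change_date: str) -> List[StationRow]:
    """Type 2: close the current row and append a new current row."""
    out = []
    changed = False
    for r in rows:
        if r.station_id == station_id and r.is_current and r.name != new_name:
            # Expire the old version instead of overwriting it.
            out.append(replace(r, valid_to=change_date, is_current=False))
            changed = True
        else:
            out.append(r)
    if changed:
        out.append(StationRow(station_id, new_name, valid_from=change_date))
    return out
```

With Type 1 the table keeps one row per station. With Type 2 every change adds a row, and the `is_current` flag marks the latest version.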

Drew Kreiger, Senior Community Builder at StreamSets

Happy New Year, everyone! I want to say thank you to each of you for all your hard work and contributions to our StreamSets Community. We are close to welcoming our 300th member! Looking back at last year, we launched our new community platform with the addition of knowledge base articles, a monthly community-led newsletter, a bi-annual member feedback survey, a new and improved StreamSets Academy, and much more. Looking ahead into 2022, we are excited to launch virtual meetups, pipelines and patterns examples, and more based on the feedback we received from the member feedback survey.

StreamSets Community Wrap Up 2021! 🎉

Anonymous

Problem Statement: You’ve been asking everybody you meet the age-old question: What is your favorite dinosaur? You’ve been saving their responses in a series of delimited files that have this format. Tragically, the data you have meticulously gathered is dirty. Before you can execute your master plan to do whatever it is that you were planning to do with this information, you must clean your dataset and resave it to a single clean file.

The Pipeline: You read in the Dino data with the Directory origin. You remove duplicates on the Dinosaur column using the Record Deduplicator. You filter Dinos that have murdered people in the Jurassic Park film series, for reasons that are your own, using the Stream Selector. Finally, you save your Dino data to one single file on your local system. Your data is clean, filtered and ready for analysis.

If you’re reading this, please comment with your favorite dinosaur. It’s uh...for science.

What's Your Favorite Dinosaur? - StreamSets Pipeline to Clean Delimited Files
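
The same dedupe, filter, and write-once flow can be sketched outside of a pipeline in plain Python. This is only an approximation of the stages described in the post: the `Name` column, the sample rows, and the exclusion list are all assumptions, since the post does not show the file layout or say which species it filters.

```python
import csv
import io

# Hypothetical exclusion list; the post does not say which species it filters.
EXCLUDED_DINOS = frozenset({"Velociraptor", "Tyrannosaurus Rex"})


def clean_dino_rows(lines, excluded=EXCLUDED_DINOS):
    """Deduplicate on the Dinosaur column, then drop excluded species."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(lines):
        dino = (row.get("Dinosaur") or "").strip()
        if not dino or dino in seen:
            continue  # skip blanks and repeats of an already-seen dinosaur
        seen.add(dino)
        if dino not in excluded:
            cleaned.append({**row, "Dinosaur": dino})
    return cleaned


def write_single_file(rows, handle, fieldnames=("Name", "Dinosaur")):
    """Write all cleaned rows to one delimited output."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Usage with made-up survey responses held in memory instead of files.
raw = io.StringIO(
    "Name,Dinosaur\n"
    "Ana,Triceratops\n"
    "Bo,Triceratops\n"
    "Cy,Velociraptor\n"
    "Dee, Stegosaurus \n"
)
out = io.StringIO()
write_single_file(clean_dino_rows(raw), out)
```

Deduplication runs before the filter here to mirror the stage order in the post. Swapping the order would give the same output for this data.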

antmcmullen, StreamSets Employee

Ordnance Survey (OS) produces a suite of products for the UK market. One of these is Address Base Premium (ABP), a set of details that describe buildings, businesses, residences, and other items you’d find on a map, such as lat/long data, how the Royal Mail refers to them, and how the local authorities/councils describe and classify them. Sounds good, right? Well, it’s not such a nice set of data to deal with. It ships in 5 km batches, each a single CSV file containing 10 different schema patterns. Nightmare? Nope! You know the secret of StreamSets! StreamSets doesn't care about schema on read, and that is the key to unlocking this. In the ABP files, the first column holds a record identifier that tells us which schema and rules to apply. This pipeline makes a normally difficult problem clear and transparent. You can watch a record come in from the raw file, read it with no header, look at the first column, and process it to a given lane. That l

Ordnance Survey - Address Base Premium: Processing multi-schema documents easily
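
The routing idea in this post (read each row with no header, then pick a schema from the first column) can be sketched in plain Python. The record identifiers and column names in `SCHEMAS` are placeholders for illustration, not the actual ABP specification, and this is not the pipeline's own logic.

```python
import csv
from collections import defaultdict

# Placeholder schemas keyed by record identifier (illustrative only).
SCHEMAS = {
    "21": ["record_id", "uprn", "latitude", "longitude"],
    "28": ["record_id", "uprn", "postcode"],
}


def route_records(lines, schemas=SCHEMAS):
    """Send each headerless CSV row to a lane chosen by its first column."""
    lanes = defaultdict(list)
    for fields in csv.reader(lines):
        if not fields:
            continue  # ignore blank lines
        names = schemas.get(fields[0])
        if names is None:
            # Unknown identifiers are kept raw so nothing is silently lost.
            lanes["unknown"].append(fields)
        else:
            # zip stops at the shorter sequence, so extra columns are dropped.
            lanes[fields[0]].append(dict(zip(names, fields)))
    return dict(lanes)
```

Each lane can then apply its own rules, which is roughly what one stream per record type does in the pipeline.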

2 months ago

Drew Kreiger, Senior Community Builder at StreamSets

Hello everyone, welcome to the new StreamSets Community Platform. This platform will bring together all of our users, customers, partners, and technical employees for peer-to-peer questions, conversations, and knowledge base articles. If you have platform feedback, suggestions, or ideas, please leave a reply within this thread. We are continuing to learn and grow. We hope to see you in the community platform. :)

Community Platform: Feedback

3 months ago

Dash, Senior Technical Evangelist and Developer Advocate at Snowflake

This pipeline is designed to ingest streaming data from Kafka and load a trained ML model in a Scala custom processor to predict the sentiment of tweets. The pipeline runs on a Databricks cluster and stores each tweet along with its score in Delta Lake.

Real-time scoring using ML model

Drew Kreiger, Senior Community Builder at StreamSets

We have launched a new program called Meetsy! It brings members closer together to connect and learn from each other through automated 1:1 meetings. StreamSets technical employees will be joining this program, so it is a great way to ask questions and learn from StreamSets experts! Sign up here! *Make sure to fill out the questionnaire, as the tool uses it to better match members with similar goals and wants from the program. You may want to join our StreamSetters community Slack for easier notifications. I hope to see you in the Meetsy program! Please ask your questions within this thread.

Meet Meetsy! Connect & Learn from Each Other with 1:1 Meetings

1 year ago

swayam, Discovered Fame

We built this pipeline to compare the images received from Facebook and Twitter to determine whether two different profiles belong to the same customer or to different customers. We have a business case where we pull the social media comments that different users post on Facebook and Twitter. Apart from carrying out sentiment analysis, we also need to know whether the same customer is being vocal on multiple platforms. Since customers normally don’t share their PII data, when we pull comments along with profile images from Facebook and Twitter, we store the images and use image matching as one of the criteria for the probability score of whether two comments received from different platforms belong to the same person. Before running the above pipeline, we have pre-processed data, available in a database, with a probability score where the name, city, age, and sentiments are already matched to produce the best set of results to compare. In the above pipeline, the origin is a MySQ

Carrying out image comparison using SDC pipeline and Amazon Image Rekognition

Dash, Senior Technical Evangelist and Developer Advocate at Snowflake

This pipeline is designed to capture inserts and updates (SCD Type II) being uploaded to a bucket on Amazon S3 for a slowly changing dimension table -- Customers. The pipeline creates new records with version set to 1 for new customers and with version set to (current version + 1) for existing customers. The customer records are then stored in Snowflake Data Cloud.

Slowly but surely!
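
The versioning rule in this post (version 1 for new customers, current version + 1 for existing ones) can be illustrated with a small Python sketch. The `customer_id` key and the in-memory lookup are assumptions made for the example. The actual pipeline reads from Amazon S3 and writes to Snowflake, which this sketch does not attempt to reproduce.

```python
def assign_versions(current_versions, incoming):
    """Attach an SCD Type II version number to each incoming record.

    current_versions: dict mapping customer_id to the latest stored version.
    incoming: list of dicts, each with a 'customer_id' key (hypothetical name).
    """
    latest = dict(current_versions)  # copy so the caller's dict is untouched
    versioned = []
    for record in incoming:
        customer_id = record["customer_id"]
        # Unknown customers start at 1; known customers get current + 1.
        version = latest.get(customer_id, 0) + 1
        latest[customer_id] = version
        versioned.append({**record, "version": version})
    return versioned
```

Tracking `latest` inside the loop means two updates for the same customer in one batch still receive consecutive versions.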

asked in Show us your Pipelines

HIMANSHU_SURANA, Fan

Consuming MySQL binlog data in JDBC producer

I am trying to build a CDC pipeline to migrate a MySQL database to a MySQL database on a different server. Here’s the Data Collector pipeline I’ve created. As per the documentation here, the JDBC Producer should be able to process binlog data. I’ve used a Field Remover to keep only the /Data and /Table fields. When I run the pipeline, I get an error that the input record has no data for <schema>.<table>. How can I create a pipeline to consume binlog records?

8 months ago

Drew Kreiger, Senior Community Builder at StreamSets

posted in Events & Webinars

Sources and Destinations Podcast

Join and listen to the latest episode of our Sources and Destinations podcast from our hosts, @iamontheinet and Sean Anderson. S&D is a podcast about data engineering and data science that covers common design patterns and best practices. Listen wherever you get your podcasts. https://linktr.ee/sourcesanddestinations

asked in Show us your Pipelines

dixit.singla, Fan

MongoDB Atlas Connection seems to have a bug.

I was trying to add a MongoDB Atlas connection, and on clicking the test connection button I got the error “[unauthorized] not authorized on local to execute command”. I tested the connection string and credentials in Mongo Compass, and they worked fine there, so I could not understand why the test connection was failing. After many tries, I decided to save the connection as it was and try it in the pipeline. Magically, when used in the pipeline, I was able to fetch the data from MongoDB. When I tried the test connection again, I got the same error. Can anyone help me understand this behavior?

2 months ago

Drew Kreiger, Senior Community Builder at StreamSets

We want to make sure your questions are seen and answered. If you ask your question the right way, we can answer it quickly and precisely. Here are some tips:

1. Before you ask, search first! Make sure to search for your question first. The search icon/bar can always be found at the top of the home page, within creating a new topic, and within a topic post near your profile photo (top right corner).

2. Don't hesitate, ASK! This is a judge-free community. We are all here to support others and build our StreamSets knowledge.

3. Keep your data yours. We cannot stress this enough. Do not share your own or anyone else's personal information (email, phone number, address, banking info, etc.) in a screenshot, post, message, or anywhere on the StreamSets Community Platform.

4. Provide all information. Please be concise with your topic/question title. In your topic/question description, please be sure to provide as much detail as possible: Platform, Version, Screenshots, Categories

5 tips to ask your question the right way

antmcmullen, StreamSets Employee

Data comes from a monitor device, with test results for different elements. Each element has 3 values: its ID, its value, and any error messages. The customer wants to see a flattened list of just the monitor device and timestamp, with the result of each element on separate lines. Using the above we can achieve that. We import the data ignoring the header line (hence labeling the columns with numbers).

First, we label the monitor “Parent” fields using a field renamer.
We build an empty map so we can correctly parse the records, using an expression evaluator.
We use a field mapping processor to map those groups of 3 fields (checking that we only remap columns that are numeric).
Now that we have groups, we split the groups into records using a field pivot processor.
This leaves a single group per record, but still as a group, so we tidy that up with a field flattener so all the fields are at the same level.
Finally, we use a field renamer to label the 3 fields we have produced.
We ship that off to our secure storage faci

Split child records that are crosstabbed back into flat records
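
The overall effect of those stages (one wide row in, one flat record per element out) can be sketched in a few lines of Python. This collapses the rename, map, pivot, and flatten steps into one function and is not the pipeline's actual configuration. The two parent columns and the output field names are assumptions.

```python
def split_crosstab(row, parent_count=2, group_size=3):
    """Turn one wide row into one flat record per (id, value, error) group.

    The first `parent_count` columns are assumed to be the monitor device
    and the timestamp; the rest are repeating groups of three columns.
    """
    device, timestamp = row[:parent_count]
    children = row[parent_count:]
    records = []
    # Step through complete groups only; a trailing partial group is ignored.
    for start in range(0, len(children) - group_size + 1, group_size):
        element_id, value, error = children[start:start + group_size]
        records.append({
            "device": device,
            "timestamp": timestamp,
            "element_id": element_id,
            "value": value,
            "error": error,
        })
    return records
```

For example, a row holding a device, a timestamp, and two element groups yields two flat records that share the same device and timestamp.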

thomas.bennett, StreamSets Employee

This pipeline performs a Reverse ETL process. We take the enriched, coalesced data and then generate the proper JSON payload that is sent to MixPanel to either create or update users.

Reverse ETL (MySQL to MixPanel)

kate, StreamSets Employee

Transformer for Snowflake: Pipeline to Create Fact Table. This pipeline uses the NYC Citibike data to populate a fact table enriched with station and weather data. Since this is using Transformer for Snowflake, none of the data leaves the Snowflake Data Cloud, because the raw trips data is already loaded in Snowflake! The pipeline generates the SQL that is then executed on Snowflake, inserting the query results into the target tables. The pipeline uses weather data from Weather Source’s free Global Weather & Climate Data for BI data set on the Snowflake Data Marketplace, so you can see the weather conditions for each ride.

Using Transformer for Snowflake to create a fact table for Citibike Trips data

kate, StreamSets Employee

Read 911 Event Data and Load to Snowflake. This pipeline uses the HTTP client to read real-time 911 event data from the Seattle Fire Department. The pipeline then flattens a nested JSON object. It also uses the latitude and longitude of the incident location to look up the Census block information from the Census Bureau’s geocoding API. Fun fact: While building this pipeline, an event came through for a dumpster fire currently happening on my block!

Load Real-Time 911 Data to Snowflake
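
The flattening step mentioned in this post can be pictured with a short recursive Python function. This is a generic sketch, not the processor used in the pipeline. The dotted key separator and the sample event fields are assumptions and do not reflect the real Seattle Fire Department feed.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten(value, path, sep))
        else:
            # Scalars and lists are copied through unchanged.
            flat[path] = value
    return flat


# Made-up event shaped loosely like an incident record.
event = {
    "type": "Dumpster Fire",
    "location": {"latitude": 47.6, "longitude": -122.3},
}
flat_event = flatten(event)
```

After flattening, the coordinates are top-level columns (`location.latitude`, `location.longitude`), which is the shape a lookup or a warehouse table usually expects.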

Drew Kreiger, Senior Community Builder at StreamSets

We are excited to announce our new offering: StreamSets Transformer for Snowflake! Learn details about the new offering and about how we're shifting to a higher gear. Gain the insights here, https://blog.softwareag.com/streamsets-snowflake

General availability of StreamSets Transformer for Snowflake is here!

1 year ago

sundeep dhall, StreamSets Employee

A streaming data pipeline whose incoming transactions may include updates to existing ones. It looks up matching records in the database (Postgres) and filters transactions with a condition to determine whether each one is an update or new. For updates, it masks sensitive data and writes them to a Postgres table. New transactions are written to an AWS bucket, and unprocessed transactions (pipeline errors) are written to a Kafka destination for further analysis.

Match and update incoming transactions in a Postgres database (JDBC, Postgres, AWS, Kafka)
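
The routing described in this post (lookup, then update vs. new vs. error) can be sketched in Python as below. The field names `id` and `card_number`, the masking rule, and the in-memory set standing in for the Postgres lookup are all assumptions made for illustration. The three returned lists stand in for the Postgres, AWS, and Kafka destinations.

```python
def mask_card(number):
    """Hide all but the last four characters (a hypothetical masking rule)."""
    visible = number[-4:]
    return "*" * max(len(number) - 4, 0) + visible


def route_transactions(incoming, existing_ids):
    """Split transactions into updates, new records, and errors."""
    updates, new, errors = [], [], []
    for txn in incoming:
        try:
            if txn["id"] in existing_ids:
                # Matched an existing record: mask before writing the update.
                updates.append({**txn, "card_number": mask_card(txn["card_number"])})
            else:
                new.append(txn)
        except KeyError:
            # Records missing a required field go to the error stream.
            errors.append(txn)
    return updates, new, errors
```

Sending malformed records to a separate list mirrors the post's idea of keeping unprocessed transactions for later analysis rather than dropping them.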

asked in Show us your Pipelines

realmatcha, Opening Band

Connect SqlServer

Hi, I am new to StreamSets and started using it today. I have a task to back up some data from SQL Server to Hadoop and Hive. I have changed the JDBC connection to SQL Server, but it still doesn't work. Please help me.

8 months ago

Drew Kreiger, Senior Community Builder at StreamSets

Hello everyone, welcome to the StreamSets Community. I am excited to help empower members as they continue their journey to learn, share, and grow their knowledge to succeed each day. I want to introduce myself, and I hope to read more about you too!

I am Drew Kreiger. I started as the Senior Community Manager here at StreamSets in April of 2021. Over the past 5 years, I have worked with communities at Talend and now called Redis, where I managed community meetups, education programs, hackathons, forums, and many other great programs with fantastic community members.

I enjoy working with community members, as I enjoy helping users overcome issues and challenges. I also enjoy working with users on community content and seeing the impact a blog, podcast, KB article, or event has made within the community. A fun fact about me: during college, I studied to become a sommelier. Cheers!

Meet the Community | Introduce yourself

Drew Kreiger, Senior Community Builder at StreamSets

Hi everyone, thank you to those who completed the survey. This has been really helpful for understanding what we are doing an awesome job on and where we can improve. We wanted to share the responses and pulse of the community today, which you’ll find in the representative responses below. We will be conducting these surveys on a quarterly basis throughout the year. You can find them on the right side of each community page, labeled “Feedback”.

(REPORT) Short Answer Question #1: Thank you for your honest feedback. How can we make the community more helpful for you?

Responses

With more documentation of how we implement for streaming and transformer
Here are some helpful docs and academy course.

Not able to open ask.streamsets.com
As of Dec 8, 2021, ask.streamsets.com has been sunset. Going forward, Community.streamsets.com is the go-to community forum and knowledge base managed and hosted by StreamSets. We have migrated the top ask.streamsets.com to our new platform and ungat

Community Member Feedback: Survey Follow Up: Q4 2021

Rishi, StreamSets Employee

In this pipeline we solve a fun data problem with the help of StreamSets Transformer. The Transformer execution engine runs data pipelines on Apache Spark, so we can run this pipeline on any Spark cluster type.

Problem Statement: Find the average number of friends for each age and sort them in ascending order.

We are given a fake friends dataset from a social networking platform, stored in CSV format in Google Cloud Storage:

id, Name, Age, Number of Friends
0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
3,Deanna,40,465
4,Quark,68,21
5,Weyoun,59,318
6,Gowron,37,220
7,Will,54,307

Solving Fun Data problem with Streamsets transformer
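
To make the problem statement concrete, here is a plain Python sketch of the same aggregation over the eight sample rows from the post. It does not use Spark or Transformer. It also assumes "ascending order" means sorted by age, which the post does not spell out.

```python
from collections import defaultdict

# The eight sample rows from the post: id, name, age, number of friends.
SAMPLE = """0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
3,Deanna,40,465
4,Quark,68,21
5,Weyoun,59,318
6,Gowron,37,220
7,Will,54,307"""


def average_friends_by_age(csv_text):
    """Return (age, average friends) pairs sorted by age, ascending."""
    totals = defaultdict(lambda: [0, 0])  # age -> [sum of friends, row count]
    for line in csv_text.strip().splitlines():
        _id, _name, age, friends = line.split(",")
        bucket = totals[int(age)]
        bucket[0] += int(friends)
        bucket[1] += 1
    return [(age, total / count) for age, (total, count) in sorted(totals.items())]


averages = average_friends_by_age(SAMPLE)
```

In this small sample every age appears only once, so each average equals that person's friend count. With the full dataset, several rows per age would be averaged together.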

Dash, Senior Technical Evangelist and Developer Advocate at Snowflake

This pipeline is designed to ingest data from Amazon S3 and prepare it for training an ML model using a PySpark custom processor. Once the Gradient Boosted model is trained, the model artifacts, features, accuracy of the model, and other metrics are registered as an experiment in MLflow. (The pipeline runs on a Databricks cluster, which comes bundled with an MLflow server.)

Train ML Model and register experiment in MLflow

Dash, Senior Technical Evangelist and Developer Advocate at Snowflake

This pipeline is designed to handle (embrace!) data drift while ingesting web logs from AWS S3 and then transforming and enriching them before storing the curated data in Snowflake Data Cloud. The data drift alert is triggered if/when the data being ingested is missing a key field, IP Address, which is crucial for downstream analytics.

Embrace Data Drift or get left behind!
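
The alert condition in this post (a key field missing from ingested records) can be expressed as a small Python check. This only illustrates the rule and is not how the pipeline's data drift alert is configured. The field name `ip_address` and the message text are assumptions.

```python
def find_drifted_records(records, required_field="ip_address"):
    """Return the positions of records where the required field is absent or empty."""
    return [
        index
        for index, record in enumerate(records)
        if record.get(required_field) in (None, "")
    ]


def drift_alert(records, required_field="ip_address"):
    """Build an alert message when any record lacks the field, else None."""
    missing = find_drifted_records(records, required_field)
    if not missing:
        return None
    return (
        f"Data drift: {len(missing)} of {len(records)} records "
        f"are missing '{required_field}'"
    )
```

Returning `None` when nothing is missing lets the caller decide whether to keep processing quietly or raise the alert.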