Imagine asking Amazon Alexa or Google Home to run your ETL, data processing, and machine learning data pipelines. For example: “Start my data pipeline on Amazon EMR”, “How many active jobs do I have running on Databricks?”, or “Stop my machine learning pipeline on Snowflake”, all without firing up your computer or asking an admin or your favorite data engineer to do the work for you.

The concept of conversing with a computer is fascinating and has been around for a while: think Star Trek’s “LCARS” and HAL 9000 from “2001: A Space Odyssey”…

Change Data Capture (CDC) is becoming essential to migrating to the cloud. In this blog, I outline the steps, with detailed explanations, to load CDC data from PostgreSQL to Amazon Redshift using StreamSets Data Collector, a fast data ingestion engine.

The data pipeline first writes PostgreSQL CDC data to Amazon S3 and then executes a set of queries to perform an upsert operation on Amazon Redshift. …
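As a rough illustration of the upsert step, here is a minimal Python sketch that builds the SQL for a common Redshift pattern: delete the rows in the target table that match a staging table, then insert everything from staging. This is an assumption about the general approach, not the exact queries the pipeline runs. The table names, the key column, and the `build_upsert_statements` helper are hypothetical, and loading the S3 files into the staging table is assumed to have happened already.

```python
def build_upsert_statements(target, staging, key_columns):
    """Return SQL statements for a staged delete-then-insert upsert.

    Hypothetical helper for illustration only. It assumes the staging
    table already holds the latest CDC rows and has the same column
    layout as the target table.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")

    # Match target rows to staging rows on every key column.
    match = " AND ".join(
        f"{target}.{col} = {staging}.{col}" for col in key_columns
    )

    return [
        "BEGIN",
        # Remove the old versions of rows that are about to be replaced.
        f"DELETE FROM {target} USING {staging} WHERE {match}",
        # Insert both the updated rows and the brand-new rows.
        f"INSERT INTO {target} SELECT * FROM {staging}",
        "COMMIT",
    ]


if __name__ == "__main__":
    # Table and column names below are made up for the example.
    for statement in build_upsert_statements(
        "customers", "customers_staging", ["id"]
    ):
        print(statement + ";")
```

Wrapping both statements in one transaction keeps readers from seeing the table in a half-updated state. Note that this sketch does not handle CDC delete operations, which would need their own statement.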

Learn how to load data from S3 to Snowflake and serve a TensorFlow model in a StreamSets Data Collector (a fast data ingestion engine) data pipeline that scores data as it flows from S3 to Snowflake.

Data and analytics are helping us become faster and smarter at staying healthy. Open data sets and analytics at cloud scale are key to unlocking the accuracy needed to make real impacts in the medical field. Data Clouds like Snowflake are prime enablers for opening up analytics with self-service data access and scale that mirrors the dynamic nature of data today. One prominent pattern and…

What is Google Dataproc?

Dataproc is a low-cost, easy-to-use managed Spark and Hadoop service, integrated with Google Cloud Platform, that can be leveraged for batch processing, streaming, and machine learning use cases.

What is Google BigQuery?

BigQuery is an enterprise-grade data warehouse that enables high-performance SQL queries using the processing power of Google’s infrastructure.

Load Data Into Google BigQuery and AutoML

In this blog, we will review an ETL data pipeline in StreamSets Transformer, a Spark ETL engine, that ingests real-world data from the Fire Department of New York (FDNY) stored in Google Cloud Storage (GCS), transforms it, and stores the curated data in Google BigQuery.

Once the transformed data is made available in…

In this blog, you will learn how to ingest Salesforce data using the Bulk API (optimized to process large sets of data) and store it in an Amazon Simple Storage Service (Amazon S3) data lake using StreamSets Data Collector, a fast data ingestion engine. The primary AWS service used in our data pipeline is Amazon S3, which provides cost-effective storage and archival to underpin the data lake.

Consider the use case where a data engineer is tasked with archiving all Salesforce contacts along with some of their account information in Amazon S3. To demonstrate an approach of connecting Salesforce and…

Learn how StreamSets, a modern data integration platform for DataOps, can help expedite operations at some of the most crucial stages of the Machine Learning Lifecycle and MLOps.

Data Acquisition And Preparation

Machine learning models are only as good as the quality of the data and the size of the datasets used to train them. Data has shown that data scientists spend around 80% of their time preparing and managing data for analysis, and 57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work. …

In this blog, we will review how easy it is to set up an end-to-end ETL data pipeline that runs on StreamSets Transformer to perform extract, transform, and load (ETL) operations. The pipeline uses an Apache Spark for Azure HDInsight cluster to extract raw data and transform it (cleanse and curate) before storing it in multiple destinations for efficient downstream analysis. The pipeline also uses technologies like Azure Data Lake Storage Gen2 and Azure SQL Database, and the curated data is queried and visualized in Power BI.

StreamSets Transformer is an execution engine that runs on Apache Spark, an open-source distributed…

Learn how quickly you can start ingesting and aggregating clickstream logs using StreamSets Transformer running on Amazon EMR, and see how the data is analyzed in Elasticsearch, Kibana, and Amazon Redshift.

Clickstream analysis by definition is the process of collecting, analyzing, and reporting aggregate information about webpage visits. StreamSets Transformer is an execution engine that runs on different “flavors” of Apache Spark including Amazon EMR, Hadoop, Databricks, and SQL Server 2019 Big Data Cluster. For a full list, see installation requirements.
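To make the definition of clickstream analysis above concrete, here is a minimal sketch of one such aggregate: counting views per page. It is plain Python rather than a StreamSets Transformer or Spark pipeline, and the `(user_id, url)` event shape, the sample data, and the `views_per_page` helper are assumptions made for illustration.

```python
from collections import Counter


def views_per_page(events):
    """Count page views per URL.

    `events` is an iterable of (user_id, url) tuples. This event shape
    is hypothetical and chosen only to keep the example small.
    """
    return Counter(url for _user_id, url in events)


# Made-up sample events for the example.
sample_events = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/pricing"),
]

counts = views_per_page(sample_events)
for url, views in counts.most_common():
    print(url, views)
```

In a distributed engine the same idea is typically expressed as a group-by on the URL followed by a count.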

Pipeline Overview

Here are the details of the dataset and pipeline components:

  • Transformations: Include aggregations, such as:
  • Number of views for…

Learn how to load a serialized Spark ML model stored in MLeap bundle format on Databricks File System (DBFS), and use it for classification on new, streaming data flowing through the StreamSets DataOps Platform.

In my previous blogs, I illustrated how easily you can extend the capabilities of StreamSets Transformer using Scala and PySpark. If you have not read the blogs on how to train a Spark ML Random Forest Regressor model, serialize the trained model, and train a Logistic Regression NLP model, I highly recommend doing so before proceeding because this blog builds upon them.

Ok, let’s get right to it!

Streaming Data: Twitter to Kafka

I’ve designed this StreamSets…

Recently I attended an inspirational tech talk hosted by Databricks where the presenters shared some great tips and techniques for analyzing the freely available COVID-19 Open Research Dataset (CORD-19). As stated in its description:

“In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.”

In this blog post, I’m sharing some of the analysis presented during the tech talk which I have now replicated using…

Dash Desai

Director of Platform, Technical Evangelism @ StreamSets | #DataScience | #MachineLearning | #BigData | #CloudComputing | #Travel | #Photography @natureunraveled
