Ingest Salesforce Data Into Amazon S3 Data Lake

Dash Desai
3 min read · Nov 5, 2020

In this blog post, you will learn how to ingest Salesforce data using the Bulk API (optimized for processing large data sets) and store it in an Amazon Simple Storage Service (Amazon S3) data lake using StreamSets Data Collector, a fast data ingestion engine. The primary AWS service in our data pipeline is Amazon S3, which provides cost-effective storage and archival to underpin the data lake.

Consider the use case where a data engineer is tasked with archiving all Salesforce contacts, along with some of their account information, in Amazon S3. To demonstrate one approach to connecting Salesforce and AWS, I have created a data pipeline designed for a seamless, secure flow of data between Salesforce and Amazon S3.

Pipeline Overview And Implementation

Let’s take a deep dive into our data pipeline implementation.

Salesforce origin

  • You can configure the Salesforce origin to read existing data using the Bulk or SOAP API and provide the SOQL query, offset field, and optional initial offset to use. When using the Bulk API, you can enable PK Chunking to efficiently process very large volumes of data.
  • The Salesforce origin is also capable of performing a full or incremental read at specified intervals.
  • The origin can also be configured to subscribe to notifications and process PushTopic, platform event, or change data capture (CDC) events.
  • In our case, the origin is configured to ingest existing contact information using Salesforce Object Query Language (SOQL) in Bulk API mode; a minimal Python sketch of an equivalent read follows this list.
  • SOQL used to retrieve contacts: SELECT Id, AccountId, FirstName, LastName, LeadSource, Email FROM Contact WHERE Id > '${OFFSET}' ORDER BY Id
  • For details on additional configuration, refer to the documentation.
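To make the origin’s behavior concrete, here is a minimal Python sketch of an equivalent Bulk API read using the simple-salesforce library. The library choice, credentials, and initial offset value are illustrative assumptions; in the pipeline itself, the Salesforce origin handles connection, offset tracking, and batching declaratively.

```python
# A minimal sketch of an equivalent Bulk API read (not the StreamSets
# pipeline itself). Credentials and offset values are placeholders.
from simple_salesforce import Salesforce

sf = Salesforce(
    username="user@example.com",   # placeholder credentials
    password="password",
    security_token="token",
)

offset = "000000000000000"  # initial offset; the origin tracks this for you

soql = (
    "SELECT Id, AccountId, FirstName, LastName, LeadSource, Email "
    f"FROM Contact WHERE Id > '{offset}' ORDER BY Id"
)

# sf.bulk.<Object>.query() runs the SOQL through the Bulk API
contacts = sf.bulk.Contact.query(soql)
print(f"Fetched {len(contacts)} contact records")
```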

Salesforce Lookup processor

  • This processor is configured to perform a lookup against Salesforce to retrieve additional information and enrich data before storing it in Amazon S3.
  • In particular, based on the AccountId associated with each contact, it retrieves AnnualRevenue, AccountSource, and Rating for that account (a rough equivalent is sketched after this list).
  • For details on additional configuration, refer to the documentation.
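For illustration, here is a rough Python equivalent of the lookup step, again assuming simple-salesforce; the Lookup processor performs this enrichment declaratively, record by record.

```python
# A minimal sketch of the lookup: given a contact record, fetch the
# related account fields and merge them into the record.
def enrich_contact(sf, contact):
    result = sf.query(
        "SELECT AnnualRevenue, AccountSource, Rating "
        f"FROM Account WHERE Id = '{contact['AccountId']}'"
    )
    if result["records"]:
        account = result["records"][0]
        contact["AnnualRevenue"] = account["AnnualRevenue"]
        contact["AccountSource"] = account["AccountSource"]
        contact["Rating"] = account["Rating"]
    return contact
```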

Field Masker processor

  • This processor is configured to mask PII (the contact’s email address) before the data is stored in Amazon S3; one plausible mask is sketched after this list.
  • For details on additional configuration, refer to the documentation.
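The Field Masker processor offers several configurable mask types; the sketch below shows just one plausible choice, keeping the first character of the local part and the domain while replacing the rest with a fixed character.

```python
# A minimal sketch of one plausible email mask (illustrative only).
def mask_email(email):
    if not email or "@" not in email:
        return email
    local, domain = email.split("@", 1)
    return f"{local[0]}{'x' * (len(local) - 1)}@{domain}"

assert mask_email("jane.doe@example.com") == "jxxxxxxx@example.com"
```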

Schema Generator processor

  • This processor is configured to automatically generate an Avro schema based on the structure of the contact records.
  • This enables writing the data in compact, compressible Avro format for cost-effective storage in Amazon S3 (a simplified schema-inference sketch follows this list).
  • For details on additional configuration, refer to the documentation.
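As a rough illustration of what schema generation involves, here is a deliberately simplified Python sketch that maps each field’s Python type to an Avro type; the actual processor handles nested structures, dates, decimals, and more.

```python
# A simplified sketch of Avro schema inference from a record's structure.
def infer_avro_schema(record, name="Contact"):
    type_map = {str: "string", int: "long", float: "double", bool: "boolean"}
    fields = [
        # Allow nulls for every field; unknown types fall back to string.
        {"name": field, "type": ["null", type_map.get(type(value), "string")]}
        for field, value in record.items()
    ]
    return {"type": "record", "name": name, "fields": fields}
```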
Amazon S3 destination

  • The Amazon S3 destination is configured to store the contacts data in compressed Avro format.
  • It is also configured to use AWS Server-Side Encryption (SSE) to protect and secure contacts data written to Amazon S3.
  • For details on additional configuration, refer to the documentation. A rough Python equivalent of the write step is sketched below.
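Putting the last two stages together, here is a minimal sketch of serializing the masked, enriched records to a deflate-compressed Avro file and uploading it to S3 with SSE-S3 encryption. The bucket name and object key are placeholders, and fastavro/boto3 stand in for the destination’s built-in Avro and SSE support.

```python
import io

import boto3
from fastavro import parse_schema, writer

def write_contacts_to_s3(records, schema,
                         bucket="my-data-lake",            # placeholder
                         key="salesforce/contacts.avro"):  # placeholder
    buf = io.BytesIO()
    # Deflate-compress the Avro blocks for cost-effective storage
    writer(buf, parse_schema(schema), records, codec="deflate")
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=buf.getvalue(),
        ServerSideEncryption="AES256",  # SSE with Amazon S3-managed keys
    )
```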

Pipeline Run

After the pipeline runs successfully, you should see output similar to what is shown below. Notice the highlighted AWS encryption and data format of the object stored on Amazon S3.

And the contents of the S3 object stored in Avro format should look something like this.

In this post, you learned how companies can realize value by integrating data between AWS and Salesforce using StreamSets Data Collector. Closer integration between AWS and Salesforce opens up plenty of opportunities for enterprises to develop new and unique ways of accessing, analyzing, and storing their data.

Originally published at https://streamsets.com on November 5, 2020.
