Custom Scala Project for StreamSets Transformer

Dash Desai
4 min read · Nov 11, 2019

StreamSets Transformer is a powerful tool for creating highly instrumented Apache Spark applications for modern ETL. It is the newest addition to the StreamSets DataOps Platform. With StreamSets Transformer you can quickly start leveraging all the benefits and power of Apache Spark with minimal operational and configuration overhead. It provides enterprises with the flexibility to create ETL pipelines for both batch and streaming data and also gives clear visibility into their data processing operations and performance across both cloud and on-prem systems.

In addition, StreamSets Transformer also enables you to extend its functionality. This tutorial explains how to create a custom Scala project and import the compiled jar into Transformer.

Prerequisites

To follow along, you'll need a JDK, Scala 2.11, sbt, and an IDE such as IntelliJ IDEA installed locally, plus access to a StreamSets Transformer instance.

Note: Transformer includes the Spark libraries required to preview dataflow pipelines, but you will need an Apache Spark 2.3 (or higher) distribution to run the pipeline.

Create Scala Project

In your favorite IDE, create a new Scala project.

Make sure you've selected JDK and Scala versions that match your target Spark distribution. Spark 2.3 builds against Scala 2.11, which is why the jar we produce later is named sampleproject_2.11-0.1.jar.

Once the project is successfully created, it should have the standard sbt structure: a build.sbt file at the root, sources under src/main/scala, and build definitions under project/.
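For reference, here is a minimal build.sbt consistent with the project name shown in the sbt output later in this tutorial and with the jar it produces (sampleproject_2.11-0.1.jar). The exact Scala patch version is an assumption, so match it to your environment:

name := "SampleProject"

version := "0.1"

scalaVersion := "2.11.12" // any Scala 2.11.x release; 2.11.12 is an assumption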

Next, we need to add Scala framework support. If you're using IntelliJ, right-click on the project, select Add Framework Support…, and enable Scala.

Now we're ready to create a new package and add our Scala object.

The main Scala object is called Demo, and it implements the methods main(args: Array[String]), printStrings(args: String*), and hello(name: String = "Transformer"). Since the focus of this tutorial is to illustrate how to import custom jars written in Scala into Transformer, we'll keep the method implementations to a minimum :)

package com.streamsets.dash

object Demo {
  def main(args: Array[String]): Unit = {
    printStrings("Hello", "World", "Scala")
  }

  // Print each variable-length argument along with its position.
  def printStrings(args: String*): Unit = {
    var i: Int = 0
    for (arg <- args) {
      println("Arg value[" + i + "] = " + arg)
      i = i + 1
    }
  }

  // Return a greeting; the name defaults to "Transformer".
  def hello(name: String = "Transformer"): String = {
    "Hello " + name
  }
}
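Before packaging, you can optionally sanity-check the object from the sbt console. The session below is a sketch: the res0 label is standard Scala REPL output, and the printed lines follow directly from printStrings.

$ sbt console
scala> com.streamsets.dash.Demo.hello()
res0: String = Hello Transformer

scala> com.streamsets.dash.Demo.printStrings("Hello", "World", "Scala")
Arg value[0] = Hello
Arg value[1] = World
Arg value[2] = Scala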

Now build the project from the root folder by running sbt package.

$ sbt package 
[info] Loading settings for project global-plugins from idea.sbt ...
[info] Loading global plugins from /Users/dash/.sbt/1.0/plugins
[info] Loading project definition from /Users/dash/Apps/StreamSets/Transformer_Custom_Scala_Project/sample_project/project
[info] Loading settings for project sample_project from build.sbt ...
[info] Set current project to SampleProject (in build file:/Users/dash/Apps/StreamSets/Transformer_Custom_Scala_Project/sample_project/)
[success] Total time: 1 s, completed Nov 7, 2019 1:16:27 PM

You should see the jar file sampleproject_2.11-0.1.jar created in the target/scala-2.11/ folder.
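To confirm the Demo object made it into the jar, you can list its contents with the standard jar tool (Scala compiles an object into both a Demo.class and a Demo$.class; other entries are elided below):

$ jar tf target/scala-2.11/sampleproject_2.11-0.1.jar
...
com/streamsets/dash/Demo.class
com/streamsets/dash/Demo$.class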

Create StreamSets Transformer Pipeline

Since the focus of this tutorial is to illustrate how to import custom jars written in Scala into StreamSets Transformer, we’ll keep the pipeline definition itself to a minimum :)

Step 1. Click on Create New Pipeline button, enter a name and click Save.

Step 2. For “Select Origin…” select Dev Raw Data Source.

Step 3. For “Select Processor to connect…” select Scala.

Step 4. For “Select Destination to connect…” select Trash.

Step 5. Select the Scala processor and click on External Libraries in the bottom pane to install sampleproject_2.11-0.1.jar under the Basic Stage Library.

Step 6. Restart Transformer.

Step 7. Select the Scala processor > Scala tab and replace the existing code with the following code snippet.

In this example code snippet, we're ignoring the input data and creating a new output dataframe with values returned by calling methods on the imported com.streamsets.dash.Demo object found in the external library, sampleproject_2.11-0.1.jar.
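Here is a minimal sketch of what that snippet can look like. The spark and output variables are assumed to be the ones Transformer's Scala processor exposes, so check the default code in the Scala tab for the exact names in your version:

// Ignore the incoming data and build a one-column DataFrame from the
// values returned by the imported Demo object's methods.
// "Dash" is just an illustrative argument for the default parameter.
import spark.implicits._
import com.streamsets.dash.Demo

output = Seq(Demo.hello(), Demo.hello("Dash")).toDF("greeting")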

Step 8. Click on Preview; the output records should contain the greeting values returned by the Demo object's methods.

Conclusion

It’s pretty straightforward to build custom Scala projects and import the jars in StreamSets Transformer, particularly if you already have Apache Spark and Scala development skills.

The complete sample project can be found here.

Originally published at https://github.com.

