Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Dash Desai
3 min readApr 23, 2020

Learn how quickly you can start ingesting and aggregating clickstream logs using StreamSets Transformer running on Amazon EMR, and see how the data is analyzed in Elasticsearch, Kibana, and Amazon Redshift.

Clickstream analysis by definition is the process of collecting, analyzing, and reporting aggregate information about webpage visits. StreamSets Transformer is an execution engine that runs on different “flavors” of Apache Spark including Amazon EMR, Hadoop, Databricks, and SQL Server 2019 Big Data Cluster. For a full list, see installation requirements.

Pipeline Overview

Here are the details of the dataset and pipeline components:

  • Transformations: Include aggregations, such as:
  • Number of views for each session with respect to action for a specific URL
  • The total number of sessions for each client IP address
  • Number of events captured for each brand of products
  • Destinations:
  • Aggregations are stored in Amazon Redshift tables. (Note: if the tables don’t already exist, the destination can be configured for the tables to be auto-created.)
  • All the logs are sent to Elasticsearch for searching and quick visualizations in Kibana. (Note: if the index doesn’t already exist in Elasticsearch, the destination can be configured for the index to be auto-created.)

Aggregations

Here are the aggregated stats being collected and stored in Amazon Redshift.

Elasticsearch and Kibana

Once the logs were available in Elasticsearch, I created an index pattern called clickstream_data with all the attributes of the logs.

Using the clickstream_data index pattern as source, I then created a dashboard with different visualizations in Kibana.

  • Session Wise Analysis — Number of views for each session with respect to action for a specific URL
  • Client Wise Analysis — The total number of sessions for each client IP address.
  • Brand Analysis — Number of events captured for each brand of products
  • HTTP Response Analysis — Number of events captured with a response status such as Successful, Request Lost, No Response, Error Response, etc.

Querying Amazon Redshift

Aggregations are stored in number_of_views_per_session, number_of_sessions_per_ip, and number_of_events_per_http_response Redshift tables for faster-querying. For example, query to see the top 5 IP addresses from where the HTTP sessions were initiated.

SELECT distinct(clientip),total_sessions from number_of_sessions_per_ip order by total_sessions desc limit 5

Sample Pipeline and Dataset on GitHub

If you’d like to get a head start and/or build upon this pipeline, download sample pipeline and sample dataset from GitHub.

After importing the sample pipeline, update pipeline parameters with Amazon EMR credentials and other configuration details, such as, Amazon S3 buckets, Amazon Redshift endpoint, and Elasticsearch URL before running the pipeline.

Enjoy!

Summary

While there are different ways to dissect and analyze data, hopefully this blog and demo video gives you ideas on how to use some of these tools you might have at your disposal in order to make better, data-driven decisions, faster.

Learn more about StreamSets For AWS and StreamSets Transformer.

Originally published at https://streamsets.com on April 23, 2020.

--

--

Dash Desai

Lead Developer Advocate @ Snowflake | AWS Machine Learning Specialty | #DataScience | #ML | #CloudComputing | #Photog