Streaming Analytics in Google Cloud Platform (GCP) - Introduction

Streaming Analytics in Google Cloud Platform (Source: Pixabay)

From data to decision in real time

Welcome to our new series on building a streaming analytics system in the Google Cloud Platform! Let's begin with a quick introduction. Streaming analytics is the process of analysing data in real time, as it is received. It enables an organisation to gain insights and make decisions based on the most up-to-date data. This is crucial for business, as it allows organisations to respond to changes and opportunities in a timely manner.


For example, if an organisation can detect and respond to changes in customer behaviour or market conditions in real time, it can adjust its strategies and tactics to better meet the needs of its customers and take advantage of emerging opportunities. A streaming analytics system in a retail company might detect that a particular product is becoming more popular among customers and alert the company to this trend in real time. The company can then respond by increasing its orders for the product and adjusting its inventory to meet the growing demand. It can also track competitors' actions and adjust product prices dynamically based on supply. With real-time insights, a retail company can avoid running out of stock (and the lost sales that come with it) while taking advantage of increased demand and dynamic pricing to drive up its revenue.

 

I hope this example helps you understand the benefits of streaming analytics. There are many other use cases, such as fraud detection, performance monitoring and personalisation, where streaming analytics is transforming businesses. With this basic understanding in place, this practical guide will walk you through using Cloud Pub/Sub for data ingestion, Apache Beam pipelines on Dataflow for data processing, BigQuery for storage and analysis, and Looker Studio for data visualisation. Whether you're a seasoned GCP user or new to the platform, this series is designed to give you the skills and knowledge you need to build a powerful and efficient streaming analytics system.


We'll be using the Google Cloud SDK (gcloud CLI tool), Python SDK, and standard SQL to deploy resources, process and analyse data, and more. In the first part of this series, we'll provide a simple introduction to the relevant services, and in the subsequent articles, we'll delve into installing libraries, designing and deploying pipelines, performing aggregations, and more. Let's get started with the introduction!


Cloud Pub/Sub


Google Cloud Pub/Sub (Source: GCP)



Pub/Sub is a messaging service that allows us to send messages between applications; it is based on the publisher/subscriber (pub/sub) pattern. In a distributed system, multiple services run independently, and there must be some way for these services to communicate; Pub/Sub provides this capability. There are two components: the Topic and the Subscription. A message producer application publishes messages to a Topic, and one or more subscriber applications express interest in that topic and receive messages through Subscriptions.


In the pub/sub pattern, the publisher and subscriber are independent of each other. Before sending a message, a publisher application does not need to know who, or how many clients, will receive it. Similarly, a subscriber application need not know about the existence of the publisher. Pub/sub decouples subscribers from publishers and allows us to develop, deploy and scale applications independently.


Cloud Pub/Sub is a fully managed service, which means you don't need to worry about the infrastructure: with just a few clicks, you can create a topic and a subscription and be ready to send and receive messages. It is used for real-time, many-to-many, asynchronous messaging.
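
To make this concrete, here is a minimal sketch of publishing and receiving messages with the google-cloud-pubsub Python client. The project ID, topic and subscription names are hypothetical placeholders, and the topic and subscription are assumed to already exist.

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project_id = "my-project"  # hypothetical project ID

# Publish a message (with an optional attribute) to a topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-topic")
future = publisher.publish(topic_path, b"product-viewed", product="SKU-123")
print("Published message ID:", future.result())

# Receive messages through a subscription attached to that topic.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "my-subscription")

def callback(message):
    print("Received:", message.data)
    message.ack()  # acknowledge so Pub/Sub does not redeliver the message

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result(timeout=30)  # listen for 30 seconds
except TimeoutError:
    streaming_pull_future.cancel()
    streaming_pull_future.result()

Notice that the publisher only knows about the topic and the subscriber only knows about the subscription; neither knows about the other, which is exactly the decoupling described above.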


Apache Beam



Apache Beam (Source: beam.apache.org)



Beam stands for Batch + Stream. It is an open-source unified programming model for defining and executing data pipelines. The main advantage of Beam over other data processing frameworks is its portability.


First, Apache Beam provides the flexibility to express your data processing logic once and execute it for both batch and stream processing: you write a pipeline once, and it can handle both batch and streaming data without any changes to your code.


Second, you can programmatically define your pipeline using one of the supported languages. Beam provides a number of language-specific SDKs, including Java, Python, SQL and Go. This allows you to create your pipeline in whichever language you are comfortable with, and Beam's Portability Framework takes care of execution on the runner internally. This also enables cross-language transforms and multi-language pipelines.


Third, you have the flexibility to run your pipeline on multiple runners (execution environments). The Beam model is designed to be independent of the underlying execution engine, allowing the same Beam pipeline to be executed on different platforms without changing your code. Beam supports many runners, including Google Cloud Dataflow, Apache Flink, Apache Spark and Apache Samza.
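
To give you a feel for the model, here is a minimal word-count-style Beam pipeline in Python; a sketch only, with hypothetical step names. It runs locally on the DirectRunner, and switching the runner option to DataflowRunner would execute the same code on Dataflow.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes locally; swap in "DataflowRunner" to run on Dataflow.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Create words" >> beam.Create(["stream", "batch", "stream"])
     | "Pair with 1" >> beam.Map(lambda word: (word, 1))
     | "Count per word" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))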


One of the key benefits of the portability provided by Apache Beam is that it can help you avoid vendor lock-in. As the code is not tied to a specific vendor's technology or platform, it can be migrated to another runtime engine if necessary. This can provide significant benefits in terms of flexibility and the long-term maintainability of your application. Keep this point in mind when choosing your next data processing solution.


Cloud Dataflow



GCP Cloud Dataflow (Source: GCP)


Cloud Dataflow is a fully-managed serverless data processing service. It provides the execution environment for Apache Beam data pipelines. 


Scalability is an important consideration in a distributed system. As a fully managed service, Dataflow takes care of the underlying infrastructure and is designed to handle data processing at scale: it automatically scales up and down as needed to meet the pipeline's demands, reducing manual intervention.


As Dataflow is part of the Google Cloud Platform, it can be easily integrated with other Google Cloud services such as Cloud Pub/Sub, BigQuery, etc. This is super helpful when you are building an end-to-end analytics system in the Google Cloud Platform.
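
For a flavour of that integration, here is a sketch of a streaming Beam pipeline that reads messages from a Pub/Sub subscription and writes them to a BigQuery table. The project, subscription, dataset and table names are hypothetical, and the BigQuery table is assumed to already exist with a matching schema.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# streaming=True tells Beam this is an unbounded (streaming) pipeline.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/my-subscription")
     | "Decode" >> beam.Map(lambda data: {"payload": data.decode("utf-8")})
     | "Write to BigQuery" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.events",
           # the table is assumed to exist, so we never create it here
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))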


Refer to the official documentation for complete features provided by Cloud Dataflow.



BigQuery


Google BigQuery (Source: GCP)



BigQuery is a serverless data warehouse service provided by the Google Cloud Platform. It is a fully managed service: you simply load your data and start querying it using SQL. You can interact with BigQuery using various tools; the BigQuery web UI is a simple one to start with, and there are also the bq command-line tool and client libraries for Python, Java and Go.
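
As a quick taste, here is a minimal sketch that runs a standard SQL query through the google-cloud-bigquery Python client; the project, dataset and table names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Count events per payload value in a hypothetical events table.
query = """
    SELECT payload, COUNT(*) AS event_count
    FROM `my-project.my_dataset.events`
    GROUP BY payload
    ORDER BY event_count DESC
"""

for row in client.query(query).result():
    print(row.payload, row.event_count)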


BigQuery is highly scalable and handles large volumes of data efficiently. To achieve low latency and reduce query cost, BigQuery internally uses Columnar Storage (data is stored column-wise), distributed query execution (data is shuffled across multiple workers/nodes) and data partitioning. Also, at the user level, you can use partitioning and clustering options to improve performance and reduce costs.
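
For example, here is a sketch of creating a time-partitioned, clustered table with the Python client; the project, table and field names are hypothetical.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Define a table partitioned by event timestamp and clustered by payload.
table = bigquery.Table(
    "my-project.my_dataset.events_partitioned",
    schema=[
        bigquery.SchemaField("payload", "STRING"),
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="event_ts")
table.clustering_fields = ["payload"]
client.create_table(table)

Queries that filter on event_ts then scan only the relevant partitions, which is what reduces both latency and cost.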


BigQuery provides many interesting features including real-time analytics and built-in machine learning. Also, it is easy to integrate with other Google Cloud services.


These are the main services we will be using in this series. We might use a few more, such as Google Cloud Storage and Cloud Functions; I will add a short introduction to those services at the time of implementation.


Part 1 is in the books, but the fun is just getting started. In our next instalment, we'll be tackling the important task of installing the SDKs and client libraries needed for this project. Keep an eye out for it, and have a great day in the meantime!


References


Streaming data: https://en.wikipedia.org/wiki/Streaming_data
Publisher-Subscriber pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/publisher-subscriber
Cloud Pub/Sub: https://cloud.google.com/pubsub
Apache Beam: https://beam.apache.org/about/
Cloud Dataflow: https://cloud.google.com/dataflow
BigQuery: https://cloud.google.com/bigquery
Columnar Storage: https://cloud.google.com/blog/topics/developers-practitioners/bigquery-explained-storage-overview
