Streaming Analytics in GCP (Source: Pixabay)

Hello everyone. In the previous article, Streaming Analytics in Google Cloud Platform - Introduction, we covered what streaming analytics is, which services we are going to use, and a quick introduction to each service. In this part of the series, we will install the SDKs and libraries and set up our environment.

Having a well-configured development environment is crucial down the road. I have seen many questions on Stack Overflow and other online forums about errors such as “module not found”, “unable to import packages”, “access denied”, “resource not found”, etc. The problem is not getting errors and fixing them by asking online. In fact, it is good practice to ask someone who knows and to do our own research to solve technical problems. If you look back, the problems you spent many hours researching are usually the ones that stay in your mind, maybe forever. In my first job, I spent a huge amount of time asking questions and researching how to set up MySQL replication, and the knowledge gained then is still fresh in my mind even after many years. So asking questions is good, and doing your own research is very important.

The problem is the time! You will spend a lot of time on these kinds of questions and waiting for answers, which could be avoided by spending a few minutes setting up your development environment correctly. I hope it is now clear why this part is so important. Let's begin with the installations.

Google Cloud Command-line SDK (gcloud)

gcloud is the command-line interface (CLI) tool for Google Cloud Platform (GCP). It is powerful and flexible, and lets you perform various tasks on GCP from the command line, such as creating, managing and interacting with services and deploying applications. It is an essential tool for developers working with GCP.

I am going to cover the installation of the gcloud CLI on macOS. The official documentation contains detailed instructions for other operating systems, so please refer to it if you are using Windows.

Create a folder in a preferred location

mkdir gcp
cd gcp

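Download the gcloud SDK (wget is not installed on macOS by default; if you do not have it, curl -O with the same URL works as well)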

wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-412.0.0-darwin-x86_64.tar.gz

Extract the gcloud archive in the current directory

tar -xzvf google-cloud-cli-412.0.0-darwin-x86_64.tar.gz

Run the installer script to start the gcloud SDK installation

./google-cloud-sdk/install.sh

Once the installation has completed, start a new terminal window so that the changes take effect. To verify the installation, run the following command. If the installation succeeded, it prints the versions of the installed components.

gcloud version
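
If the command is not found, the new terminal may not have picked up the PATH change yet. As a rough cross-check, here is a minimal Python sketch that looks for the gcloud executable on the PATH. The helper name find_gcloud is my own, not part of any SDK.

```python
import shutil


def find_gcloud():
    """Return the full path of the gcloud executable, or None if it is not on PATH."""
    return shutil.which("gcloud")


path = find_gcloud()
if path is None:
    # Typical causes: the terminal was not restarted, or install.sh did not update the shell profile.
    print("gcloud not found on PATH; open a new terminal or re-run install.sh")
else:
    print(f"gcloud found at {path}")
```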

To initialise gcloud, run the following command

gcloud init

The init command launches an interactive getting-started workflow for the gcloud command line. It authorises gcloud and other SDKs to access Google Cloud resources and sets the current project. This step is required to complete authentication and to access Google Cloud resources.

A few useful commands to try:

gcloud init - Initialise or reinitialise gcloud
gcloud version - Print version information for Google Cloud CLI components
gcloud info - Display information about the current gcloud environment
gcloud help - Search gcloud help text
gcloud cheat-sheet - Display gcloud cheat sheet

Apache Beam Python SDK

At the time of writing this article, the Apache Beam Python SDK officially supports Python 3.6, 3.7 and 3.8, and it does not support Python 3.10. I will be using Python 3.9. Please note that you might not see Python 3.9 listed in the official documentation, and as this is a continuously evolving project, the documentation may be updated at any time. Python 3.8 is the safer option, but for our implementation Python 3.9 works fine.

To check your current Python version:

python --version
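
The same check can be done from inside Python, which is handy later when a script runs under a different interpreter than the one you expect. This is a minimal sketch. The bounds 3.6 and 3.9 are simply the versions mentioned in this article, not an official support matrix, so adjust them to whatever the Beam documentation says when you read this.

```python
import sys

# Assumed bounds, taken from the versions mentioned in this article.
MIN_VERSION = (3, 6)
MAX_VERSION = (3, 9)


def is_supported(version_info, minimum=MIN_VERSION, maximum=MAX_VERSION):
    """Return True if the (major, minor) part of version_info lies within the bounds."""
    major_minor = tuple(version_info[:2])
    return minimum <= major_minor <= maximum


# Report the interpreter that is running this script.
print(tuple(sys.version_info[:3]), "supported:", is_supported(sys.version_info))
```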

You also need pip to install the Apache Beam package. If you do not have it, please install it; the steps are available here: https://pip.pypa.io/en/stable/installation/

If you are using Anaconda, there is a high chance that pip is already installed. You can check by running the command below:

pip --version

Create a virtual environment for the stream analytics project by running the command below:

python -m venv streamanalytics

And activate your environment by running

source streamanalytics/bin/activate

If you are using Anaconda, you can create the virtual environment by running the command below instead:

conda create -n streamanalytics python=3.9

To activate conda environment

conda activate streamanalytics

Quick note on virtual environments in Python:

A virtual environment isolates a specific Python environment on a single machine, allowing you to work on multiple projects with different packages and package versions without conflicts.

For example, project A may need FastAPI version 1.0 while project B requires FastAPI version 2.0. If you simply install packages without a virtual environment, they go into the global Python environment, where you can have either version 1.0 or version 2.0 but not both. By isolating the packages of project A and project B, you can have both versions.

Similarly, you may need different versions of Python on your machine. Overwriting or modifying system files while trying to install a different Python version in the global environment may lead to a broken system, where other functionality that requires the existing version stops working.

And when you want to ensure everyone in your team is working with the same package versions, you can use a virtual environment to reproduce the environment and share it with others.
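
To confirm that the interpreter you are running really belongs to an isolated environment, a small check like the following can help. It is a sketch under two assumptions: a venv is detected by sys.prefix differing from sys.base_prefix, and an activated conda environment is detected through the CONDA_DEFAULT_ENV variable that conda activate sets. The function name describe_environment is my own.

```python
import os
import sys


def describe_environment(prefix, base_prefix, environ):
    """Describe which kind of Python environment the given values point to."""
    # Inside a venv, sys.prefix points at the environment and differs from sys.base_prefix.
    if prefix != base_prefix:
        return f"venv at {prefix}"
    # Assumption: conda sets CONDA_DEFAULT_ENV when an environment is activated.
    conda_env = environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda environment '{conda_env}'"
    return "global Python installation"


print(describe_environment(sys.prefix, sys.base_prefix, os.environ))
```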

Download and install the Apache Beam packages. The quotes keep shells such as zsh, the default on macOS, from interpreting the square brackets:

pip install 'apache-beam[gcp]'

You may have seen only pip install apache-beam instead of apache-beam[gcp] in much of the documentation. The difference is that the [gcp] extra installs the additional dependencies Apache Beam needs to run data pipelines on Google Cloud Platform. It provides built-in integration with Google Cloud services such as the Cloud Dataflow runner, BigQuery, Cloud Storage, etc.
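
A quick way to see whether the extra dependencies actually landed in the active environment is to ask Python whether the modules can be found. The sketch below uses only the standard library. The module names google.cloud.pubsub and google.cloud.bigquery are my assumption about what the [gcp] extra pulls in, so treat a False there as a hint to investigate rather than a definitive verdict.

```python
import importlib.util


def module_available(name):
    """Return True if the named module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no import spec.
        return False


# Assumed module names for the core SDK and two of the GCP client libraries.
for name in ("apache_beam", "google.cloud.pubsub", "google.cloud.bigquery"):
    print(f"{name}: {'found' if module_available(name) else 'missing'}")
```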

There is one more thing: to define our pipeline in code, we need a text editor. I am using VS Code, but feel free to use any editor you are comfortable with. We are done with installing the packages and setting up the isolated virtual environment to run our pipeline. Next, let us create a project in Google Cloud Platform, create a Cloud Pub/Sub topic and subscription using the gcloud command-line tool, and enable the required access.

Google Cloud Project

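Create a new project for our streaming analytics pipeline. Project IDs must be unique across Google Cloud, so if streaming-analytics is already taken, choose another ID and use it in the commands that follow: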

gcloud projects create streaming-analytics
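
If the create command rejects your ID, it helps to check the format before trying again. The sketch below encodes the documented format rules as I understand them: 6 to 30 characters, lowercase letters, digits and hyphens only, starting with a letter and not ending with a hyphen. It cannot tell whether an ID is already taken, and the helper name is_valid_project_id is my own.

```python
import re

# 1 leading letter + 4..28 middle characters + 1 trailing letter or digit = 6..30 characters.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id):
    """Check only the format of a project ID, not whether it is available."""
    return PROJECT_ID_PATTERN.fullmatch(project_id) is not None


for candidate in ("streaming-analytics", "Streaming-Analytics", "abc", "my-project-"):
    print(candidate, is_valid_project_id(candidate))
```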

List the projects. This command lists the Project ID, Project Name and Project Number; note down the Project ID of the streaming-analytics project.

gcloud projects list

Set streaming-analytics project as our current project:

gcloud config set project streaming-analytics

To verify the above steps and view the current project and user, run the below command:

gcloud config list
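
If you want to read these values from a script, gcloud can print the configuration as JSON with the --format=json flag. The sketch below parses such output. I am assuming the project and account sit under a "core" object, and the sample string is made up for illustration, so compare it with what your own gcloud prints.

```python
import json


def current_project_and_account(config_json):
    """Extract (project, account) from the JSON text of `gcloud config list --format=json`."""
    config = json.loads(config_json)
    core = config.get("core", {})  # assumed location of both properties
    return core.get("project"), core.get("account")


# Illustrative sample only; real output may contain more sections.
sample = '{"core": {"account": "user@example.com", "project": "streaming-analytics"}}'
project, account = current_project_and_account(sample)
print(f"project={project} account={account}")
```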

You must have the permission below, or the Project Creator role, to create a project on Google Cloud:

resourcemanager.projects.create

Cloud Pub/Sub - Topic & Subscription

Create a Pub/Sub topic:

gcloud pubsub topics create analytics-topic

Create a Pub/Sub subscription and attach it to the topic created above:

gcloud pubsub subscriptions create analytics-subscription --topic analytics-topic

Cloud Pub/Sub - Publishing & Pulling Messages

To publish a test message run the following command:

gcloud pubsub topics publish analytics-topic --message '{"name":"streamanalytics", "age":2}'
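
Hand-writing JSON inside shell quotes gets error-prone once the messages grow. One option is to let Python serialise the record and build the same command as an argument list. This is only a sketch: build_publish_command is a hypothetical helper of mine, and it uses nothing beyond the topic name and the --message flag shown above.

```python
import json


def build_publish_command(topic, record):
    """Build the gcloud publish command as an argument list, with the record serialised as JSON."""
    payload = json.dumps(record)
    # Passing a list (rather than one shell string) avoids quoting problems with the JSON payload.
    return ["gcloud", "pubsub", "topics", "publish", topic, "--message", payload]


command = build_publish_command("analytics-topic", {"name": "streamanalytics", "age": 2})
print(command)
```

The list can then be handed to subprocess.run(command, check=True) on a machine where gcloud is installed and authorised.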

This message will be delivered to the Pub/Sub subscription. You can pull it by running the following command:

gcloud pubsub subscriptions pull analytics-subscription --format="json(ackId, message.attributes, message.data.decode(\"base64\").decode(\"utf-8\"), message.messageId, message.publishTime)"
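
The decode calls in the format expression are there because Pub/Sub returns the message data base64-encoded in its JSON representation. The sketch below shows the same decoding in plain Python, using a locally encoded copy of our test message so that it runs without any cloud access. The helper name decode_message_data is my own.

```python
import base64
import json


def decode_message_data(data_b64):
    """Decode base64 message data into text, then parse that text as JSON."""
    text = base64.b64decode(data_b64).decode("utf-8")
    return json.loads(text)


# Simulate what a pulled message carries: the published JSON, base64-encoded.
original = {"name": "streamanalytics", "age": 2}
encoded = base64.b64encode(json.dumps(original).encode("utf-8")).decode("ascii")

print(encoded)
print(decode_message_data(encoded))
```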

Sample output:

gcloud pubsub publish and receive messages

You must have the roles below assigned to publish and consume messages. Note that creating topics and subscriptions needs additional permissions, which roles/pubsub.editor includes:

roles/pubsub.publisher
roles/pubsub.subscriber
roles/pubsub.viewer

BigQuery - Creating Dataset

Create a dataset in BigQuery: a BigQuery dataset can be created with many client tools; we will be using the Google Cloud Console. For other options, refer to the official documentation.

Go to the BigQuery console -> click on View actions (next to the project name) -> Create dataset

Enter the Dataset ID (streamanalytics, which the queries below use), the data location and any other options, then click on Create Dataset.

BigQuery - CREATE & SELECT Table

In BigQuery, a table can be created in many ways. Let us create one with standard SQL as below and run it in the BigQuery editor:

CREATE TABLE streamanalytics.user(
    name STRING,
    age INTEGER
);

In a streaming analytics pipeline, we can create tables and populate data from the pipeline itself. Here we create the table in the console to set up the environment correctly and to verify access.

Populate table with sample data:

INSERT INTO `streamanalytics.user` VALUES ("stream",1);

To query user table:

SELECT * FROM `streamanalytics.user`;

You must have one of the roles below to create datasets and tables, insert records and select records:

BigQuery Admin OR BigQuery Data Editor OR BigQuery Data Owner

Cloud Dataflow Jobs:

To create and deploy Cloud Dataflow jobs, you need the role below assigned:

roles/dataflow.admin

Cloud Dataflow Service Account:

Dataflow uses the following two service accounts:

Dataflow Service Account - orchestrates the Dataflow environment and handles the interaction between your project and Dataflow. It is used for worker creation and monitoring, and requires the Dataflow Service Agent role.

Controller Service Account - used by the workers to access the resources needed by the pipeline. If none is specified, the Compute Engine default service account is used. You need to assign BigQuery and other service roles to this service account so that the Dataflow workers can access those services.
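
To make the "if none is specified" part concrete, here is a rough sketch of how the pipeline flags could be assembled, adding the service account flag only when one is given. The flag names are standard Beam pipeline options as far as I know. The region, bucket and service account values are hypothetical placeholders, and dataflow_args is a helper of my own, not something from the SDK.

```python
def dataflow_args(project, region, temp_location, service_account_email=None):
    """Assemble command-line style pipeline flags for a streaming Dataflow job."""
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        "--streaming",
    ]
    # Without this flag the workers run as the Compute Engine default service account.
    if service_account_email:
        args.append(f"--service_account_email={service_account_email}")
    return args


# Hypothetical values for illustration only.
print(dataflow_args(
    project="streaming-analytics",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    service_account_email="beam-worker@streaming-analytics.iam.gserviceaccount.com",
))
```

A list like this is what you would typically pass to Beam's PipelineOptions when the pipeline is launched, which we will get to once we start writing the pipeline itself.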

That’s it for today. We have covered installation and environment setup for developing and deploying our pipeline, created a Cloud Pub/Sub topic and subscription for publishing and consuming messages, and reviewed the permissions needed to create projects, pipelines, and BigQuery datasets and tables. We will jump into building our first data pipeline in Apache Beam in the next session. Have a good day!