How to perform join operations in BigQuery? Exploring BigQuery Join Operations: Broadcast and Hash Joins & Nested and Repeated Structures.

BigQuery - SQL Joins (Photo by Resource Database on Unsplash)


SQL joins combine columns from multiple tables to produce the desired result set. In a typical relational model we use normalized tables, where each table represents an entity (for example: employee, department) and its relationships. When we need data from more than one table, for example an employee's name and department, we join the employee name column from the employee table with the department name column from the department table using a key column that exists in both tables.
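For illustration, here is a minimal sketch of such a join run through the BigQuery Python client; the dataset and table names (`hr.employee`, `hr.department`) and the `department_id` key column are assumptions made for the example, not a real schema.

```python
# Hypothetical example: join employee and department tables in BigQuery.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials and project

query = """
    SELECT e.employee_name, d.department_name
    FROM `hr.employee` AS e
    JOIN `hr.department` AS d
      ON e.department_id = d.department_id
"""

# Run the query and print each joined row.
for row in client.query(query).result():
    print(row.employee_name, row.department_name)
```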

How to Choose a Data Serialization/Encoding Format? A Practical Guide for Engineers

Data Encoding & Decoding. Image Source: Unsplash

In the world of software, we often work with different types of data like lists, tables, and more. These data structures are designed to be fast and efficient when our programs use them. However, sometimes we need to move this data out of memory, for example to save it to a file or send it over the internet. To do this, we have to convert the data into a sequence of bytes, a format quite different from in-memory data structures. This process is what we call encoding, or serialization.
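As a small, self-contained sketch of the idea (using JSON purely as an example format):

```python
# Encoding/serialization in miniature: an in-memory dict becomes bytes that can
# be written to a file or sent over a network, then decoded back into an object.
import json

record = {"id": 42, "name": "Alice", "tags": ["admin", "engineering"]}

encoded = json.dumps(record).encode("utf-8")   # serialize: object -> bytes
print(encoded)

decoded = json.loads(encoded.decode("utf-8"))  # deserialize: bytes -> object
assert decoded == record
```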

Unlock Advanced Data Visualization: The Complete Guide to Installing and Using Apache Superset on Linux

Data Visualization - Apache Superset Guide. Image Source: Unsplash


Note: This article provides a comprehensive guide on deploying and using Apache Superset on a Linux server. It covers the installation and configuration process, as well as the benefits and features of Superset. While the primary focus is on Superset, we will also explore the broader concepts of business intelligence, data analytics, and visualization.

GCP Cloud Pub/Sub Replay: Seeking to timestamp & Seeking to snapshots

Google Cloud Pub/Sub Replay (Pixabay)


Let's assume you have a data pipeline deployed on Google Cloud Platform: events are published to a Cloud Pub/Sub topic by a publisher client and consumed by a data processing application, which reads messages from the Cloud Pub/Sub subscription, processes them, and writes the results to a BigQuery table.
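As a rough sketch of the publishing side of such a pipeline, assuming the google-cloud-pubsub client library and placeholder project, topic, and event names:

```python
# Minimal sketch (not the article's pipeline code): publish a JSON event
# to a Cloud Pub/Sub topic. "my-project" and "events-topic" are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events-topic")

event = {"order_id": 123, "status": "CREATED"}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("Published message id:", future.result())
```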

[Solved] Access is denied - Check credentials and try again - Microsoft Graph - Calendar API

Microsoft Graph (Source: microsoft.com)


When sending a request to the Microsoft Graph API, it responds with an "Access is denied" error. You may have followed the documentation, added the correct permissions, and granted admin consent, yet the request still fails with the same error. Let's look at the solution to this issue in this short article.
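For context, a calendar request along the lines of the following sketch is the kind of call that can come back with this response; the user ID and access token are placeholders, and token acquisition is omitted.

```python
# Illustrative Microsoft Graph calendar request; user_id and access_token are
# placeholders. A misconfigured app registration can yield "Access is denied".
import requests

user_id = "user@example.com"
access_token = "<app-only-access-token>"

resp = requests.get(
    f"https://graph.microsoft.com/v1.0/users/{user_id}/calendar/events",
    headers={"Authorization": f"Bearer {access_token}"},
)
print(resp.status_code, resp.json())
```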

Streaming Analytics in Google Cloud Platform (GCP) - Building Data Pipeline with Apache Beam

Building Apache Beam Data Pipeline (Source: Pixabay)


In the introductory article of this series, Streaming Analytics in Google Cloud Platform (GCP) - Introduction, we covered the basics of streaming analytics, its importance and example use cases, and gave a short introduction to the Google Cloud services we will be using to build a streaming analytics system in Google Cloud Platform.

Streaming Analytics in Google Cloud Platform (GCP) - Setting Up The Environment

Streaming Analytics in GCP (Source: Pixabay)


Hello everyone! In the previous article, Streaming Analytics in Google Cloud Platform - Introduction, we covered what streaming analytics is, which services we are going to use, and gave a quick introduction to each service. In this part of the series, we will install the SDKs and libraries and set up our environment.


Streaming Analytics in Google Cloud Platform (GCP) - Introduction

Streaming Analytics in Google Cloud Platform (image source - pixabay)

From data-to-decision in real-time 

Welcome to our new series on building a streaming analytics system in Google Cloud Platform! Let's begin with a quick introduction. Streaming analytics is the process of analysing data in real time as it is received. It enables an organisation to gain insights and make decisions based on the most up-to-date data, which is crucial for business because it allows organisations to respond to changes and opportunities in a timely manner.

Installing and configuring Docker Engine and Docker Compose on CentOS

Installing & Configuring Docker Engine & Docker Compose on CentOS. Source: pixabay


I have written this post as a quick reference guide for installing and configuring Docker Engine and Docker Compose on CentOS servers. Knowing the basics of Docker containers helps you focus on the end goal of solving problems rather than spending your energy on less important aspects. If you have experimented with many packages and applications, you probably know that Docker containers make it easy to install and run software without worrying about dependencies, scripts, configuration, and so on. Also, when we build and release our solutions, it is important to let others consume them through a simple process, and Docker images help you achieve that. I will cover the following topics in this article:

Real-time Monitoring and Log Streaming in Google Cloud Platform (GCP)

Monitoring Dashboard with charts - credit pixabay 



Getting insight into the performance, availability, and health of infrastructure and applications is critical for building and managing reliable systems. When we are dealing with clusters of instances, it becomes very challenging to collect, aggregate, and derive actionable insights from data in real time. There are monitoring tools available to address this challenge, both open source and commercial; in this article we discuss how to achieve real-time log streaming, analytics, and monitoring in Google Cloud Platform.

Understanding MySQL Architecture

The architecture of the world's most popular open source database system matters to anyone working in information technology. There are many reasons for MySQL's popularity around the world, but one of the main reasons is its architecture: while there are many big players such as Oracle, Microsoft SQL Server, and DB2, MySQL's architecture makes it a unique and preferred choice for many developers. In this article, we discuss the internal architecture of the MySQL relational database management system. The article is aimed at novice database administrators, database developers, software developers, and anyone interested in working with MySQL.


Major components:

The MySQL architecture describes how the different components of a MySQL system relate to one another. MySQL is fundamentally a client-server system: the MySQL database server is the server, and the applications that connect to it are the clients. The MySQL architecture contains the following major components.

MySQL Architecture

Face Recognition - Computing Euclidean distance in PostgreSQL

In this article, we discuss implementing Euclidean distance in a PostgreSQL database. Before getting into the actual implementation, let me give you some background on the need for writing this article. I have been working on a face authentication system, and to perform the face verification task we need to compute the distance between two faces. There are many implementations out there that achieve this using Python; however, they did not help in my case, so I implemented the Euclidean distance computation in PostgreSQL. Let's look at the challenges and solutions in detail.
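To make the core operation concrete, here is a rough, hypothetical sketch of computing the Euclidean distance between a query vector and encodings stored in PostgreSQL. The table and column names (faces, person_name, encoding stored as float8[]) are illustrative assumptions, not the schema used in this article.

```python
# Hypothetical sketch: find the stored face encoding closest to a query vector
# using a Euclidean distance computed entirely inside PostgreSQL.
import psycopg2

query_encoding = [0.12, -0.04, 0.33]  # in practice, a 128-dimension vector

sql = """
    SELECT f.person_name,
           sqrt(sum((pair.e - pair.q) ^ 2)) AS distance
    FROM faces AS f,
         unnest(f.encoding, %s::float8[]) AS pair(e, q)
    GROUP BY f.person_name
    ORDER BY distance
    LIMIT 1;
"""

with psycopg2.connect("dbname=facedb user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(sql, (query_encoding,))
        print(cur.fetchone())  # (name, distance) of the closest match
```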

Face Recognition System:

In simple terms, my implementation of the face recognition system consists of two parts:

  • Face Registration
  • Face Verification

Face Registration:


This part consists of the following steps:

  • Users register their faces with a name and an image.
  • The registered photos are stored in a folder named after the user.
  • The user images are fed to a Convolutional Neural Network (CNN) model, which extracts 128 measurements for each image.
  • A K-Nearest Neighbour classifier is trained with the names and their corresponding 128-dimension encodings.
  • The classifier is saved as a pickle file in the application directory (see the sketch below).
The CNN model used here is a ResNet network with 29 layers, trained on about 3 million images - thanks to Davis King (dlib) for this great work and for making it available to the public. Refer to this page to learn more about the model.
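A minimal sketch of these registration steps, assuming the face_recognition library and scikit-learn; the folder layout and file names are placeholders, not the project's actual code.

```python
# Registration sketch: encode each registered image and train a KNN classifier.
import os
import pickle

import face_recognition
from sklearn.neighbors import KNeighborsClassifier

encodings, names = [], []
for person in os.listdir("registered_faces"):              # one folder per user
    person_dir = os.path.join("registered_faces", person)
    for file_name in os.listdir(person_dir):
        image = face_recognition.load_image_file(os.path.join(person_dir, file_name))
        faces = face_recognition.face_encodings(image)     # 128-d vector per face
        if faces:
            encodings.append(faces[0])
            names.append(person)

classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(encodings, names)

with open("face_classifier.pkl", "wb") as f:               # saved in the app directory
    pickle.dump(classifier, f)
```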


Face Verification:


This part consists of the following steps:

  • When a user tries to log in with their image, it is fed to the same Convolutional Neural Network (CNN), which extracts a 128-dimension vector.
  • This 128-dimension vector is passed to the classifier, which compares it against all the pre-trained face encodings.
  • The classifier returns the name of the encoding whose distance is less than 0.3 (the threshold I defined), as sketched below.
To learn more about this face recognition approach, please refer to this series of insightful articles from Adam Geitgey. Thanks to Adam for his awesome GitHub repository; it is a great place to start if you are looking for open source face recognition code.
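A minimal sketch of the verification step, again assuming the face_recognition library; the 0.3 threshold follows the description above, while the file and classifier names are placeholders.

```python
# Verification sketch: encode the login image and compare against trained encodings.
import pickle

import face_recognition

with open("face_classifier.pkl", "rb") as f:
    classifier = pickle.load(f)

image = face_recognition.load_image_file("login_attempt.jpg")
encodings = face_recognition.face_encodings(image)

if encodings:
    distances, _ = classifier.kneighbors([encodings[0]], n_neighbors=1)
    if distances[0][0] < 0.3:                              # distance threshold
        print("Verified as:", classifier.predict([encodings[0]])[0])
    else:
        print("No match within threshold")
```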

The Challenges: 


This system works well; however, the challenges begin when we want to add a new user to the face recognition system. When a new user is registered, the user's encodings and name must be incorporated into the existing pre-trained data so the classifier can verify them. To do this, we have to re-train on the entire data set covering all registered users and save the new classifier. This approach is not scalable for the following reasons:

  • Adding new users to the system is complicated, as it requires re-training for every user.
  • Each time we add a new user, we need to train on all existing registered user images. Generating 128-dimension vectors for all images during training is a time- and resource-consuming process that requires significant GPU capacity, and repeating it for every new user is not practical.
  • Deleting an existing user is complicated, as it means manipulating the classifier object stored as a pickle file.
  • Scaling out to multiple servers is not feasible with this approach: because the classifier is stored as a pickle file, we need to copy the file to every server each time the training data changes.
  • We cannot add multiple images per person, as it increases the computation.

The Solution:


After dealing with these issues for some time, I changed my approach to overcome the challenges mentioned above. The current implementation works as follows:


Face Recognition - Computing Euclidean distance in PostgreSQL