How to perform join operation in BigQuery? Exploring BigQuery Join Operations: Broadcast and Hashing Joins & Nested and Repeated Structures.

BigQuery - SQL Joins
BigQuery - SQL Joins (Photo by Resource Database on Unsplash


SQL joins are used to combine columns from multiple tables to get desired result set. In a typical Relational model we use normalized tables, each table represents an entity (example: employee, department, etc) and its relationships and when we need to get data from more than one tables, for example employee name and employee department, we use joins to combine employee name column from employee table, department name column from department table based on employee number key column, which is available on both the tables.


Similarly, typical data warehouse setup follows Star or Snowflake schema consisting of a primary fact table and satellite dimension tables. Fact tables represents events (example: orders table in a ecommerce business) and dimension table represents attributes and slowly changing information (example: customer, product tables).


Star Schema Example (Source: Wikipedia by SqlPac)


Fact tables are denormalized for faster reads and dimension tables are normalized. In most cases, your analytics query need to join a fact table with multiple dimension tables to get the desired results, for example finding top 10 customers whose transaction value is greater than $100 and their ordered products. These type of queries requires join operation on transaction, customer and product tables to get desired result sets.


BigQuery as a serverless enterprise data warehouse support ANSI SQL joins types such as inner join, left outer join, right outer join, full outer join and cross join. Let's discuss about how join works in BigQuery, basic join types and try it out on BigQuery public dataset: `bigquery-public-data.thelook_ecommerce`


How join operations are performed in BigQuery?

  • Create two result set items: In BigQuery join operation can be performed on tables, subquery, Arrays or WITH statement.
  • Identify a join condition: The join condition does not need to be an equality condition; any Boolean condition can be used. The join condition specified using ON clause or USING clause.
  • Select the column list you wanted in final result set: If the joining result set/table consists of identical column names, use alias to differentiate the columns.
  • Specify the join type: if no joining type is specified, by default considered as inner join.


Inner Join

Inner Join returns only rows that meets the join condition. Effectively calculates Cartesian product of two tables based on a common values and discard all rows that do not meet the join condition.


BigQuery - SQL Inner Join
BigQuery - SQL Inner Join

Joining table A and B and joining B and A are same. If the column names are same, join condition can be specified using USING clause.


Example:

List product names and its order status.


Product schema:



bq query --nouse_legacy_sql 'SELECT column_name, data_type, is_nullable FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = "products";'


+------------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+------------------------+-----------+-------------+
| id | INT64 | YES |
| cost | FLOAT64 | YES |
| category | STRING | YES |
| name | STRING | YES |
| brand | STRING | YES |
| retail_price | FLOAT64 | YES |
| department | STRING | YES |
| sku | STRING | YES |
| distribution_center_id | INT64 | YES |
+------------------------+-----------+-------------+


Order Items schema:



bq query --nouse_legacy_sql 'SELECT column_name, data_type, is_nullable FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = "order_items";'


+-------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------------+-----------+-------------+
| id | INT64 | YES |
| order_id | INT64 | YES |
| user_id | INT64 | YES |
| product_id | INT64 | YES |
| inventory_item_id | INT64 | YES |
| status | STRING | YES |
| created_at | TIMESTAMP | YES |
| shipped_at | TIMESTAMP | YES |
| delivered_at | TIMESTAMP | YES |
| returned_at | TIMESTAMP | YES |
| sale_price | FLOAT64 | YES |
+-------------------+-----------+-------------+


Selecting Product name and Order status using inner join:




SELECT
    p.name as `Product Name`,
    oi.status as `Order Status`
FROM
     `bigquery-public-data.thelook_ecommerce.order_items` as oi
INNER JOIN
    `bigquery-public-data.thelook_ecommerce.products` as p
ON
    oi.product_id = p.id
LIMIT 10;

+--------------------------------------------------------------------------------------------------------+--------------+
| Product Name | Order Status |
+--------------------------------------------------------------------------------------------------------+--------------+
| Elegant PASHMINA SCARF WRAP SHAWL STOLE | Shipped |
| Elegant PASHMINA SCARF WRAP SHAWL STOLE | Shipped |
| Elegant PASHMINA SCARF WRAP SHAWL STOLE | Complete |
| Elegant PASHMINA SCARF WRAP SHAWL STOLE | Complete |
| Elegant PASHMINA SCARF WRAP SHAWL STOLE | Returned |
| Elegant PASHMINA SCARF WRAP SHAWL STOLE | Returned |
| Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Complete |
| Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Complete |
| Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Returned |
| Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Returned |
+--------------------------------------------------------------------------------------------------------+--------------+


Left Outer Join

Returns all rows from the left side of the join even if the rows not matching join predicate and only matching rows from the right side of the join. All the remaining rows from the right side of join returned as NULL.

BigQuery - SQL Left Join
BigQuery - SQL Left Join


Example

List all the products which never ordered (products not ordered even once).



SELECT
p.name as `Product Name`,
oi.order_id as `Order ID`
FROM
`bigquery-public-data.thelook_ecommerce.products` as p
LEFT JOIN
`bigquery-public-data.thelook_ecommerce.order_items` as oi
ON
oi.product_id = p.id
WHERE oi.status IS NULL
LIMIT 10;

+-------------------------------------------------------------------------------------------+----------+
| Product Name | Order ID |
+-------------------------------------------------------------------------------------------+----------+
| Cloudveil Men's Kahuna Short | NULL |
| 30 inch Inseam Smith's Denim Fleece - lined Jeans Dark Stonewash | NULL |
| RUDE Dark Vintage Skinny Jeans | NULL |
| Nike Golf Men's Flat Front Tech Short | NULL |
| Hurley Men's Poolside Boardwalk Walkshort | NULL |
| Polo Ralph Lauren Men's Walk Golf Shorts Red | NULL |
| Men's Cashmere Cardigan | NULL |
| Lucky Brand Men's Boxer Set | NULL |
| ililily Simple Basic 100% Cotton Baseball 3/4 raglan sleeve T-shirt for Men (tshirts-008) | NULL |
| Pendleton Men's Classic Fit Board Shirt | NULL |
+-------------------------------------------------------------------------------------------+----------+


The above query can rewritten using subqueries with EXISTS condition to fetch the results without using join as we just need the Product name, however we will continue with this syntax for explaining joins.


Right Outer Join

Returns all rows from the right side of join even if the rows are not matching the join predicate and returns only matching rows from the left side of the join. All the remaining rows from left side of the side returned as NULL.


BigQuery - SQL Right Join
BigQuery - SQL Right Join


Example

List all the orders with no products (where products removed from the product table, but there were previous orders exists for those products).




SELECT
p.name as `Product Name`,
oi.order_id as `Order ID`,
FROM
`bigquery-public-data.thelook_ecommerce.products` as p
RIGHT JOIN
`bigquery-public-data.thelook_ecommerce.order_items` as oi
ON
oi.product_id = p.id
WHERE p.name IS NULL
LIMIT 10;

+--------------+----------+
| Product Name | Order ID |
+--------------+----------+
| NULL | 39411 |
| NULL | 75248 |
| NULL | 76724 |
| NULL | 117507 |
| NULL | 32556 |
| NULL | 42489 |
| NULL | 57019 |
| NULL | 92339 |
| NULL | 49356 |
| NULL | 75177 |
+--------------+----------+


Full Outer Join

Full join returns all rows from both side of join for matching conditions and If a row from one side does not join any other row in other side, the row returns with NULL value for other side of the table.


BigQuery - SQL Full Outer Join
BigQuery - SQL Full Outer Join



Example

List all products never ordered and orders with no products


SELECT
p.name as `Product Name`,
oi.order_id as `Order ID`,
FROM
`bigquery-public-data.thelook_ecommerce.products` as p
FULL OUTER JOIN
`bigquery-public-data.thelook_ecommerce.order_items` as oi
ON
oi.product_id = p.id
WHERE p.name IS NULL OR oi.order_id IS NULL;

+------------------------------------------------------------------------------------------------------------------------+----------+
| Product Name | Order ID |
+------------------------------------------------------------------------------------------------------------------------+----------+
| Lucky Brand Men's Boxer Set | NULL |
| NULL | 39411 |
| NULL | 75248 |
| NULL | 76724 |
| NULL | 117507 |
| NULL | 32556 |
| NULL | 42489 |
| NULL | 57019 |
| NULL | 92339 |
| So So Happy Junior Ozzie Hoodie in Blue | NULL |
| ililily Simple Basic 100% Cotton Baseball 3/4 raglan sleeve T-shirt for Men (tshirts-008) | NULL |
| Cloudveil Men's Kahuna Short | NULL |
| Allegra K Woman Faux Crystal Detail Leopard Print Strapped Tank Top M | NULL |
| Fox Womens Juniors Trials Pullover Hoody Sweater | NULL |
| Russell Athletic Women's Dri-Power Fleece Mid Rise Pant | NULL |
| A. Byer Juniors Tropical Cambridge Pant | NULL |
| Lily Of France Womens Sport In Action Sport Bra | NULL |
+------------------------------------------------------------------------------------------------------------------------+----------+



CROSS JOIN

Cross join returns Cartesian product of joining tables, each row from table A  join with each row from table B, so the result set of table A with M rows and table B with N rows will be product of rows from table A and table B (MxN rows).

BigQuery - SQL Cross Join
BigQuery - SQL Cross Join

Example:


List all possible products a customer can buy


SELECT
CONCAT(u.first_name, " ", u.last_name) as `Customer Name`,
p.name as `Product Name`
FROM
`bigquery-public-data.thelook_ecommerce.users` as u
CROSS JOIN
`bigquery-public-data.thelook_ecommerce.products` as p
LIMIT 10;

+---------------+---------------------------------------------------------------+
| Customer Name | Product Name |
+---------------+---------------------------------------------------------------+
| Michelle Rowe | TYR Alliance Team Splice Jammer |
| Michelle Rowe | TYR Sport Men's Solid Racer Swim Suit |
| Michelle Rowe | TYR Sport Men's Square Leg Short Swim Suit |
| Michelle Rowe | TYR Sport Men's 4-Inch Nylon Trainer-A Swim Suit |
| Michelle Rowe | TYR Sport Men's Swim Short/Resistance Short Swim Suit |
| Michelle Rowe | TYR Sport Men's Poly Mesh Trainer Swim Suit |
| Michelle Rowe | 2XU Men's Swimmers Compression Long Sleeve Top |
| Michelle Rowe | TYR Sport Men's Solid Durafast Jammer Swim Suit |
| Michelle Rowe | TYR Sport Men's Alliance Durafast Splice Square Leg Swim Suit |
| Michelle Rowe | TYR Sport Men's Solid Jammer Swim Suit |
+---------------+---------------------------------------------------------------+


Implicitly writing cross join using a comma is known as Comma cross join. The below example shows the comma cross join syntax for the same use case.




SELECT
CONCAT(u.first_name, " ", u.last_name) as `Customer Name`,
p.name as `Product Name`
FROM
`bigquery-public-data.thelook_ecommerce.users` as u, `bigquery-public-data.thelook_ecommerce.products` as p
ORDER BY
1 DESC,
2 DESC
LIMIT 10;

+---------------+------------------------------------------------------------------------------------------------------------+
| Customer Name | Product Name |
+---------------+------------------------------------------------------------------------------------------------------------+
| Zoe Wood |  Â Exclusive Hawaiian Tropic Sunset In Paradise Aloha Shirt |
| Zoe Wood |  Â Exclusive Hawaiian Sunset In Paradise Aloha Shirt |
| Zoe Wood |  Â Exclusive Hawaiian Orchid (100% cotton) Aloha Shirt |
| Zoe Wood |  Â Exclusive Hawaiian Flowers In Paradise Aloha Shirt |
| Zoe Wood |  Â Exclusive Hawaiian All New Hibiscus In Paradise Aloha Shirt |
| Zoe Wood | white Half slip Culotte Underworks Pettipants nylon for women plus size trim double lace knickers Lingerie |
| Zoe Wood | under.me The Leggings |
| Zoe Wood | tokidoki X Marvel Women Retro Spidey Dark Red Hoody |
| Zoe Wood | tokidoki X Marvel Women Captain Kitty Hoodie |
| Zoe Wood | tokidoki The Marvel Pop Zip Hoody in Dark Gray Heather |
+---------------+------------------------------------------------------------------------------------------------------------+



BigQuery Joining Strategies


BigQuery commonly use two joining types: Broadcast join and Hashing join. Typically, when one of the tables is smaller (in few hundred MBs) BigQuery uses Broadcast join and when two large tables BigQuery uses Hashing join, let's discuss about these join types.

Broadcast Join


BigQuery uses broadcasting join when one of the table is smaller (few hundred MBs in size). Logically, when performing join operation in a distributed setup, to join a row from left side table to rows from right side table, we need to bring together both the rows into single worker node. When joining, if one of the table is smaller, it is efficient to broadcast/send rows from that table to worker where rows from larger table is compared.

Typically, when joining, key columns are used to match the rows from both the tables, let say the key value from left side smaller table is 27 matched with every rows from right side larger table with key 27, the broadcast join will send this key (rows from smaller table) to every worker where rows from larger table is computed. Therefore, by broadcasting BigQuery ensures rows from both the sides are colocated and performs local join efficiently.

Let's understand broadcast join with an example. We have a small table distribution center table with 10 rows and a large table inventory items with 491237 rows. In terms of table size it is not that big, only 53 MBs, but for understanding the broadcast join, this is suffice. Table description given below:


Distribution Centers:


❯ bq show --project_id=bigquery-public-data thelook_ecommerce.distribution_centers

Table bigquery-public-data:thelook_ecommerce.distribution_centers

Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Total Logical Bytes Total Physical Bytes Labels
----------------- --------------------- ------------ ------------- ------------ ------------------- ------------------ --------------------- ---------------------- --------
23 Mar 10:06:38 |- id: integer 10 409 409 13988
|- name: string
|- latitude: float
|- longitude: float



Inventory Items:



❯ bq show --project_id=bigquery-public-data thelook_ecommerce.inventory_items

Table bigquery-public-data:thelook_ecommerce.inventory_items

Last modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes Labels
----------------- -------------------------------------------- ------------ ------------- --------------------- ---------------------- --------
23 Mar 10:06:47 |- id: integer 491237 81455014 81455014 53738166
|- product_id: integer
|- created_at: timestamp
|- sold_at: timestamp
|- cost: float
|- product_category: string
|- product_name: string
|- product_brand: string
|- product_retail_price: float
|- product_department: string
|- product_sku: string
|- product_distribution_center_id: integer


Join tables to get Product name and distribution center name:



SELECT
iv.product_name as `Product Name`,
dc.name as `Distribution Center`
FROM
`bigquery-public-data.thelook_ecommerce.inventory_items` as iv
INNER JOIN
`bigquery-public-data.thelook_ecommerce.distribution_centers` as dc
ON
iv.product_distribution_center_id = dc.id;


+----------------------------------------------------------------------------+---------------------+
| Product Name | Distribution Center |
+----------------------------------------------------------------------------+---------------------+
| Quiksilver Waterman Men's On The Rise | Houston TX |
| Quiksilver Waterman Men's On The Rise | Houston TX |
| Quiksilver Waterman Men's On The Rise | Houston TX |
| Quiksilver Waterman Men's On The Rise | Houston TX |
| KEEN Women Bellingham Low Ultralite Sock | Houston TX |
| KEEN Women Bellingham Low Ultralite Sock | Houston TX |
| KEEN Women Bellingham Low Ultralite Sock | Houston TX |
| Husky Animal Hat with Mittens | Houston TX |
| Husky Animal Hat with Mittens | Houston TX |
| Husky Animal Hat with Mittens | Houston TX |
| Husky Animal Hat with Mittens | Houston TX |
| Volcom Men's Argyle Socks | Houston TX |
| Volcom Men's Argyle Socks | Houston TX |
| Volcom Men's Argyle Socks | Houston TX |
| Volcom Men's Argyle Socks | Houston TX |
| Volcom Men's Argyle Socks | Houston TX |
| Volcom Men's Argyle Socks | Houston TX |
| Wendy Glez Rose Cami Black | Houston TX |
| Wendy Glez Rose Cami Black | Houston TX |
| Wendy Glez Rose Cami Black | Houston TX |
| Carhartt Women's Hooded Knit Jacket | Houston TX |
| Carhartt Women's Hooded Knit Jacket | Houston TX |
| Carhartt Women's Hooded Knit Jacket | Houston TX |
| Carhartt Women's Hooded Knit Jacket | Houston TX |
| Carhartt Women's Hooded Knit Jacket | Houston TX |
| NEW Aluminum Credit Card Wallet - RFID Blocking Case - Pink (New Products) | Houston TX |
| NEW Aluminum Credit Card Wallet - RFID Blocking Case - Pink (New Products) | Houston TX |
| NEW Aluminum Credit Card Wallet - RFID Blocking Case - Pink (New Products) | Houston TX |
+----------------------------------------------------------------------------+---------------------+





Let's check the query execution plan:

BigQuery - Broadcast Join - Execution Plan
BigQuery - Broadcast Join


As you can see from the execution graph there are 3 stages input, joining and output. In S01 stage total 10 rows of id and name columns read operation from distribution centers table, it is described in the below execution plan:

BigQuery - Broadcast join - input stage
BigQuery - Broadcast join - input stage

In S02 stage, two READ operations, 10 rows from S01 and 4,91, 247 rows from inventory items table and BigQuery broadcasted all 10 rows to ALL the rows from the right side of the table for matching. This is specified by the keyword INNER HASH JOIN EACH WITH ALL and output is written to S03 with columns Product Name and Distribution Center.


BigQuery - Broadcast Join - Joining Stage
BigQuery - Broadcast Join - Joining Stage

Hash Join

When joining two large tables BigQuery uses Hash joins. This is computationally more expensive join operation. Broadcasting larger table rows to every worker is inefficient, therefore BigQuery hashes table rows from both side of the join operation, this ensures rows with same key (based on hash value) routed and end up in the same buckets, enabling more efficient local join operation.


Let's understand hash join with an example, we will cross join products and users table to illustrate hash join. This output of cross join is all possible combination of user and products.

Users:


bq show --project_id=bigquery-public-data thelook_ecommerce.users;

Table bigquery-public-data:thelook_ecommerce.users

Last modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes
----------------- --------------------------- ------------ ------------- --------------------- ----------------------
24 Mar 10:23:38 |- id: integer 100000 15816853 15816853 25734386
|- first_name: string
|- last_name: string
|- email: string
|- age: integer
|- gender: string
|- state: string
|- street_address: string
|- postal_code: string
|- city: string
|- country: string
|- latitude: float
|- longitude: float
|- traffic_source: string
|- created_at: timestamp



Products:


bq show --project_id=bigquery-public-data thelook_ecommerce.products;


Table bigquery-public-data:thelook_ecommerce.products

Last modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes
----------------- ------------------------------------ ------------ ------------- --------------------- ----------------------
24 Mar 10:23:37 |- id: integer 29120 4285975 4285975 9762698
|- cost: float
|- category: string
|- name: string
|- brand: string
|- retail_price: float
|- department: string
|- sku: string
|- distribution_center_id: integer



Cross join users and products:



WITH
users as (
SELECT
id,
CONCAT(first_name, " ", last_name) as `CustomerName`
FROM
`bigquery-public-data.thelook_ecommerce.users`),

products as (
SELECT
id,
name,
distribution_center_id
FROM
`bigquery-public-data.thelook_ecommerce.products`)

SELECT * FROM users CROSS JOIN products;


+-------+------------------+-------+---------------------------------------------------------------+------------------------+
| id | CustomerName | id_1 | name | distribution_center_id |
+-------+------------------+-------+---------------------------------------------------------------+------------------------+
| 96585 | Michael Anderson | 27457 | TYR Sport Men's Solid Durafast Jammer Swim Suit | 1 |
| 96585 | Michael Anderson | 27466 | TYR Sport Men's Swim Short/Resistance Short Swim Suit | 1 |
| 96585 | Michael Anderson | 27529 | TYR Sport Men's Poly Mesh Trainer Swim Suit | 1 |
| 96585 | Michael Anderson | 27552 | TYR Sport Men's Solid Racer Swim Suit | 1 |
| 96585 | Michael Anderson | 27487 | TYR Sport Men's 4-Inch Nylon Trainer-A Swim Suit | 1 |
| 96585 | Michael Anderson | 27481 | TYR Alliance Team Splice Jammer | 1 |
| 96585 | Michael Anderson | 27537 | TYR Sport Men's Alliance Durafast Splice Square Leg Swim Suit | 1 |
| 96585 | Michael Anderson | 27510 | TYR Sport Men's Solid Jammer Swim Suit | 1 |
| 96585 | Michael Anderson | 27569 | 2XU Men's Swimmers Compression Long Sleeve Top | 1 |
| 96585 | Michael Anderson | 27445 | TYR Sport Men's Square Leg Short Swim Suit | 1 |
+-------+------------------+-------+---------------------------------------------------------------+------------------------+



Execution plan:

BigQuery - Hash Join - Execution Plan
BigQuery - Hash Join - Execution Plan

We are interested in S03 Join stage, here BigQuery performs EACH WITH EACH matching, i.e every row from left side of table is matched with every row from right side of the table. Additionally, we are limiting the number of rows to 100 to avoid CROSS JOIN output errors.

 

BigQuery - Hash Join - EACH WITH EACH
BigQuery - Hash Join - EACH WITH EACH


How to optimize BigQuery Join operations?


BigQuery recommends following best practices for improving join performance.


Avoiding Self-joins

In Self-join, a table is joined with itself. For example, you may self join employee table to get employee and supervisor. As self-joins are typically compute row-dependant relationships, it may potentially squares the number of output rows. So recommended best practices is to use analytic window function.

Avoid joins that generate more output than input

Typically, when use Cross join two big tables, every row from first table is joined with every row in second table, you may observed the query may not be completing as well. It is recommended to avoid cross joins not just in terms of performance, it impacts the overall cost/slot usage as well.

To avoid performance issues associated with joins that generate more output than input, use aggregate functions to pre-aggregate the data or use window functions.

Join on integer data type columns

It is easier to compare the fixed length Integer columns than variable length string columns, so if when possible use integer data type column to join tables.

Broadcast Joins

When you join multiple tables with larger number of rows and smaller number of rows, it is recommended to place the table with larger number of rows first, followed by tables with smaller number of rows and then place the other tables in decreasing size.

Like other database systems, BigQuery optimizer can determine the best execution plan based on the table statistics, still it is recommended to place the tables with larger rows first.

Hash Joins

In broadcast join (i.e) joining a larger table and smaller table, the system sends smaller table's records to every slots that handles join operation for efficient matching. But when there is no smaller tables, the system cannot send larger table records, so BigQuery uses hash and shuffle operation to shuffle first and second table to fetch the matching keys and sends to a same slot for performing local join. Data shuffling is an expensive operation and it impacts the query performance. Using partitioning and clustering improves overall data shuffling performance. 

Reduce the size of data being joined

Performing join operation on smaller dataset is better, whenever possible filter the dataset before joining. Rewriting queries to fetch only required data from both the joining tables / subqueries helps in reducing  heavy shuffling and other such complex operation on larger datasets and improves overall query performance. When there is a need to perform GROUP BY and JOIN operation on a query, perform aggregation earlier in the query - this will reduce the number of rows joined.


Experimenting with BigQuery Nested and Repeated Structures


Generally speaking reading from a flattened table is much faster and efficient than reading from joining tables. Denormalizing schema helps us to achieve flatten table structure where events and metrics available in a same table. In BigQuery we can use repeated and nested fields to denormalize a table.


Let's analyse the denormalized BigQuery tables using nested and repeated fields and joining normalized table performance with an example.

We have two tables Orders and Orders Nested. In Orders table, we have user_id column which references id column of Users table and orders_nested table is a denormalized table, where users details are stored in a nested column with STRUCT data type.


Orders:



bq show --project_id=bigquery-public-data thelook_ecommerce.orders;

Table bigquery-public-data:thelook_ecommerce.orders

Last modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes
----------------- ---------------------------- ------------ ------------- --------------------- ----------------------
25 Mar 09:46:02 |- order_id: integer 125402 6776727 6776727 14873107
|- user_id: integer
|- status: string
|- gender: string
|- created_at: timestamp
|- returned_at: timestamp
|- shipped_at: timestamp
|- delivered_at: timestamp
|- num_of_item: integer


Orders Nested:



bq show BigQueryLab.orders_nested;

Table bpo-bu-prod-datawarehouse:BigQueryLab.orders_nested

Last modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes
----------------- ---------------------------- ------------ ------------- --------------------- ----------------------
25 Mar 13:19:24 |- order_id: integer 125402 10172371 10172371 2812454
|- status: string
|- gender: string
|- created_at: timestamp
|- returned_at: timestamp
|- shipped_at: timestamp
|- delivered_at: timestamp
|- num_of_item: integer
+- users: record
| |- user_id: integer
| |- first_name: string
| |- last_name: string
| |- city: string


We will get all the orders and corresponding user_id, first_name, last_name and city columns using above tables and analyse its execution plan.


Joining Orders and Users table:



SELECT
orders.*,
users.first_name,
users.last_name,
users.city
FROM
`bigquery-public-data.thelook_ecommerce.orders` as orders
INNER JOIN
`bigquery-public-data.thelook_ecommerce.users` as users
ON
orders.user_id = users.id;

+----------+---------+---------+--------+---------------------+----------------------+------------+-----------+----------------+
| order_id | user_id | status | gender | created_at | shipped_at | first_name | last_name | city |
+----------+---------+---------+--------+---------------------+----------------------+------------+-----------+----------------+
| 15 | 15 | Shipped | F | 2023-03-24 13:06:16 | 2023-03-24 18:29:16 | Jennifer | Morris | City of Yantai |
| 21 | 20 | Shipped | F | 2021-07-27 03:41:00 | 2021-07-28 00:11:00 | Alice | Davis | Fort Worth |
| 22 | 20 | Shipped | F | 2022-08-24 03:41:00 | 2022-08-25 11:06:00 | Alice | Davis | Fort Worth |
| 26 | 23 | Shipped | F | 2020-10-12 16:54:00 | 2020-10-13 08:14:00 | Kristy | Miller | Bella Vista |
| 27 | 23 | Shipped | F | 2021-03-19 16:54:00 | 2021-03-20 17:46:00 | Kristy | Miller | Bella Vista |
| 32 | 27 | Shipped | F | 2022-03-06 09:01:00 | 2022-03-07 10:43:00 | Lindsey | Singh | London |
| 51 | 44 | Shipped | F | 2022-05-13 12:38:00 | 2022-05-13 23:22:00 | Jeanne | Lamb | Pontal |
| 54 | 46 | Shipped | F | 2022-07-06 14:51:00 | 2022-07-08 07:26:00 | Deborah | Nunez | Chengdu |
| 56 | 46 | Shipped | F | 2022-10-03 14:51:00 | 2022-10-04 18:25:00 | Deborah | Nunez | Chengdu |
| 57 | 47 | Shipped | F | 2022-04-05 15:26:00 | 2022-04-07 04:52:00 | Maria | Li | Chicago |
+----------+---------+---------+--------+---------------------+----------------------+------------+-----------+----------------+


Some of the query statistics given below:



"totalBytesProcessed": "10258525",
"totalBytesBilled": "20971520",
"totalSlotMs": "2208",
"finalExecutionDurationMs": "1431"


When joining users and orders table, the query processed around 9 MB of data and returned results in 1.4 seconds and used 2208 milliseconds of slot time and total billable bytes is 20MB. Let's get the same result set from running SELECT statement from orders_nested table.


Selecting from orders_nested table:



SELECT * FROM BigQueryLab.orders_nested;

+----------+---------+--------+---------------------+---------------------+-----------------------------------------------------------------------------------------+
| order_id | status | gender | created_at | shipped_at | users |
+----------+---------+--------+---------------------+---------------------+-----------------------------------------------------------------------------------------+
| 2028 | Shipped | F | 2021-05-14 16:37:00 | 2021-05-17 00:45:00 | {"user_id":"1585","first_name":"Mariah","last_name":"Johnson","city":"Xingtai"} |
| 2842 | Shipped | F | 2021-09-07 01:22:00 | 2021-09-08 10:18:00 | {"user_id":"2215","first_name":"Jody","last_name":"Smith","city":"Detroit"} |
| 4134 | Shipped | F | 2020-05-19 01:02:00 | 2020-05-20 22:22:00 | {"user_id":"3247","first_name":"Beth","last_name":"Clark","city":"Shanghai"} |
| 4817 | Shipped | F | 2020-12-03 05:35:00 | 2020-12-06 00:00:00 | {"user_id":"3799","first_name":"Donna","last_name":"Spencer","city":"South Hill"} |
| 5259 | Shipped | F | 2023-02-20 00:45:00 | 2023-02-21 09:56:00 | {"user_id":"4147","first_name":"Kristi","last_name":"Luna","city":"Millington"} |
| 5279 | Shipped | F | 2022-09-08 00:55:00 | 2022-09-08 06:33:00 | {"user_id":"4160","first_name":"Caroline","last_name":"Gillespie","city":"Bloomington"} |
| 6412 | Shipped | F | 2022-07-29 02:08:00 | 2022-07-31 02:04:00 | {"user_id":"5065","first_name":"Rachael","last_name":"Wilson","city":"Cape Coral"} |
| 7139 | Shipped | F | 2022-05-12 13:37:00 | 2022-05-14 13:41:00 | {"user_id":"5636","first_name":"Kaitlin","last_name":"Christian","city":"Houston"} |
| 7385 | Shipped | F | 2021-02-09 03:35:00 | 2021-02-11 20:08:00 | {"user_id":"5823","first_name":"Jasmine","last_name":"Stewart","city":"Zhangzhou"} |
| 7692 | Shipped | F | 2020-11-19 02:29:00 | 2020-11-21 15:42:00 | {"user_id":"6074","first_name":"Kim","last_name":"Carey","city":"Pinheiro"} |
+----------+---------+--------+---------------------+---------------------+-----------------------------------------------------------------------------------------+



Some of the above query statistics given below:



"totalBytesProcessed": "10172371",
"totalBytesBilled": "10485760",
"totalSlotMs": "1141",
"finalExecutionDurationMs": "1229"


When selecting from the nested table directly without any join, the query processed around 9 MB of data and returned results in 1.2 seconds and used 1141 milliseconds of slot time and total billable bytes is 10 MB.


Following observations made from this experimentation:

  • Read from nested table is comparatively faster than joining from multiple tables.
  • In both scenarios bytes processed is same, but billable bytes processed is higher when joining the tables.
  • Total slot time consumed is higher when joining tables and comparatively low when reading from a nested table.

Nested tables looks better based on the current experimentations, however please note:

  • The tables used in this experiment is smaller in size with few number of rows - so the difference is not big, you can see few milliseconds of difference in execution time, but when testing with bigger tables, this results may vary.
  • The query used is very simple and straightforward, data shuffling is may not be a major consideration here, my guess is when joining bigger tables with uneven distribution, data shuffling will be more.
  • When using nested columns, the table size is increased as we store the additional columns in the same table, this increases the storage cost. 
In addition to nested columns, we can use partitioning and clustering features to efficiently store and retrieve data in BigQuery. However, these observations are true for this specific instance, you need to test your queries and use case to find what works best for you.

Summary


We have covered basic join types such as inner join, left outer join, right outer join, full outer join, cross join, etc. and how to perform join operations in BigQuery and discussed examples with BigQuery public datasets. Also, we have covered common joining strategies in BigQuery such as broadcast join and hash join and explained with an example query execution plan. Finally, we have discussed my experimentation with BigQuery nested and repeated structure and analysed the performance of joining and denormalizing table. I hope this article is helpful and informative. Share your thoughts in comments section, thanks for reading!.

0 thoughts:

Post a Comment