This post contains a brief note on a marathon Big Query SQL immersion session

SQL is the most underrated skill of a data engineer/data scientist. Even though many modern library data munging verbs mirror SQL terminology, most often, data crunching is done using R, Python, Spark, Hive programming. With the advent of cloud services, there is no need to spend time in setting up a Spark cluster or Hadoop cluster. With a click of a button(AWS Glue/GCP DataProc), the cluster is up and running, waiting for a number cruncher to fire away queries. There is no issue in sticking to the above route to get the job done. As one can see, the reason to write SQL queries usually diminishes, as there are many powerful data manipulation libraries in the open source world.

However there might be situations where you might have no choice but to resort to SQL queries. Take Big Query for example. It promises to democratize business analytics skills by providing a powerful columnar storage database and exposing the UI as a standard SQL query engine. One can rely the age old SQL lingo and get powerful query results within seconds. All the mighty distributed processing architecture is abstracted away from the user. A recent trend in the open source community is wrapper libraries for Big Query: you use your favorite programming language, write the data munging code, and the library converts in to SQL query and submits to Big Query. For example, the bigrquery package in R completely abstracts the SQL syntax. This is good for many R programmers as you can be in your favorite environment and do the analysis. However when you hit an issue, often it is the case that the SQL conversion is buggy. There are many other reasons, for one to have a solid working knowledge of SQL queries: one of them being Big Query domination is increasing in the data warehousing space and you might want to write native SQL queries and get the job done.

Am reasonably comfortable with SQL as I have been using it forever. However, for some reason, I felt that I should revisit the syntax and spend a few hours with it, focusing on understanding the language, rather than getting my job done for some number crunching exercise, that is part of a model building work.

Learnings from SQL immersion

  • Understand the basic structure of SQL statement
  • Subqueries and when to use them
  • JOIN queries and when to use them
  • Correlated subqueries:Inner query is related to the Outer query
  • UNNEST function
  • Create custom index columns based on certain columns
  • WINDOW functionality
  • Use functions such as OVER, LAG, ROW_NUMBER
  • Using date functions
  • Creating a repeated value field
  • One can write an ORDER BY clause and refer to the position of a specific column that you want to order by
  • One can refer to a select statement by an ALIAS and refer to that ALIAS in your queries
  • Usage of HAVING clause. It is mainly used for filtering on some aggregate that you are building as a part of your query
  • Everything is a table. You can select from a select statement. You can write subquery and write a select from it
  • Use WITH statement to refer a subquery so that it can be considered as a table
  • Use of ARRAY_LENGTH function to count values in an array
  • Use of ARRAY_AGG function to aggregate values in to an array
  • Big Query UI editor shortcuts - Surprised to know that there is no way to save the query using a keyboard shortcut
  • CASE statement usage
  • COALESCE statement usage. It can be used to populate a dummy value for all null values of a field
  • Use of LIKE operator to check for string matches
  • Besides a JOIN statement, you can put in a WHERE clause. It is better to put in a filter condition with the JOIN statement so that the resulting table is of manageable size
  • Understanding the power of creating small tables via subquery is key to mastering SQL
  • GENERATE_ARRAY is the a field that creates a struct data in a cell value.
  • UNNEST can be used to flatten a repeated field
  • One can use small case statement to create conditional summaries. One can write a case statement and feed in to the aggregate function
  • One can create moving averages using correlated subqueries
  • Correlated subquery can be used to generate summary reports
  • UNNEST function is Big Query specific function
  • WITH OFFSET AS INDEX gives the index as a column
  • You can create lags via correlated subqueries by relying on the row index
  • One can create PARTITION by multiple columns
  • One can use UNNEST functionality and CROSSJOIN to flatten a repeated field
  • There are many built in functions in Big Query that support regex operations
  • UDFs create a function by using a SQL expression or JavaScript code. These can be used in the query expressions

Deliberate Practice (55 queries)

Basis Queries in BQ

Order by column position

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
SELECT
  id,
  title,
  quarter,
  quarter_views
FROM
  `jrjames83-1171.sampledata.top_questions`
WHERE
  title = "Big Query - SQL Immersion"
  AND quarter > '2019-01-01'
ORDER BY
  3 DESC

Alias to refer to a table

1
2
3
4
5
6
7
SELECT
  p.staff_id,
  rental_id
FROM
  `jrjames83-1171.sampledata.payments` p
LIMIT
  1000

Simple JOIN

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
SELECT
  p.staff_id,
  p.rental_id,
  r.inventory_id
FROM
  `jrjames83-1171.sampledata.payments` p
JOIN
  `jrjames83-1171.sampledata.rental` r
ON
  p.rental_id = r.rental_id
WHERE
  p.amount > 0
  AND p.staff_id !=1
LIMIT
  10

Select from Select

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
SELECT
  *
FROM (
  SELECT
    p.staff_id,
    p.rental_id,
    r.inventory_id
  FROM
    `jrjames83-1171.sampledata.payments` p
  JOIN
    `jrjames83-1171.sampledata.rental` r
  ON
    p.rental_id = r.rental_id
  WHERE
    p.amount > 0
    AND p.staff_id NOT IN(1,
      3,
      5)
  LIMIT
    10)

On the BQ UI, one can run the selected query of a bigger query

More elements in a select query

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
SELECT
  DISTINCT t.inventory_id,
  i.film_id
FROM (
  SELECT
    p.staff_id,
    p.rental_id,
    r.inventory_id
  FROM
    `jrjames83-1171.sampledata.payments` p
  JOIN
    `jrjames83-1171.sampledata.rental` r
  ON
    p.rental_id = r.rental_id
  WHERE
    p.amount > 0
    AND p.staff_id NOT IN(1,
      3,
      5)
  LIMIT
    10) t
JOIN
  `jrjames83-1171.sampledata.inventory` i
ON
  t.inventory_id = i.inventory_id

Use of DISTINCT

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
SELECT
  COUNT(DISTINCT(id)),
  COUNT(id),
  tag
FROM
  `jrjames83-1171.sampledata.top_questions`
WHERE
  tag!='undefined'
GROUP BY
  tag
ORDER BY
  1 DESC

Multiple tags

  • No aggregate functions in the where clause

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    
     SELECT
      ARRAY_AGG(DISTINCT(tag)) AS ct,
      id,
      title
    FROM
      `jrjames83-1171.sampledata.top_questions`
    WHERE
      tag !='undefined'
    GROUP BY
      id,
      title
    HAVING
      ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) > 2
    LIMIT
      10
    

Use of HAVING

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
SELECT
  ARRAY_AGG(DISTINCT(tag)) AS ct,
  id,
  title
FROM
  `jrjames83-1171.sampledata.top_questions`
WHERE
  tag !='undefined'
GROUP BY
  id,
  title
HAVING
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) > 2
ORDER BY
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) desc
LIMIT
  10

Use of TRIM

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
SELECT
  ARRAY_AGG(DISTINCT(tag)) AS ct,
  id,
  title
FROM
  `jrjames83-1171.sampledata.top_questions`
WHERE
  tag !='undefined' and TRIM(lower(tag)) LIKE "%python%"
GROUP BY
  id,
  title
HAVING
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) > 2
ORDER BY
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) desc
LIMIT
  10

HAVING

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
SELECT
  q1.id,
  q1.title,
  tags
FROM (
  SELECT
    ARRAY_AGG(DISTINCT(tag)) AS tags,
    id,
    title,
  FROM
    `jrjames83-1171.sampledata.top_questions`
  WHERE
    tag !='undefined'
    AND TRIM(LOWER(tag)) LIKE "%python%"
  GROUP BY
    id,
    title ) q1
WHERE
  ARRAY_LENGTH(q1.tags) > 2
ORDER BY
  ARRAY_LENGTH(q1.tags) DESC
LIMIT
  10

Organize subqueries as tables

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
WITH
  q1 AS (
  SELECT
    ARRAY_AGG(DISTINCT(tag)) AS tags,
    id,
    title,
  FROM
    `jrjames83-1171.sampledata.top_questions`
  WHERE
    tag !='undefined'
    AND TRIM(LOWER(tag)) LIKE "%python%"
  GROUP BY
    id,
    title )
SELECT
  q1.id,
  q1.title,
  tags
FROM  q1
  WHERE
    ARRAY_LENGTH(q1.tags) > 2
  ORDER BY
    ARRAY_LENGTH(q1.tags) DESC
  LIMIT
    10

Give me all the views from the title that contains Python as tag

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
SELECT
  SUM(q2.total_views)
FROM (
  SELECT
    DISTINCT(id),
    total_views
  FROM
    `jrjames83-1171.sampledata.top_questions`
  WHERE
    id IN(
    SELECT
      DISTINCT(id) AS id
    FROM
      `jrjames83-1171.sampledata.top_questions`
    WHERE
      REGEXP_CONTAINS(title, "python"))) q2

For each tag, get me all the questions that are associated with the tag

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT
  tag,
  ARRAY_AGG(DISTINCT(title))
FROM
  `jrjames83-1171.sampledata.top_questions`
GROUP BY
  tag
LIMIT
  10

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT
  tag,
  ARRAY_AGG(DISTINCT(title)) AS titles
FROM
  `jrjames83-1171.sampledata.top_questions`
GROUP BY
  tag
ORDER BY
  ARRAY_LENGTH(titles) DESC
LIMIT
  100
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT
  tag,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
  `jrjames83-1171.sampledata.top_questions`
GROUP BY
  tag
ORDER BY
  ct DESC
LIMIT
  100
1
2
3
4
5
6
7
8
9
SELECT
  tag,
  ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
  `jrjames83-1171.sampledata.top_questions`
GROUP BY
  tag
LIMIT
  10
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
SELECT
  SUM(p1.ct)/COUNT(p1.tag)
FROM (
  SELECT
    tag,
    ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
  FROM
    `jrjames83-1171.sampledata.top_questions`
  GROUP BY
    tag)p1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
 WITH
  stats AS (
  SELECT
    tag,
    ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
  FROM
    `jrjames83-1171.sampledata.top_questions`
  GROUP BY
    tag)
SELECT
  SUM(ct)/COUNT(tag),
  MIN(ct),
  MAX(ct)
FROM
  stats
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
WITH
  stats AS (
  SELECT
    tag,
    ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
  FROM
    `jrjames83-1171.sampledata.top_questions`
  GROUP BY
    tag)
SELECT
  *
FROM
  stats
WHERE
  ct > (
  SELECT
    AVG(ct)
  FROM
    stats)

THE CASE STATEMENT

How to use a CASE statement

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
WITH
  base_table AS (
  SELECT
    DISTINCT(title),
    CASE
      WHEN title LIKE "%python%" THEN 'python'
      WHEN title LIKE "%sql%" THEN 'sql'
      WHEN title LIKE "%javasccript%" THEN 'javascript'
      WHEN title LIKE "%java%" THEN 'java'
  END
    AS
    LANGUAGE
  FROM
    `jrjames83-1171.sampledata.top_questions`)
SELECT
  COUNT(*),
  LANGUAGE
FROM
  base_table
GROUP BY
  2
ORDER BY
  2 DESC

CASE statement revised

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
WITH
  base_table AS (
  SELECT
    DISTINCT(id),
    CASE
      WHEN (title LIKE "%python%" OR tag LIKE "%python%") THEN 'python'
      WHEN (title LIKE "%sql%"
      OR tag LIKE "%sql%") THEN 'sql'
      WHEN (title LIKE "%javascript%" OR tag LIKE "%sql%" ) THEN 'javascript'
      WHEN (title LIKE "%java%"
      OR tag LIKE "%java%") THEN 'java'
    ELSE
    NULL
  END
    AS
    LANGUAGE
  FROM
    `jrjames83-1171.sampledata.top_questions`)
SELECT
  COUNT(*),
  LANGUAGE
FROM
  base_table
GROUP BY
  2
ORDER BY
  1 DESC

Combining data elements and writing a case statement

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
WITH
  base_table(
  SELECT
    id,
    title,
    ARRAY_TO_STRING(ARRAY_AGG(DISTINCT tag), " ")
  FROM
    `jrjames83-1171.sampledata.top_questions`
  GROUP BY
    id,
    title)

JOINS

INNER JOIN

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
WITH
  tab1 AS (
  SELECT
    num1
  FROM
    UNNEST(GENERATE_ARRAY(10,20)) AS num1),
  tab2 AS (
  SELECT
    num2
  FROM
    UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
  num1,
  num2
FROM
  tab1
JOIN
  tab2
ON
  tab1.num1=tab2.num2

LEFT/RIGHT JOIN

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
WITH
  tab1 AS (
  SELECT
    num1
  FROM
    UNNEST(GENERATE_ARRAY(10,20)) AS num1),
  tab2 AS (
  SELECT
    num2
  FROM
    UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
  num1,
  num2
FROM
  tab1
LEFT JOIN
  tab2
ON
  tab1.num1=tab2.num2
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
WITH
  tab1 AS (
  SELECT
    num1
  FROM
    UNNEST(GENERATE_ARRAY(10,20)) AS num1),
  tab2 AS (
  SELECT
    num2
  FROM
    UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
  num1,
  num2
FROM
  tab1
RIGHT JOIN
  tab2
ON
  tab1.num1=tab2.num2

FULL OUTER JOIN

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
WITH
  tab1 AS (
  SELECT
    num1
  FROM
    UNNEST(GENERATE_ARRAY(10,20)) AS num1),
  tab2 AS (
  SELECT
    num2
  FROM
    UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
  num1,
  num2
FROM
  tab1
FULL OUTER JOIN
  tab2
ON
  tab1.num1=tab2.num2

City with most orders

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
WITH
  master AS (
  SELECT
    p1.order_id AS x,
    p2.customer_city AS y
  FROM
    `jrjames83-1171.sampledata.orders` AS p1
  JOIN
    `jrjames83-1171.sampledata.customers` AS p2
  ON
    p1.customer_id = p2.customer_id)
SELECT
  COUNT(x),
  y
FROM
  master
GROUP BY
  y
ORDER BY
  1 DESC

Max orders by customer

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
WITH
  master AS (
  SELECT
    p1.order_id AS x,
    p2.customer_unique_id AS y
  FROM
    `jrjames83-1171.sampledata.orders` AS p1
  JOIN
    `jrjames83-1171.sampledata.customers` AS p2
  ON
    p1.customer_id = p2.customer_id)
SELECT
  COUNT(x),
  y
FROM
  master
GROUP BY
  y
ORDER BY
  1 DESC

Use of ROW_NUMBER and PARTITION

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
 SELECT
  p1.order_id AS x1,
  p2.customer_unique_id AS x2,
  p1.order_purchase_timestamp AS y,
  ROW_NUMBER() OVER (PARTITION BY p2.customer_unique_id ORDER BY p1.order_purchase_timestamp)
FROM
  `jrjames83-1171.sampledata.orders` AS p1
JOIN
  `jrjames83-1171.sampledata.customers` AS p2
ON
  p1.customer_id = p2.customer_id
ORDER BY
  1,
  2

Customers acquired every month

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
WITH
  base_table AS(
  SELECT
    p2.customer_unique_id AS x1,
    DATE_TRUNC(p1.order_purchase_timestamp,month) AS y,
    ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(p1.order_purchase_timestamp,month)
    ORDER BY
      p1.order_purchase_timestamp) AS rn
  FROM
    `jrjames83-1171.sampledata.orders` AS p1
  JOIN
    `jrjames83-1171.sampledata.customers` AS p2
  ON
    p1.customer_id = p2.customer_id)
SELECT
  MAX(rn),
  y
FROM
  base_table
GROUP BY
  y
ORDER BY
  2

Everything is a table and you can use them in clever ways

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
WITH
  customer_order_ranks AS (
  SELECT
    c.customer_unique_id,
    o.order_purchase_timestamp,
    ROW_NUMBER() OVER(PARTITION BY c.customer_unique_id) AS customer_order_number
  FROM
    `jrjames83-1171.sampledata.customers` AS c
  JOIN
    `jrjames83-1171.sampledata.orders` AS o
  ON
    c.customer_id = o.customer_id
  ORDER BY
    1,
    2),
  exclusions AS (
  SELECT
    customer_unique_id,
    MAX(customer_order_number)
  FROM
    customer_order_ranks
  GROUP BY
    1
  HAVING
    MAX(customer_order_number) =1 )
SELECT
  customer_unique_id,
  customer_order_number
FROM
  customer_order_ranks
WHERE
  customer_unique_id NOT IN (
  SELECT
    customer_unique_id
  FROM
    exclusions)

Average days between orders

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
WITH
  customer_order_ranks AS (
  SELECT
    c.customer_unique_id,
    o.order_purchase_timestamp,
    o.order_id,
    ROW_NUMBER() OVER(PARTITION BY c.customer_unique_id ORDER BY o.order_purchase_timestamp) AS customer_order_number
  FROM
    `jrjames83-1171.sampledata.customers` AS c
  JOIN
    `jrjames83-1171.sampledata.orders` AS o
  ON
    c.customer_id = o.customer_id
  ORDER BY
    1,
    2),
  exclusions AS (
  SELECT
    customer_unique_id,
    MAX(customer_order_number)
  FROM
    customer_order_ranks
  GROUP BY
    1
  HAVING
    MAX(customer_order_number) =1 ),
  inclusions AS(
  SELECT
    customer_unique_id,
    customer_order_number,
    order_purchase_timestamp
  FROM
    customer_order_ranks
  WHERE
    customer_unique_id NOT IN (
    SELECT
      customer_unique_id
    FROM
      exclusions)),
  reldata AS(
  SELECT
    customer_unique_id,
    customer_order_number,
    order_purchase_timestamp,
    LAG(order_purchase_timestamp) OVER(PARTITION BY customer_unique_id ORDER BY customer_order_number ) AS prev_purchase
  FROM
    inclusions
  ORDER BY
    customer_unique_id)
SELECT
  AVG(DATE_DIFF(order_purchase_timestamp,prev_purchase, day))
FROM
  reldata
WHERE
  prev_purchase IS NOT NULL

Using CASE

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67

WITH
  took_offer AS (
  SELECT
    1 AS customer_id,
    34.99 AS spend
  UNION ALL
  SELECT
    2,
    21.99
  UNION ALL
  SELECT
    3,
    179.0
  UNION ALL
  SELECT
    4,
    0.99
  UNION ALL
  SELECT
    5,
    1299.9 ),
  loyalty AS (
  SELECT
    3 AS customer_id_1,
    1 AS status
  UNION ALL
  SELECT
    4,
    1,
  UNION ALL
  SELECT
    8,
    2
  UNION ALL
  SELECT
    12,
    1
  UNION ALL
  SELECT
    10,
    2 ),
  results AS (
  SELECT
    *
  FROM
    took_offer AS t
  FULL OUTER JOIN
    loyalty AS j
  ON
    t.customer_id = j.customer_id_1 )
SELECT
  customer_id,
  spend,
  customer_id_1 status,
  CASE
    WHEN customer_id IS NOT NULL AND customer_id_1 IS NOT NULL THEN "both"
    WHEN customer_id IS NULL
  AND customer_id_1 IS NOT NULL THEN "offer only"
    WHEN customer_id IS NOT NULL AND customer_id_1 IS NULL THEN "loyalty only"
  ELSE
  "na"
END
FROM
  results

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
SELECT
  EXTRACT(hour
  FROM
    o.order_purchase_timestamp ) AS hr,
  ROUND(SUM(p.payment_value),2) AS sales,
  SUM(SUM(p.payment_value)) OVER () AS total_sales
FROM
  `jrjames83-1171.sampledata.orders` o
LEFT JOIN
  `jrjames83-1171.sampledata.order_payments` p
ON
  o.order_id = p.order_id
GROUP BY
  1
ORDER BY
  1
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
WITH
  base_table AS(
  SELECT
    EXTRACT(hour
    FROM
      o.order_purchase_timestamp ) AS hr,
    ROUND(SUM(p.payment_value),2) AS sales,
    SUM(SUM(p.payment_value)) OVER () AS total_sales
  FROM
    `jrjames83-1171.sampledata.orders` o
  LEFT JOIN
    `jrjames83-1171.sampledata.order_payments` p
  ON
    o.order_id = p.order_id
  GROUP BY
    1
  ORDER BY
    1 )
SELECT
  SUM(sales),
  (CASE
      WHEN hr BETWEEN 0 AND 5 THEN "morning"
      WHEN hr BETWEEN 6
    AND 11 THEN "afternoon"
      WHEN hr BETWEEN 12 AND 16 THEN "evening"
      WHEN hr BETWEEN 17
    AND 24 THEN "late evening"
    ELSE
    "na"
  END
    ) AS timeslot
FROM
  base_table
WHERE
  base_table.hr IS NOT NULL
GROUP BY
  timeslot

Use case statements to create conditional summaries

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
WITH
  base_table AS(
  SELECT
    EXTRACT(hour
    FROM
      o.order_purchase_timestamp ) AS hr,
    ROUND(SUM(p.payment_value),2) AS sales,
    SUM(SUM(p.payment_value)) OVER () AS total_sales
  FROM
    `jrjames83-1171.sampledata.orders` o
  LEFT JOIN
    `jrjames83-1171.sampledata.order_payments` p
  ON
    o.order_id = p.order_id
  GROUP BY
    1
  ORDER BY
    1 )
SELECT
  SUM(CASE
      WHEN hr BETWEEN 0 AND 5 THEN sales
    ELSE
    0
  END
    ),
  SUM(CASE
      WHEN hr BETWEEN 7 AND 12 THEN sales
    ELSE
    0
  END
    )
FROM
  base_table

UNNEST and Correlated Subqueries

Correlated subquery

  • This is useful for creating unnest record field
  • A correlated subquery sample
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
WITH
  base_table AS (
  SELECT
    index,
    dollars
  FROM
    UNNEST(GENERATE_ARRAY(1,5)) AS dollars
  WITH
  OFFSET
    AS index)
SELECT
  bt1.*,
  (
  SELECT
    SUM(bt2.dollars)
  FROM
    base_table bt2
  WHERE
    bt2.index <= bt1.index ) AS cumsum
FROM
  base_table bt1

Moving Averages

1
2
3
4
5
SELECT
  *,
  AVG(dollars) OVER (ORDER BY index ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_ma
FROM
  base_table

Checking for a specific condition

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
WITH
  base_table AS (
  SELECT
    index,
    dollars
  FROM
    UNNEST(GENERATE_ARRAY(1,5)) AS dollars
  WITH
  OFFSET
    AS index),
  results AS (
  SELECT
    dollars,
    SUM(dollars) OVER (ORDER BY index ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev
  FROM
    base_table )
SELECT
  COUNT(*)
FROM
  results
WHERE
  dollars - prev <0
  AND prev IS NOT NULL

Another solution

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
WITH
  base_table AS (
  SELECT
    index,
    dollars
  FROM
    UNNEST(GENERATE_ARRAY(1,5)) AS dollars
  WITH
  OFFSET
    AS index)
SELECT
  DISTINCT check
FROM (
  SELECT
    (dollars > LAG(dollars) OVER (ORDER BY index) ) AS check
  FROM
    base_table)
WHERE
  check IS NOT NULL

Write using a correlated subquery

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
WITH
  base_table AS (
  SELECT
    index,
    dollars
  FROM
    UNNEST(GENERATE_ARRAY(1,5)) AS dollars
  WITH
  OFFSET
    AS index)
SELECT
  bt1.dollars,
  bt1.index,
  (
  SELECT
    MAX(bt2.dollars)
  FROM
    base_table bt2
  WHERE
    bt1.index > bt2.index)
FROM
  base_table bt1

DVD Rental case study - Revenue from the first order

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
WITH
  first_amount AS (
  SELECT
    customer_id,
    amount
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
    FROM
      `jrjames83-1171.sampledata.payments`)
  WHERE
    index = 1 ),
  summary_stat AS (
  SELECT
    customer_id,
    SUM(amount) AS total_revenue,
    MIN(payment_date)
  FROM
    `jrjames83-1171.sampledata.payments`
  GROUP BY
    customer_id)
SELECT
  *
FROM
  summary_stat a
JOIN
  first_amount b
ON
  b.customer_id = a.customer_id

DVD Rental case study - Revenue from the first order - Act 2

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
WITH
  first_amount AS (
  SELECT
    customer_id,
    amount AS first_amt
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
    FROM
      `jrjames83-1171.sampledata.payments`)
  WHERE
    index = 1 ),
  summary_stat AS (
  SELECT
    customer_id,
    SUM(amount) AS total_revenue,
    MIN(payment_date) AS m_pay_date,
  FROM
    `jrjames83-1171.sampledata.payments`
  GROUP BY
    customer_id)
SELECT
  a.customer_id,
  total_revenue,
  first_amt
FROM
  summary_stat a
JOIN
  first_amount b
ON
  b.customer_id = a.customer_id
ORDER BY
  1

LTVP

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
WITH
  first_amount AS (
  SELECT
    customer_id,
    amount AS first_amt
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
    FROM
      `jrjames83-1171.sampledata.payments`)
  WHERE
    index = 1 ),
  summary_stat AS (
  SELECT
    customer_id,
    SUM(amount) AS total_revenue,
    MIN(payment_date) AS m_pay_date
  FROM
    `jrjames83-1171.sampledata.payments`
  GROUP BY
    customer_id),
  summary_so_far AS(
  SELECT
    a.customer_id,
    m_pay_date,
    total_revenue,
    first_amt,
    first_amt/total_revenue*100 AS first_as_pct_tot_rev
  FROM
    summary_stat a
  JOIN
    first_amount b
  ON
    b.customer_id = a.customer_id
  ORDER BY
    1)
SELECT
  sf.*,
  (
  SELECT
    sum (p2.amount)
  FROM
    `jrjames83-1171.sampledata.payments` p2
  WHERE
    p2.customer_id =sf.customer_id
    AND DATE(p2.payment_date) BETWEEN DATE(sf.m_pay_date)
    AND DATE(sf.m_pay_date)+90 ) AS customer_tv_first_30days
FROM
  summary_so_far sf

LTVP using standard functions

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
WITH
  first_amount AS (
  SELECT
    customer_id,
    amount AS first_amt
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
    FROM
      `jrjames83-1171.sampledata.payments`)
  WHERE
    index = 1 ),
  summary_stat AS (
  SELECT
    customer_id,
    SUM(amount) AS total_revenue,
    MIN(payment_date) AS m_pay_date
  FROM
    `jrjames83-1171.sampledata.payments`
  GROUP BY
    customer_id),
  summary_so_far AS(
  SELECT
    a.customer_id,
    m_pay_date,
    total_revenue,
    first_amt,
    first_amt/total_revenue*100 AS first_as_pct_tot_rev
  FROM
    summary_stat a
  JOIN
    first_amount b
  ON
    b.customer_id = a.customer_id
  ORDER BY
    1)
SELECT
  sf.*,
  (
  SELECT
    sum (p2.amount)
  FROM
    `jrjames83-1171.sampledata.payments` p2
  WHERE
    p2.customer_id =sf.customer_id
    AND DATE(p2.payment_date) BETWEEN DATE(sf.m_pay_date)
    AND DATE_ADD(DATE(sf.m_pay_date), INTERVAL 30 day) ) AS customer_tv_first_30days
FROM
  summary_so_far sf

Moving Average Case Study

Attempt 1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
WITH
  bt AS (
  SELECT
    Date,
    close,
    AVG(close) OVER(ORDER BY Date ROWS BETWEEN 50 PRECEDING AND CURRENT ROW ) AS ma_50,
    AVG(close) OVER(ORDER BY Date ROWS BETWEEN 200 PRECEDING AND CURRENT ROW ) AS ma_200
  FROM
    `jrjames83-1171.sampledata.stock_prices`
  ORDER BY
    Date)
SELECT
  *,
  (CASE
      WHEN ma_50 > ma_200 THEN FALSE
    ELSE
    TRUE
  END
    ) AS status
FROM
  bt

Attempt 2

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
WITH
  bt AS (
  SELECT
    *,
    ROW_NUMBER() OVER(ORDER BY Date) AS index
  FROM
    `jrjames83-1171.sampledata.stock_prices`
  ORDER BY
    Date),
  bt1 AS (
  SELECT
    Date,
    close,
    AVG(close) OVER(ORDER BY index RANGE BETWEEN 49 PRECEDING AND CURRENT ROW ) AS ma_50,
    AVG(close) OVER(ORDER BY index RANGE BETWEEN 199 PRECEDING AND CURRENT ROW ) AS ma_200
  FROM
    bt
  ORDER BY
    Date)
SELECT
  *,
  (CASE
      WHEN ma_50 > ma_200 THEN FALSE
    ELSE
    TRUE
  END
    ) AS status
FROM
  bt1

Attempt 3

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
WITH
  bt AS (
  SELECT
    *,
    ROW_NUMBER() OVER(ORDER BY Date) AS index
  FROM
    `jrjames83-1171.sampledata.stock_prices`
  ORDER BY
    Date),
  bt1 AS (
  SELECT
    Date,
    close,
    index,
    AVG(close) OVER(ORDER BY index RANGE BETWEEN 49 PRECEDING AND CURRENT ROW ) AS ma_50,
    AVG(close) OVER(ORDER BY index RANGE BETWEEN 199 PRECEDING AND CURRENT ROW ) AS ma_200
  FROM
    bt
  ORDER BY
    Date),
  signal AS (
  SELECT
    *,
  IF
    (ma_50 > ma_200,
      "buy",
      "sell") AS status
  FROM
    bt1),
  signal_stat AS (
  SELECT
    *,
    LAG(status) OVER(ORDER BY index) AS lagged
  FROM
    signal
  ORDER BY
    index)
SELECT
  *,
IF
  (lagged!=status,
    "y",
    "n") AS crossover
FROM
  signal_stat

More Complex JOINS and Partitioning

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
WITH
  bt AS (
  SELECT
    customer_id,
    rental_id,
    SUM(amount) AS rev
  FROM
    `jrjames83-1171.sampledata.payments`
  GROUP BY
    customer_id,
    rental_id
  ORDER BY
    customer_id,
    rental_id,
    rev DESC),
  bt2 AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY rev DESC ) AS top
  FROM
    bt
  ORDER BY
    customer_id,
    top),
  bt3 AS (
  SELECT
    bt2.*,
    rent.inventory_id
  FROM
    bt2
  JOIN
    `jrjames83-1171.sampledata.rental` AS rent
  ON
    bt2.customer_id = rent.customer_id
  WHERE
    bt2.top IN (1,
      2)),
  bt4 AS (
  SELECT
    bt3.*,
    inv.*
  FROM
    bt3
  LEFT JOIN
    `jrjames83-1171.sampledata.inventory`AS inv
  ON
    bt3.inventory_id = inv.inventory_id),
  bt5 AS (
  SELECT
    bt4.*,
    film.*
  FROM
    bt4
  LEFT JOIN
    `jrjames83-1171.sampledata.film`AS film
  ON
    bt4.film_id = film.film_id)
SELECT
  customer_id,
  rev,
  rating
FROM
  bt5
ORDER BY
  customer_id

Stackoverflow Text analysis

Most common word for each tag

  • Approach 1

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    
    WITH
      tab1 AS (
      SELECT
        tag,
        SPLIT(title," ") AS tokens
      FROM
        `jrjames83-1171.sampledata.top_questions`),
      tab2 AS (
      SELECT
        tag,
        ftoken
      FROM
        tab1
      CROSS JOIN
        UNNEST(tab1.tokens) AS ftoken),
      tab3 AS (
      SELECT
        tag,
        ftoken,
        COUNT(*) AS ct
      FROM
        tab2
      GROUP BY
        tag,
        ftoken),
      tab4 AS (
      SELECT
        tab3.*,
        ROW_NUMBER() OVER(PARTITION BY tag ORDER BY ct DESC) AS rank
      FROM
        tab3)
    SELECT
      *
    FROM
      tab4
    WHERE
      rank = 1
    LIMIT
      10
    
  • Approach 2

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    
    WITH
      tab1 AS (
      SELECT
        tag,
        SPLIT(title," ") AS tokens
      FROM
        `jrjames83-1171.sampledata.top_questions`),
      tab2 AS (
      SELECT
        DISTINCT tag,
        word
      FROM
        tab1,
        UNNEST(tokens) word ),
      tab3 AS (
      SELECT
        tag,
        word,
        COUNT(*) AS ct
      FROM
        tab2
      GROUP BY
        tag,
        word),
      tab4 AS (
      SELECT
        tab3.*,
        ROW_NUMBER() OVER(PARTITION BY tag ORDER BY ct DESC) AS rank
      FROM
        tab3)
    SELECT
      *
    FROM
      tab4
    WHERE
      rank = 1
    LIMIT
      10
    

Regex Operations

Advent of code 2020 - Day2(Part A)

  • Attempt 1

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    
    WITH
      bt AS (
      SELECT
        REGEXP_EXTRACT(occurrence_range, "(\\d*)-[\\d]*") AS sidx,
        REGEXP_EXTRACT(occurrence_range, "[\\d]*-(\\d*)") AS eidx,
        *
      FROM
        `jrjames83-1171.sampledata.aoc2017day2`),
      bt1 AS (
      SELECT
        sidx,
        eidx,
        character,
        datafield,
        COUNT(occs) AS occs,
      FROM
        bt,
        UNNEST(REGEXP_EXTRACT_ALL(bt.datafield, character)) occs
      GROUP BY
        1,
        2,
        3,
        4)
    SELECT
      sidx,
      eidx,
      character,
      datafield,
      (CASE
          WHEN (CAST(sidx AS int64) <= occs) AND (occs <= CAST(eidx AS int64)) THEN TRUE
        ELSE
        FALSE
      END
        ) AS status
    FROM
      bt1
    
  • Attempt 2

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    
    WITH
      bt AS (
      SELECT
        REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
        REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
        *
      FROM
        `jrjames83-1171.sampledata.aoc2017day2`),
      bt1 AS (
      SELECT
        sidx,
        eidx,
        character,
        datafield,
        COUNT(occs) AS occs,
      FROM
        bt,
        UNNEST(REGEXP_EXTRACT_ALL(bt.datafield, character)) occs
      GROUP BY
        1,
        2,
        3,
        4)
    SELECT
      sidx,
      eidx,
      character,
      datafield,
      (CASE
          WHEN (CAST(sidx AS int64) <= occs) AND (occs <= CAST(eidx AS int64)) THEN TRUE
        ELSE
        FALSE
      END
        ) AS status
    FROM
      bt1
    
  • Attempt-3

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    
    WITH
      bt AS (
      SELECT
        *,
        REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
        REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
        ARRAY_LENGTH(REGEXP_EXTRACT_ALL(datafield, character)) AS occs
      FROM
        `jrjames83-1171.sampledata.aoc2017day2`),
      bt1 AS (
      SELECT
        sidx,
        eidx,
        character,
        datafield,
        occs BETWEEN (CAST(sidx AS int64))
        AND (CAST(eidx AS int64)) AS status
      FROM
        bt)
    SELECT
      *
    FROM
      bt1
    WHERE
      status = TRUE
    

Advent of code 2020 - Day2(Part B)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
WITH
  bt AS (
  SELECT
    *,
    REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
    REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(datafield, character)) AS occs
  FROM
    `jrjames83-1171.sampledata.aoc2017day2`),
  bt1 AS (
  SELECT
    sidx,
    eidx,
    character,
    datafield,
    SUBSTR(datafield,CAST(sidx AS int64),1) AS leftw,
    SUBSTR(datafield,CAST(eidx AS int64),1) AS rightw
  FROM
    bt),
  bt2 AS (
  SELECT
    bt1.*,
    (CASE
        WHEN character = leftw AND character <> rightw THEN TRUE
        WHEN character = rightw
      AND character <> leftw THEN TRUE
      ELSE
      FALSE
    END
      ) AS status
  FROM
    bt1)
SELECT
  COUNT(*)
FROM
  bt2
WHERE
  status = TRUE

Takeaway

On a periodic basis, it is always better to revisit a concept/subject/idea, that you think you are extremely familiar with. You will be surprised to know, how much you do not know.