10/07/2025

# Mastering SQL Window Functions in PostgreSQL — A Comprehensive Guide with Real Examples


*💡 All examples in this guide can be run directly on https://sqliteonline.com — just switch the dialect to **PostgreSQL** and start experimenting!*


## Table of Contents

- Introduction
- 1. What Are Window Functions?
- 2. The OVER() Clause — Syntax Overview
- 3. Functions: Aggregating, Frames, Ranking, Filters
- 4. Aggregate Functions as Windows
  - 4.1. AVG
  - 4.2. Avoid Nested Queries
  - 4.3. With GROUP BY
- 5. RANGE vs ROWS vs GROUPS (Deep Dive)
- 6. Ranking Functions: ROW_NUMBER(), RANK() and DENSE_RANK(), Bucketing with NTILE(n)
- 7. Conditional Aggregation: the FILTER Clause


## Introduction


Window functions are one of the most powerful yet misunderstood features in modern SQL.  

They enable analytical computations — running totals, moving averages, rankings, and more — **without collapsing your dataset**.


Unlike traditional `GROUP BY` queries, which aggregate rows into a single result per group, window functions allow you to *calculate values across sets of rows* while keeping every row visible.


PostgreSQL has one of the most complete and standards-compliant implementations of window functions, supporting advanced features like:


- `FILTER` clauses for conditional aggregation  

- Frame clauses (`ROWS`, `RANGE`, `GROUPS`)  

- `EXCLUDE` options for fine control  

- `GROUPS` frames (since PostgreSQL 11)  


In this guide, we’ll walk through everything from fundamentals to advanced, real-world examples — using **PostgreSQL** syntax throughout.


## 1. What Are Window Functions?


A **window function** performs a calculation across a set of rows that are somehow related to the current row. The term “window” refers to the set of rows that the function can see.


They’re extremely useful for analytics, time-series reporting, and ranking data — all without using subqueries or complex joins.


In PostgreSQL, window functions are only allowed in the:

- `SELECT` clause  

- `ORDER BY` clause  


They cannot appear inside `WHERE`, `JOIN`, or `GROUP BY`.


## 2. The OVER() Clause — Syntax Overview


Here’s the complete structure of a window function:



```sql
<function>([expr])
  [ FILTER (WHERE <predicate>) ]
  OVER (
    [ PARTITION BY <expr>, ... ]
    [ ORDER BY <expr> [ASC|DESC] [NULLS FIRST|LAST], ... ]
    [ { ROWS | RANGE | GROUPS } BETWEEN <frame_start> AND <frame_end> ]
    [ EXCLUDE { CURRENT ROW | TIES | GROUP | NO OTHERS } ]
  );
```



### Explanation of Each Part


- **`PARTITION BY`**: Defines logical partitions (like “groups”) within the dataset.

- **`ORDER BY`**: Defines the ordering of rows *within each partition*.

- **`ROWS | RANGE | GROUPS`**: Specifies the frame of rows that the function can “see”.

- **`FILTER`**: Allows conditional aggregation within the window (SQL standard; supported in PostgreSQL since 9.4).

- **`EXCLUDE`**: Removes certain peers (like the current row) from the frame (PostgreSQL ≥ 11).


💡 **An empty `OVER()` clause** means the function is calculated over the entire result set.
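As a minimal sketch of that last point: the query below (run here through Python's bundled SQLite, ≥ 3.25, which shares this window semantics with PostgreSQL; table and values are hypothetical) attaches the grand total to every row because `OVER ()` spans the entire result set.

```python
import sqlite3

# In-memory database with a tiny, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10), ("north", 20), ("south", 30)])

# SUM(amount) OVER () computes one grand total over the whole result set,
# repeated on every row -- no rows are collapsed.
rows = conn.execute("""
    SELECT region, amount, SUM(amount) OVER () AS grand_total
    FROM sales
""").fetchall()

for region, amount, grand_total in rows:
    print(region, amount, grand_total)  # grand_total is 60 on every row
```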


## 3. Functions


### Aggregating

Nearly all aggregate functions can be used as window functions by adding an OVER (...) clause. Here are the ones you’ll use most, grouped by type:

Core numeric aggregates

  • SUM(expr)

  • AVG(expr)

  • MIN(expr), MAX(expr)

  • COUNT(*), COUNT(expr) — note that COUNT(DISTINCT expr) is *not* valid as a window function; PostgreSQL does not allow DISTINCT inside window aggregate calls

Boolean aggregates

  • BOOL_AND(expr) / EVERY(expr) (synonyms)

  • BOOL_OR(expr)

Statistical aggregates

  • STDDEV(), STDDEV_SAMP(), STDDEV_POP()

  • VARIANCE(), VAR_SAMP(), VAR_POP()

  • CORR(y, x), COVAR_SAMP(y, x), COVAR_POP(y, x)

  • Linear regression family: REGR_SLOPE(y, x), REGR_INTERCEPT(y, x),
    REGR_R2(y, x), REGR_COUNT(y, x), REGR_AVGX(y, x), REGR_AVGY(y, x),
    REGR_SXX(y, x), REGR_SXY(y, x), REGR_SYY(y, x)

Collectors / concatenation

  • ARRAY_AGG(expr [ORDER BY ...])

  • STRING_AGG(text, delimiter [ORDER BY ...])

  • JSON_AGG(expr [ORDER BY ...]), JSONB_AGG(expr [ORDER BY ...])

  • XMLAGG(xml [ORDER BY ...])

Bitwise (for integer and bit-string types)

  • BIT_AND(expr), BIT_OR(expr)



### Frames

Window frames define the subset of rows included in a calculation. PostgreSQL supports three frame types:

| Frame | Behavior | Use Case |
|--------|-----------|----------|
| **ROWS** | Counts a specific number of preceding/following rows | Moving averages over fixed row counts |
| **RANGE** | Includes rows based on value range of ORDER BY | Time or value-based rolling windows |
| **GROUPS** | Groups peers with the same ORDER BY value | Handling ties (PostgreSQL ≥ 11) |


### Ranking

Ranking functions are used to assign sequential numbers to rows.

#### Common Ranking Functions in PostgreSQL

| Function | Description |
|-----------|--------------|
| `ROW_NUMBER()` | Assigns a unique rank (no ties). |
| `RANK()` | Gives same rank to ties, leaving gaps. |
| `DENSE_RANK()` | Same rank for ties, no gaps. |
| `NTILE(n)` | Divides the result set into `n` buckets (quartiles, deciles, etc.). |
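To see how the four functions differ on tied values, here is a small sketch run through Python's bundled SQLite (≥ 3.25), which follows the same ranking semantics as PostgreSQL; the players and scores are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INT)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 50), ("b", 40), ("c", 40), ("d", 30)])

# b and c tie at 40: RANK gives both 2 then skips to 4;
# DENSE_RANK gives both 2 then 3; ROW_NUMBER never ties.
rows = conn.execute("""
    SELECT player, score,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk,
           NTILE(2)     OVER (ORDER BY score DESC) AS half
    FROM scores
""").fetchall()

for r in rows:
    print(r)
```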


### Filters

The `FILTER` clause applies conditional aggregation within a window, and the `EXCLUDE` clause removes the current row or its peers from the frame. Both are covered in detail below.



## 4. Aggregate Functions as Windows


Example: find the running minimum and maximum closing price per ticker.

Suppose we have a table:

```sql
CREATE TABLE StockPrices (
    ticker VARCHAR(10),
    trade_date DATE,
    closing_price DECIMAL(10,2)
);

INSERT INTO StockPrices VALUES
('AAPL', '2023-01-04', 130),
('AAPL', '2023-01-01', 125),
('AAPL', '2023-01-03', 140),
('AAPL', '2023-01-04', 135),
('MSFT', '2023-01-01', 250),
('MSFT', '2023-01-02', 245),
('MSFT', '2023-01-03', 260),
('MSFT', '2023-01-04', 255);

SELECT
    ticker,
    trade_date,
    closing_price,
    MIN(closing_price) OVER w AS min_so_far,
    MAX(closing_price) OVER w AS max_so_far
FROM StockPrices
WINDOW w AS (
    PARTITION BY ticker
    ORDER BY trade_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
);
```




This query uses window functions to perform calculations over a set of rows while still returning all of the original rows. Let's break down the SQL script and its output step by step.

1. CREATE TABLE and INSERT Statements

  • CREATE TABLE StockPrices (...): This part creates a table named StockPrices to hold our data. It has three columns:

    • ticker: The stock symbol (e.g., 'AAPL').

    • trade_date: The date of the stock price.

    • closing_price: The price of the stock at the end of the day.

  • INSERT INTO StockPrices VALUES (...): This section populates the table with eight rows of sample data: four records for Apple ('AAPL') and four for Microsoft ('MSFT') for various dates in January 2023.


2. The SELECT Statement with a Window Function

This is the core of what you're asking about. The goal of this query is to show, for each stock on each day, what the minimum and maximum price has been so far for that specific stock.

Let's analyze the key clauses:

SELECT ticker, trade_date, closing_price, ...

This selects the original columns from the table.

MIN(closing_price) OVER w AS min_so_far

MAX(closing_price) OVER w AS max_so_far

These are the window functions.

  • Instead of calculating a single MIN or MAX for the entire table (like a regular aggregate function would), the OVER w clause tells the database to calculate it over a specific "window" or group of rows.

  • The window w is defined in the WINDOW clause at the bottom.

WINDOW w AS (...)

This is a named window clause. It's a clean way to define the rules for the window once and reuse it for both the MIN and MAX functions.

Let's look at the rules inside the WINDOW definition:

  1. PARTITION BY ticker This is the most important part. It divides the data into separate groups, or "partitions," based on the ticker. All 'AAPL' rows are in one partition, and all 'MSFT' rows are in another. The window function will reset its calculation for each new partition. This ensures that the calculation for 'AAPL' never considers the prices from 'MSFT'.

  2. ORDER BY trade_date Within each partition (i.e., for each ticker), this sorts the rows by trade_date from earliest to latest. This is crucial for creating a "running" or "cumulative" calculation.

  3. ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW This is called the "frame clause." It defines exactly which rows within the sorted partition to include in the calculation for each row.

    • UNBOUNDED PRECEDING: Start from the very first row of the partition.

    • CURRENT ROW: End with the row that is currently being processed.

    • In simple terms: "For the row I'm on right now, look at it and all the rows that came before it (within this stock's partition) to find the min/max value."


3. Tying It to the Output (The Walkthrough)

Let's trace the calculation for the AAPL ticker to see how the output is generated.

Partition: ticker = 'AAPL' (The data is first sorted by trade_date)

| ticker | trade_date | closing_price | Window frame (prices considered) | min_so_far | max_so_far |
|--------|------------|---------------|----------------------------------|------------|------------|
| AAPL | 2023-01-01 | 125.00 | {125.00} | 125.00 | 125.00 |
| AAPL | 2023-01-03 | 140.00 | {125.00, 140.00} | 125.00 | 140.00 |
| AAPL | 2023-01-04 | 130.00 | {125.00, 140.00, 130.00} | 125.00 | 140.00 |
| AAPL | 2023-01-04 | 135.00 | {125.00, 140.00, 130.00, 135.00} | 125.00 | 140.00 |
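The walkthrough above can be reproduced end to end. The sketch below runs the same query through Python's bundled SQLite (≥ 3.25), which also supports named `WINDOW` clauses and `ROWS` frames, using only the AAPL rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StockPrices (ticker TEXT, trade_date TEXT, closing_price REAL)")
conn.executemany("INSERT INTO StockPrices VALUES (?, ?, ?)", [
    ("AAPL", "2023-01-04", 130), ("AAPL", "2023-01-01", 125),
    ("AAPL", "2023-01-03", 140), ("AAPL", "2023-01-04", 135),
])

# Running min/max per ticker: frame is "first row of the partition
# through the current row", exactly as in the article's query.
rows = conn.execute("""
    SELECT trade_date, closing_price,
           MIN(closing_price) OVER w AS min_so_far,
           MAX(closing_price) OVER w AS max_so_far
    FROM StockPrices
    WINDOW w AS (PARTITION BY ticker ORDER BY trade_date
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    ORDER BY trade_date
""").fetchall()

for r in rows:
    print(r)
# The running minimum stays at 125; the maximum reaches 140 on 2023-01-03.
```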



## 4.1. Aggregate Functions as Windows : AVG


This script creates a table called routine_checkups and populates it with data for several cats and dogs, which is perfect for demonstrating the PARTITION BY species functionality.


### Sample Dataset

```sql
-- Create the table to store animal checkup data
CREATE TABLE routine_checkups (
    species VARCHAR(50),
    name VARCHAR(50),
    checkup_time TIMESTAMP,
    heart_rate INTEGER
);

-- Insert sample data for multiple species
INSERT INTO routine_checkups (species, name, checkup_time, heart_rate) VALUES
-- Cat Data --
('Cat', 'Archie', '2016-10-13 08:43:00', 175),
('Cat', 'Archie', '2016-12-13 13:22:00', 180),
('Cat', 'Sadie', '2017-01-10 13:58:00', 188),
('Cat', 'Tigger', '2017-03-07 14:59:00', 180),
('Cat', 'Luna', '2017-05-20 10:15:00', 165),
-- Dog Data --
('Dog', 'Buddy', '2016-11-05 09:00:00', 100),
('Dog', 'Lucy', '2017-02-15 11:30:00', 110),
('Dog', 'Max', '2017-04-01 16:45:00', 95),
('Dog', 'Daisy', '2017-06-12 10:00:00', 105),
-- Lizard Data (for another partition) --
('Lizard', 'Spike', '2017-01-20 11:00:00', 80),
('Lizard', 'Leo', '2017-04-10 11:10:00', 84);

SELECT
    species,
    name,
    checkup_time,
    heart_rate,
    CAST(
        AVG(heart_rate)
        OVER (PARTITION BY species)
        AS DECIMAL(5, 2)
    ) AS species_average_heart_rate
FROM routine_checkups
ORDER BY species ASC, checkup_time ASC;
```




## Code Breakdown

Let's look at the key part of the SELECT query:

```sql
AVG(heart_rate) OVER (PARTITION BY species)
```


  • AVG(heart_rate): This is the aggregate function. We want to find the average of the heart_rate column.

  • OVER (...): This keyword tells SQL that we're using a window function. It means "perform this calculation over a specific window of data."

  • PARTITION BY species: This is the most important instruction. It tells the function to divide the data into separate "partitions" or groups based on the species column. All the 'Cat' rows go into one partition, all the 'Dog' rows into another, and so on. The AVG calculation is performed independently within each partition and then resets for the next one.

Unlike a GROUP BY clause, which would collapse the rows and only show you one result per species, a window function adds the result of the calculation as a new column to every row in the partition.


## How the Calculation Works

The database performs the following steps to generate your output:

  1. Partition the Data: It first groups all the rows by species.

    • Cat Partition 🐈: It gathers all 5 cat records.

    • Dog Partition 🐕: It gathers all 4 dog records.

    • Lizard Partition 🦎: It gathers all 2 lizard records.

  2. Calculate the Average for Each Partition:

    • For the Cat partition, it calculates: (175 + 180 + 188 + 180 + 165) / 5 = 177.60

    • For the Dog partition, it calculates: (100 + 110 + 95 + 105) / 4 = 102.50

    • For the Lizard partition, it calculates: (80 + 84) / 2 = 82.00

  3. Display the Result: It then displays all the original rows. For each row, it adds the calculated average for that row's partition into the new species_average_heart_rate column.

This is why every cat has 177.60 next to its name, and every dog has 102.50. You get to see both the individual's heart rate and how it compares to the average for its entire species in a single view.
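As a quick sanity check, the three per-partition averages can be verified directly in plain Python, using the heart rates from the sample inserts above:

```python
# Heart rates from the sample data, grouped by species (the partitions).
heart_rates = {
    "Cat":    [175, 180, 188, 180, 165],
    "Dog":    [100, 110, 95, 105],
    "Lizard": [80, 84],
}

# AVG(heart_rate) OVER (PARTITION BY species) computes one average per
# partition and attaches it to every row of that partition.
averages = {sp: round(sum(v) / len(v), 2) for sp, v in heart_rates.items()}
print(averages)  # {'Cat': 177.6, 'Dog': 102.5, 'Lizard': 82.0}
```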



## 4.2. Aggregate Functions as Windows : Avoid Nested Queries


Nested window function calls are not allowed.

🔎 **The Query**

```sql
SELECT
    species,
    name,
    checkup_time,
    heart_rate,
    EVERY(
        heart_rate >= AVG(heart_rate) OVER (PARTITION BY species)
    ) OVER (PARTITION BY species, name) AS consistently_at_or_above_average
FROM routine_checkups
ORDER BY species ASC, checkup_time ASC;
```



The error "window function calls cannot be nested" occurs because you're trying to use the result of one window function, AVG(heart_rate) OVER (...), inside another window function, EVERY(...) OVER (...), in a single step. SQL doesn't allow this direct nesting.

To fix this, you need to break the query into two steps using a Common Table Expression (CTE).


## Corrected Query with CTE

Here's the corrected version:

```sql
-- Step 1: Calculate the average heart rate for each species first.
WITH SpeciesAverages AS (
    SELECT
        species,
        name,
        checkup_time,
        heart_rate,
        AVG(heart_rate) OVER (PARTITION BY species) AS species_avg_heart_rate
    FROM
        routine_checkups
)
-- Step 2: Now use the pre-calculated average in the second window function.
SELECT
    species,
    name,
    checkup_time,
    heart_rate,
    EVERY(heart_rate >= species_avg_heart_rate) OVER (PARTITION BY species, name) AS consistently_at_or_above_average
FROM
    SpeciesAverages
ORDER BY
    species ASC, checkup_time ASC;
```




## Explanation of the Fix

  1. WITH SpeciesAverages AS (...): This defines a Common Table Expression (CTE), which is like a temporary, named result set. We name it SpeciesAverages.

  2. Step 1 (Inside the CTE): We first calculate the average heart rate for each species (species_avg_heart_rate) and include all the original columns. This resolves the first window function.

  3. Step 2 (Main SELECT): The final SELECT statement then runs on the result of our CTE (SpeciesAverages). Since species_avg_heart_rate is now a regular column, we can use it in the EVERY() window function without nesting. This function now checks if every checkup for a specific animal (PARTITION BY species, name) had a heart rate at or above the pre-calculated species average.




✅ What it means

  1. AVG(heart_rate) OVER (PARTITION BY species)

    • Computes the average heart rate for each species (just like in your previous example).

  2. heart_rate >= AVG(...)

    • For each row, checks if that individual’s heart rate is at or above the species average.

    • This gives a boolean result (TRUE or FALSE).

  3. EVERY(condition) OVER (PARTITION BY species, name)

    • The EVERY() aggregate checks if all rows in the partition are TRUE.

    • The partition here is defined by (species, name) → i.e., all checkups for the same animal.

    • If an animal’s heart rate was always at or above its species average across all checkups → returns TRUE. Otherwise → FALSE.

  4. Alias: consistently_at_or_above_average

    • Adds a column that tells you whether each animal is consistently performing at or above the species average.
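The two-step logic can be mirrored in plain Python to make the semantics concrete. The sketch below uses only the cat rows from the sample data; `EVERY(...)` corresponds to Python's `all()` over each animal's partition:

```python
from collections import defaultdict

# (species, name, heart_rate) -- the cat rows from the sample table.
checkups = [
    ("Cat", "Archie", 175), ("Cat", "Archie", 180),
    ("Cat", "Sadie", 188), ("Cat", "Tigger", 180), ("Cat", "Luna", 165),
]

# Step 1: AVG(...) OVER (PARTITION BY species) -> one average per species.
by_species = defaultdict(list)
for sp, _, hr in checkups:
    by_species[sp].append(hr)
species_avg = {sp: sum(v) / len(v) for sp, v in by_species.items()}

# Step 2: EVERY(...) OVER (PARTITION BY species, name) -> all() per animal.
by_animal = defaultdict(list)
for sp, name, hr in checkups:
    by_animal[(sp, name)].append(hr >= species_avg[sp])
consistent = {animal: all(flags) for animal, flags in by_animal.items()}

print(consistent)
# Archie fails (175 < 177.6); Sadie and Tigger pass; Luna fails.
```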

### Aggregate Functions as Windows: Multiple CTEs Usage




```sql
WITH species_average_heart_rates AS (
    SELECT
        species,
        name,
        checkup_time,
        heart_rate,
        CAST(AVG(heart_rate) OVER (PARTITION BY species) AS DECIMAL(5,2)) AS species_average_heart_rate
    FROM routine_checkups
),

with_consistently_at_or_above_average_indicator AS (
    SELECT
        species,
        name,
        checkup_time,
        heart_rate,
        species_average_heart_rate,
        EVERY(heart_rate >= species_average_heart_rate) OVER (PARTITION BY species, name) AS consistently_at_or_above_average
    FROM species_average_heart_rates
)

SELECT DISTINCT
    species,
    name,
    heart_rate,
    species_average_heart_rate
FROM with_consistently_at_or_above_average_indicator
WHERE consistently_at_or_above_average
ORDER BY
    species ASC,
    heart_rate DESC;
```




This is a sophisticated SQL query that uses multiple Common Table Expressions (CTEs) to build up to a final answer in stages.

The main goal is to find the individual animals whose heart rate, across all their checkups, was consistently at or above the average for their species.

Let's break down how it achieves this, step by step.


## Step 1: Calculate Species Averages (First CTE)

```sql
WITH species_average_heart_rates AS (
    SELECT
        species,
        name,
        checkup_time,
        heart_rate,
        CAST(AVG(heart_rate) OVER (PARTITION BY species) AS DECIMAL(5,2)) AS species_average_heart_rate
    FROM routine_checkups
)
```


  • What it does: This first block, named species_average_heart_rates, is exactly the same as the query in your previous question. It calculates the average heart rate for each species ('Cat', 'Dog', etc.) and adds this average as a new column (species_average_heart_rate) to every single row.

  • Result: A temporary table where each record has the animal's individual heart rate and the average for its entire species.


## Step 2: Create a Consistency Flag (Second CTE)

```sql
, with_consistently_at_or_above_average_indicator AS (
    SELECT
        *,
        EVERY(heart_rate >= species_average_heart_rate) OVER (PARTITION BY species, name) AS consistently_at_or_above_average
    FROM species_average_heart_rates
)
```


  • What it does: This second CTE, named with_consistently..., uses the result from Step 1 as its source (FROM species_average_heart_rates). Its job is to create a true/false flag for each animal.

  • Key Logic: EVERY(heart_rate >= species_average_heart_rate) OVER (PARTITION BY species, name)

    • PARTITION BY species, name: This groups the data by each unique animal (e.g., all of Archie the Cat's records are in one group, all of Sadie the Cat's in another).

    • EVERY(...): This is a window function that checks if a condition is true for every single row within that partition.

    • In plain English: For each animal, it asks: "Was the heart rate from every single one of your checkups greater than or equal to the species average?" If yes, the new consistently_at_or_above_average column will be true for all of that animal's records. If even one checkup was below average, it will be false.


## Step 3: Filter and Display the Final Results

```sql
SELECT DISTINCT
    species,
    name,
    heart_rate,
    species_average_heart_rate
FROM with_consistently_at_or_above_average_indicator
WHERE consistently_at_or_above_average
ORDER BY
    species ASC,
    heart_rate DESC;
```


  • FROM with_consistently...: This final query runs on the result from Step 2.

  • WHERE consistently_at_or_above_average: This is the crucial filter. It keeps only the rows where the flag we created in Step 2 is true. This effectively throws away all data for animals that had even one "bad" checkup.

  • SELECT DISTINCT ...: This selects the final columns to show. The DISTINCT keyword ensures that if a qualifying animal had multiple checkups with the exact same heart rate, it would only appear once in the final list.

  • ORDER BY ...: Finally, it sorts the results by species, and then by the highest heart rate first.

### Summary of the Output

The output shows only the animals who passed the WHERE clause. For example:

  • Sadie and Tigger (Cats) appear because their single recorded heart rates (188 and 180) were above the cat average of 177.60.

  • Archie (Cat) does not appear because one of his heart rates (175) was below the average, so the EVERY() function returned false for him.



## 4.3. Aggregate Functions as Windows : With Group By



## Sample Query

Here is a complete, runnable query with sample data so you can test it yourself.

```sql
-- Create a sample table
CREATE TABLE Animals (
    Species VARCHAR(50),
    Name VARCHAR(50)
);

-- Insert the sample data
INSERT INTO Animals (Species, Name) VALUES
('Cat', 'Felix'),
('Cat', 'Calvin'),
('Dog', 'Prince'),
('Dog', 'Cleo'),
('Dog', 'Abby'),
('Dog', 'Winston'),
('Dog', 'Murphy'),
('Dog', 'Leo'),
('Rabbit', 'Milly');

-- Combine GROUP BY with a window over the grouped result
SELECT
    Species,
    COUNT(*) AS "Group",
    SUM(COUNT(Name)) OVER () AS "Window"
FROM
    Animals
GROUP BY
    Species
ORDER BY
    Species;
```




This SQL query is an excellent example of how window functions can be combined with standard GROUP BY aggregations.

The query's goal is to show two different levels of counting in one result:

  1. The count of animals within each species.

  2. The total count of all animals in the entire table.


## How It Works: A Step-by-Step Breakdown

The database processes this query in a specific order, which is key to understanding the result.

### Step 1: The GROUP BY Aggregation

First, the database executes the standard aggregation part of the query: SELECT Species, COUNT(*) FROM Animals GROUP BY Species;

This groups all rows by the Species column and counts the number of animals in each group. This process produces an intermediate, temporary result that looks like this:

| Species | COUNT(*) |
|---------|----------|
| Cat | 2 |
| Dog | 6 |
| Rabbit | 1 |

This step is responsible for the Group column in your final output.


### Step 2: The Window Function

Next, the window function is applied to the intermediate result from Step 1.

SUM(COUNT(Name)) OVER() AS Window

Let's break this part down:

  • COUNT(Name): This is the aggregate value that the window function will operate on. From Step 1, these values are 2, 6, and 1.

  • OVER(): The empty OVER() clause defines the "window" or the set of rows the function should consider. When it's empty, it means the window is the entire result set from Step 1.

  • SUM(...): This function is applied over the window. It sums the COUNT(Name) values from all rows in the window.

So, it performs the calculation: SUM(2 + 6 + 1) = 9.

Because the window (OVER()) includes all rows, this same result (9) is applied to every row of the output. This is how the Window column is generated.
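The two levels of counting can be checked in plain Python, using the species values from the sample data above:

```python
from collections import Counter

species = ["Cat", "Cat", "Dog", "Dog", "Dog", "Dog", "Dog", "Dog", "Rabbit"]

# Step 1: GROUP BY Species -> one count per group.
group_counts = Counter(species)

# Step 2: SUM(COUNT(Name)) OVER () -> the sum of the group counts,
# repeated on every output row.
window_total = sum(group_counts.values())

for sp in sorted(group_counts):
    print(sp, group_counts[sp], window_total)
# Cat 2 9 / Dog 6 9 / Rabbit 1 9
```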




## 5. RANGE vs ROWS vs GROUPS (Deep Dive)


## 5.1 RANGE vs ROWS


This example uses Common Table Expressions (CTEs) to perform a multi-step time-series analysis.

The overall goal is to calculate the percentage change in the number of vaccinations for a given year compared to the average number of vaccinations from the previous two years.

Let's break it down into the three main steps.

First, here are the SQL statements to create the vaccinations table and insert sample data that matches the logic and output of the query.

The script is written for PostgreSQL (note the `DATE_PART` syntax). It uses the `generate_series` function to efficiently create the required number of timestamped records for each year.


## Table Creation and Data Insertion

SQL

-- First, create the table to hold vaccination records.

-- Each row represents a single vaccination event.

CREATE TABLE vaccinations (

    id SERIAL PRIMARY KEY,

    vaccination_time TIMESTAMP NOT NULL

);


-- Now, insert the sample data.

-- We use generate_series to create the exact number of records needed per year

-- to match the counts from your query's output.


-- Insert 11 records for the year 2016

INSERT INTO vaccinations (vaccination_time)

SELECT '2016-01-01'::timestamp + (n || ' days')::interval

FROM generate_series(1, 11) AS n;


-- Insert 23 records for the year 2017

INSERT INTO vaccinations (vaccination_time)

SELECT '2017-01-01'::timestamp + (n || ' days')::interval

FROM generate_series(1, 23) AS n;


-- Insert 32 records for the year 2018

INSERT INTO vaccinations (vaccination_time)

SELECT '2018-01-01'::timestamp + (n || ' days')::interval

FROM generate_series(1, 32) AS n;


-- Insert 51 records for the year 2019

-- (The count of 51 is derived from the final row of your query's output)

INSERT INTO vaccinations (vaccination_time)

SELECT '2019-01-01'::timestamp + (n || ' days')::interval

FROM generate_series(1, 51) AS n;


After running this script, you will have a vaccinations table populated with the necessary data to reproduce the results below.

Here is the complete, single query, combining all the steps. It first calculates the total vaccinations per year, then uses that result to calculate a two-year moving average, and finally calculates the percentage change based on that average.


## Complete SQL Query

```sql
-- Step 1: Count the total number of vaccinations for each year.
WITH annual_vaccinations AS (
    SELECT
        CAST(DATE_PART('year', vaccination_time) AS INT) AS year,
        COUNT(*) AS number_of_vaccinations
    FROM
        vaccinations
    GROUP BY
        DATE_PART('year', vaccination_time)
),
-- Step 2: For each year, calculate the average of the previous two years.
annual_vaccinations_with_previous_2_year_average AS (
    SELECT
        *,
        CAST(
            AVG(number_of_vaccinations) OVER (
                ORDER BY year ASC RANGE BETWEEN 2 PRECEDING AND 1 PRECEDING
            )
        AS DECIMAL(5, 2)) AS previous_2_years_average
    FROM
        annual_vaccinations
)
-- Step 3: Calculate the percentage change based on the moving average.
SELECT
    *,
    CAST(
        (100 * number_of_vaccinations / previous_2_years_average)
    AS DECIMAL(5, 2)) AS percent_change
FROM
    annual_vaccinations_with_previous_2_year_average
ORDER BY
    year ASC;
```


Let's walk through what each step does.

## Step 1: Count Vaccinations Per Year

```sql
WITH annual_vaccinations AS (
    SELECT
        CAST(DATE_PART('year', vaccination_time) AS INT) AS year,
        COUNT(*) AS number_of_vaccinations
    FROM
        vaccinations
    GROUP BY
        DATE_PART('year', vaccination_time)
)
```


  • Purpose: This first CTE, named annual_vaccinations, aggregates the raw data.

  • How it works:

    • It extracts the year from the vaccination_time column.

    • It then GROUP BY that year and uses COUNT(*) to count the total number of vaccinations that occurred in each year.

  • Result: A simple temporary table with two columns: year and number_of_vaccinations. Based on your output, this table would start like this:


| year | number_of_vaccinations |
|------|------------------------|
| 2016 | 11 |
| 2017 | 23 |
| 2018 | 32 |
| ... | ... |



## Step 2: Calculate a 2-Year Moving Average

```sql
, annual_vaccinations_with_previous_2_year_average AS (
    SELECT
        *,
        CAST(AVG(number_of_vaccinations)
            OVER (ORDER BY year ASC RANGE BETWEEN 2 PRECEDING AND 1 PRECEDING)
            AS DECIMAL(5, 2)) AS previous_2_years_average
    FROM
        annual_vaccinations
)
```


  • Purpose: This CTE takes the yearly counts from Step 1 and calculates a moving average for each year based on the two prior years.

  • How it works:

    • This uses a window function (AVG(...) OVER (...)).

    • The key is the frame clause: RANGE BETWEEN 2 PRECEDING AND 1 PRECEDING. For each year (the "current row"), this tells the AVG function to only look at the two rows that came immediately before it.

Let's trace the calculation for previous_2_years_average:

  • For 2016: There are no preceding years, so the result is NULL.

  • For 2017: There is only one preceding year (2016, with 11 vaccinations). The average is just 11.00.

  • For 2018: The two preceding years are 2017 (23) and 2016 (11). The average is (23 + 11) / 2 = 17.00.


## Step 3: Calculate the Final Percentage Change

```sql
SELECT
    *,
    CAST((100 * number_of_vaccinations / previous_2_years_average)
        AS DECIMAL(5, 2)) AS percent_change
FROM
    annual_vaccinations_with_previous_2_year_average
ORDER BY
    year ASC;
```


  • Purpose: This is the final step where the percentage change is calculated using the results from the previous steps.

  • How it works:

    • It takes the number_of_vaccinations for the current year.

    • It divides it by the previous_2_years_average that we just calculated.

    • It multiplies by 100 to get a percentage.

Let's trace the percent_change calculation:

  • For 2017: (100 * 23 / 11.00) = 209.09%. This indicates a significant increase compared to the previous year.

  • For 2018: (100 * 32 / 17.00) = 188.24%. This shows that 2018's total was 188.24% of the average of 2016 and 2017. 📈
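The whole trace can be verified in plain Python. The yearly counts come from Step 1; the frame `RANGE BETWEEN 2 PRECEDING AND 1 PRECEDING` on `year` maps to "the two previous years, excluding the current one":

```python
counts = {2016: 11, 2017: 23, 2018: 32, 2019: 51}

results = {}
for year, n in counts.items():
    # Years falling in [year - 2, year - 1], i.e. the frame's value range.
    window = [counts[y] for y in (year - 2, year - 1) if y in counts]
    avg = round(sum(window) / len(window), 2) if window else None
    pct = round(100 * n / avg, 2) if avg else None
    results[year] = (avg, pct)

print(results)
# 2016: no preceding years (NULL); 2017: avg 11.0 -> 209.09%;
# 2018: avg 17.0 -> 188.24%; 2019: avg 27.5 -> 185.45%
```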



## RANGE vs ROWS vs GROUPS (window frames)

### A Better Example: Daily Sales

This table tracks sales amounts per day. Crucially, several days have the exact same sales_amount, creating the ties we need to see how the frame clauses differ.

### SQL Setup

```sql
CREATE TABLE daily_sales (
    sale_date    DATE,
    sales_amount INT
);

INSERT INTO daily_sales (sale_date, sales_amount) VALUES
('2025-10-01', 100), -- Group 1
('2025-10-02', 120), -- Group 2
('2025-10-03', 120), -- Group 2 (tie)
('2025-10-04', 150), -- Group 3
('2025-10-05', 150), -- Group 3 (tie)
('2025-10-06', 150), -- Group 3 (tie)
('2025-10-07', 180); -- Group 4
```


Now let's run a moving sum calculation using each of the three frame clauses. Our window will be "the current entry and the one immediately preceding it".


### 1. ROWS Frame: Counts Physical Rows

The ROWS frame is the most straightforward; it defines the window by a fixed number of physical rows before or after the current row.

**Query**

```sql
SELECT sale_date, sales_amount, SUM(sales_amount) OVER (
    ORDER BY sales_amount
    ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_sum
FROM daily_sales;
```


**Result & Explanation**

The calculation is always current_row_value + previous_row_value. It doesn't care about ties.

| sale_date | sales_amount | moving_sum | Calculation |
|-----------|--------------|------------|-------------|
| 2025-10-01 | 100 | 100 | 100 |
| 2025-10-02 | 120 | 220 | 100 + 120 |
| 2025-10-03 | 120 | 240 | 120 + 120 |
| 2025-10-04 | 150 | 270 | 120 + 150 |
| 2025-10-05 | 150 | 300 | 150 + 150 |
| 2025-10-06 | 150 | 300 | 150 + 150 |
| 2025-10-07 | 180 | 330 | 150 + 180 |



### 2. RANGE Frame: Counts by Value Distance

The RANGE frame defines the window by a range of values relative to the current row's value. For RANGE BETWEEN 1 PRECEDING AND CURRENT ROW with integer ordering, it includes rows where the value is between current_value - 1 and current_value.

**Query**

```sql
SELECT sale_date, sales_amount, SUM(sales_amount) OVER (
    ORDER BY sales_amount
    RANGE BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_sum
FROM daily_sales;
```


Result & Explanation

The behavior can be subtle. For the row with sales of 120, the range is [119, 120]. This range does not include the row with sales of 100. This is a key difference from ROWS.

| sale_date | sales_amount | moving_sum | Calculation (values in range) |
|---|---|---|---|
| 2025-10-01 | 100 | 100 | {100} |
| 2025-10-02 | 120 | 240 | {120, 120} |
| 2025-10-03 | 120 | 240 | {120, 120} |
| 2025-10-04 | 150 | 450 | {150, 150, 150} |
| 2025-10-05 | 150 | 450 | {150, 150, 150} |
| 2025-10-06 | 150 | 450 | {150, 150, 150} |
| 2025-10-07 | 180 | 180 | {180} |
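The same column can be reproduced with a Python sketch (ours, for illustration only): with integer ordering, every row whose value falls in `[current - 1, current]` is in the frame, including peers of the current row.

```python
# Illustrative re-implementation (ours) of
# RANGE BETWEEN 1 PRECEDING AND CURRENT ROW with integer ordering.
def range_moving_sum(values, distance=1):
    """Sum all rows whose value lies in [current - distance, current]."""
    return [sum(w for w in values if v - distance <= w <= v) for v in values]

sales = [100, 120, 120, 150, 150, 150, 180]
print(range_moving_sum(sales))  # [100, 240, 240, 450, 450, 450, 180]
```

Because 100 is more than 1 below 120, it drops out of the frame for the 120 rows, and all three tied 150s land in each other's frames.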



### 3. GROUPS Frame: Counts Groups of Ties

The GROUPS frame defines the window by a number of distinct value groups. It treats all tied rows as a single group.

#### Query

```sql
SELECT sale_date, sales_amount, SUM(sales_amount) OVER (
    ORDER BY sales_amount
    GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS moving_sum
FROM daily_sales;
```


#### Result & Explanation

This is where the power of GROUPS shines. The window is defined as "the current group of peers plus the one preceding group of peers."

| sale_date | sales_amount | moving_sum | Calculation (groups included) |
|---|---|---|---|
| 2025-10-01 | 100 | 100 | {100} |
| 2025-10-02 | 120 | 340 | {100} + {120, 120} |
| 2025-10-03 | 120 | 340 | {100} + {120, 120} |
| 2025-10-04 | 150 | 710 | {120, 120} + {150, 150, 150} |
| 2025-10-05 | 150 | 710 | {120, 120} + {150, 150, 150} |
| 2025-10-06 | 150 | 710 | {120, 120} + {150, 150, 150} |
| 2025-10-07 | 180 | 630 | {150, 150, 150} + {180} |


### Summary

| Frame Clause | Basis of Calculation | Best For... |
|---|---|---|
| ROWS | Physical row position | A moving average over a fixed number of observations (e.g., the last 7 entries). |
| RANGE | Value of the ordering column | Calculations over a continuous value range (e.g., all sales within the last 30 days of a date). |
| GROUPS | Distinct value groups (peers) | Calculations based on distinct steps or tiers in the data (e.g., this sales tier and the one just before it). |



## 6. Ranking Functions


## 6.1. Ranking Functions: ROW_NUMBER()


## Table Setup and Sample Data

```sql
-- Drop the table if it already exists to start fresh
DROP TABLE IF EXISTS routine_checkups;

-- Create the table to store checkup records
CREATE TABLE routine_checkups (
    id SERIAL PRIMARY KEY,
    species VARCHAR(50),
    name VARCHAR(50),
    checkup_time TIMESTAMP
);

-- Insert sample data.
-- We create more than 3 animals per species to test the "top 3" filter.

-- === Cat Data ===
-- Luna: 4 checkups (will be #1)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Cat', 'Luna', '2024-01-15 10:00:00'),
('Cat', 'Luna', '2024-05-20 10:30:00'),
('Cat', 'Luna', '2024-09-10 11:00:00'),
('Cat', 'Luna', '2025-01-15 10:00:00');

-- Milo: 3 checkups (will be #2)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Cat', 'Milo', '2024-02-11 14:00:00'),
('Cat', 'Milo', '2024-07-22 14:15:00'),
('Cat', 'Milo', '2025-02-11 14:00:00');

-- Bella: 2 checkups (will be #3)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Cat', 'Bella', '2024-04-01 09:00:00'),
('Cat', 'Bella', '2025-04-01 09:30:00');

-- Leo: 1 checkup (will be filtered out)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Cat', 'Leo', '2024-08-05 16:00:00');

-- === Dog Data ===
-- Max: 5 checkups (will be #1)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Dog', 'Max', '2024-01-10 11:00:00'),
('Dog', 'Max', '2024-04-10 11:00:00'),
('Dog', 'Max', '2024-07-10 11:00:00'),
('Dog', 'Max', '2024-10-10 11:00:00'),
('Dog', 'Max', '2025-01-10 11:00:00');

-- Buddy: 4 checkups (ties for #2, ranked first because 'Buddy' < 'Daisy')
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Dog', 'Buddy', '2024-03-15 13:00:00'),
('Dog', 'Buddy', '2024-06-15 13:00:00'),
('Dog', 'Buddy', '2024-09-15 13:00:00'),
('Dog', 'Buddy', '2024-12-15 13:00:00');

-- Daisy: 4 checkups (ties for #2, ranked second by name)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Dog', 'Daisy', '2024-05-01 08:30:00'),
('Dog', 'Daisy', '2024-08-01 08:30:00'),
('Dog', 'Daisy', '2024-11-01 08:30:00'),
('Dog', 'Daisy', '2025-02-01 08:30:00');

-- Lucy: 2 checkups (will be filtered out)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Dog', 'Lucy', '2024-06-20 15:00:00'),
('Dog', 'Lucy', '2025-06-20 15:00:00');

-- Rocky: 1 checkup (will be filtered out)
INSERT INTO routine_checkups (species, name, checkup_time) VALUES
('Dog', 'Rocky', '2025-03-03 12:00:00');
```



With the data in place, here is the full query:

```sql
WITH animal_checkups AS (
  SELECT
    rc.species,
    rc.name,
    COUNT(*) AS number_of_checkups
  FROM routine_checkups rc
  GROUP BY rc.species, rc.name
),
ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY species
      ORDER BY number_of_checkups DESC, name ASC
    ) AS row_number
  FROM animal_checkups
)
SELECT species, name, number_of_checkups
FROM ranked
WHERE row_number <= 3
ORDER BY species ASC, number_of_checkups DESC, name ASC;
```



## What the Query Does

This query identifies the top 3 most frequently checked-up animals within each species.

It uses a multi-step process with Common Table Expressions (CTEs) to first count the checkups, then rank the animals, and finally filter for the top 3.


## Step 1: Count Checkups for Each Animal

```sql
WITH animal_checkups AS (
  SELECT
    rc.species,
    rc.name,
    COUNT(*) AS number_of_checkups
  FROM routine_checkups rc
  GROUP BY rc.species, rc.name
)
```


The first CTE, animal_checkups, creates a summary of the data. It groups the routine_checkups table by both species and name to find each unique animal and uses COUNT(*) to calculate the total number of checkups for each one.

The result of this step is a temporary summary table that looks like this:

| species | name | number_of_checkups |
|---|---|---|
| Cat | Luna | 4 |
| Cat | Milo | 3 |
| Cat | Bella | 2 |
| Cat | Leo | 1 |
| Dog | Max | 5 |
| Dog | Buddy | 4 |
| Dog | Daisy | 4 |
| Dog | Lucy | 2 |
| Dog | Rocky | 1 |


## Step 2: Rank Animals Within Each Species

```sql
, ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY species
      ORDER BY number_of_checkups DESC, name ASC
    ) AS row_number
  FROM animal_checkups
)
```


The second CTE, ranked, takes the summary from Step 1 and assigns a rank to each animal.

  • ROW_NUMBER() OVER (...): This is the window function that assigns the rank.

  • PARTITION BY species: This is the most important part. It tells the function to create separate rankings for each species. The rank counter will reset back to 1 for each new species. 🐈 🐕

  • ORDER BY number_of_checkups DESC, name ASC: Within each species, this sorts the animals by who had the most checkups first (DESC). If two animals have the same number of checkups, it sorts them alphabetically by name (ASC) to ensure a consistent order.

The result of this step is the previous table with a new row_number column:

| species | name | number_of_checkups | row_number |
|---|---|---|---|
| Cat | Luna | 4 | 1 |
| Cat | Milo | 3 | 2 |
| Cat | Bella | 2 | 3 |
| Cat | Leo | 1 | 4 |
| Dog | Max | 5 | 1 |
| Dog | Buddy | 4 | 2 |
| Dog | Daisy | 4 | 3 |
| Dog | Lucy | 2 | 4 |
| Dog | Rocky | 1 | 5 |


## Step 3: Filter for the Top 3

```sql
SELECT species, name, number_of_checkups
FROM ranked
WHERE row_number <= 3
ORDER BY species ASC, number_of_checkups DESC, name ASC;
```


The final SELECT statement queries the result from the ranked CTE and applies a filter.

  • WHERE row_number <= 3: This is the clause that accomplishes the main goal. It keeps only the rows that were ranked 1st, 2nd, or 3rd within their species, effectively giving you the "top 3".

  • The final ORDER BY clause simply formats the output to be clean and readable.
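The three steps above can be replayed outside the database. This Python sketch (ours; the data is condensed from the sample INSERTs) mirrors the count → rank → filter pipeline:

```python
from collections import Counter
from itertools import groupby

# Checkup records, condensed from the sample INSERTs above.
checkups = ([("Cat", "Luna")] * 4 + [("Cat", "Milo")] * 3 + [("Cat", "Bella")] * 2 +
            [("Cat", "Leo")] + [("Dog", "Max")] * 5 + [("Dog", "Buddy")] * 4 +
            [("Dog", "Daisy")] * 4 + [("Dog", "Lucy")] * 2 + [("Dog", "Rocky")])

counts = Counter(checkups)  # Step 1: GROUP BY species, name
# Sort by species, then checkups DESC, then name ASC (the window's ORDER BY)
rows = sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1], kv[0][1]))

top3 = []
for species, group in groupby(rows, key=lambda kv: kv[0][0]):       # PARTITION BY species
    for row_number, ((sp, name), n) in enumerate(group, start=1):   # Step 2: ROW_NUMBER()
        if row_number <= 3:                                         # Step 3: WHERE row_number <= 3
            top3.append((sp, name, n))

print(top3)
# [('Cat', 'Luna', 4), ('Cat', 'Milo', 3), ('Cat', 'Bella', 2),
#  ('Dog', 'Max', 5), ('Dog', 'Buddy', 4), ('Dog', 'Daisy', 4)]
```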


## 6.2. Ranking Functions: RANK() and DENSE_RANK()


The main difference between RANK() and DENSE_RANK() is how they handle the ranking number that comes after a tie: RANK() leaves gaps, while DENSE_RANK() does not.

Here are examples using the same sample dataset to make the difference clear.


## SQL Setup and Sample Data

First, let's create a simple table to store scores from a competition. The data is intentionally designed with several ties to show how each function behaves.

```sql
-- Create a table for competition scores
CREATE TABLE competition_scores (
    student_name VARCHAR(50),
    score INT
);

-- Insert sample data with several ties
INSERT INTO competition_scores (student_name, score) VALUES
('Alice', 98),
('Bob', 95),     -- tie for 2nd
('Charlie', 95), -- tie for 2nd
('Diana', 92),
('Eve', 88),     -- 3-way tie
('Frank', 88),   -- 3-way tie
('Grace', 88),   -- 3-way tie
('Henry', 85);
```



## RANK(): Ranks with Gaps

The RANK() function assigns the same rank to tied values, but it then skips the next rank(s) in the sequence. The rank number reflects the number of rows that came before it.

### Query

```sql
SELECT
    student_name,
    score,
    RANK() OVER (ORDER BY score DESC) AS rank_with_gaps
FROM
    competition_scores;
```


### Result

| student_name | score | rank_with_gaps |
|---|---|---|
| Alice | 98 | 1 |
| Bob | 95 | 2 |
| Charlie | 95 | 2 |
| Diana | 92 | 4 |
| Eve | 88 | 5 |
| Frank | 88 | 5 |
| Grace | 88 | 5 |
| Henry | 85 | 8 |

### Explanation

  • Alice is ranked 1st.

  • Bob and Charlie tie for 2nd place.

  • Because two people occupy the 2nd position, the next rank is 4th (Diana). The 3rd rank is skipped.

  • Similarly, after three students tie for 5th place, the next rank is 8th (Henry). Ranks 6 and 7 are skipped.


## DENSE_RANK(): Ranks without Gaps

The DENSE_RANK() function also assigns the same rank to tied values, but it does not skip any numbers in the sequence. It always assigns the next consecutive rank after a tie.

### Query

```sql
SELECT
    student_name,
    score,
    DENSE_RANK() OVER (ORDER BY score DESC) AS rank_without_gaps
FROM
    competition_scores;
```


### Result

| student_name | score | rank_without_gaps |
|---|---|---|
| Alice | 98 | 1 |
| Bob | 95 | 2 |
| Charlie | 95 | 2 |
| Diana | 92 | 3 |
| Eve | 88 | 4 |
| Frank | 88 | 4 |
| Grace | 88 | 4 |
| Henry | 85 | 5 |

### Explanation

  • Alice is ranked 1st.

  • Bob and Charlie tie for 2nd place.

  • The next student, Diana, is ranked 3rd. No rank is skipped.

  • After the three-way tie for 4th place, Henry is ranked 5th.
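One way to internalize the difference: RANK() counts rows with a strictly better score, while DENSE_RANK() counts distinct better scores. A Python sketch (ours, not PostgreSQL's implementation) over the same dataset:

```python
def rank_with_gaps(scores):
    # RANK(): 1 + number of rows with a strictly better score
    return [1 + sum(1 for s in scores if s > v) for v in scores]

def dense_rank(scores):
    # DENSE_RANK(): 1 + number of DISTINCT better scores
    distinct = sorted(set(scores), reverse=True)
    return [1 + distinct.index(v) for v in scores]

scores = [98, 95, 95, 92, 88, 88, 88, 85]  # already in ORDER BY score DESC order
print(rank_with_gaps(scores))  # [1, 2, 2, 4, 5, 5, 5, 8]
print(dense_rank(scores))      # [1, 2, 2, 3, 4, 4, 4, 5]
```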


## Key Takeaway

| Function | Tie Handling | Gaps? | Use Case Example |
|---|---|---|---|
| RANK() | Same rank for ties | Yes | Finding a "Top 10" list where ties mean fewer spots are available. |
| DENSE_RANK() | Same rank for ties | No | Grouping items into distinct performance tiers (e.g., Gold, Silver, Bronze). |



## 6.3. Ranking Functions: Bucketing with NTILE(n)




The NTILE(n) window function is a powerful tool for dividing your data into a specific number of ranked groups, like deciles (NTILE(10)) or quartiles (NTILE(4)).

Here is the sample data setup, followed by a detailed explanation of how the function works with that data.


## SQL Setup and Sample Data

This script creates an admissions table and efficiently inserts 23 sample records with sequential dates. We use 23 records specifically because it's not evenly divisible by 10, which clearly demonstrates how NTILE handles remainders.

```sql
-- Drop the table if it exists to ensure a fresh start
DROP TABLE IF EXISTS admissions;

-- Create the table to hold patient admission records
CREATE TABLE admissions (
    patient_id SERIAL PRIMARY KEY,
    admission_date DATE NOT NULL
);

-- Insert 23 records with dates from Jan 2nd to Jan 24th, 2025
INSERT INTO admissions (admission_date)
SELECT '2025-01-01'::date + n  -- date + integer adds whole days
FROM generate_series(1, 23) AS n;
```

## Query and Explanation

Here is the query that uses NTILE(10) on our sample data.

```sql
SELECT
    patient_id,
    admission_date,
    NTILE(10) OVER (ORDER BY admission_date ASC) AS ten_segments
FROM
    admissions
ORDER BY
    admission_date ASC;
```


### What It Does

The query's main purpose is to assign each admission record to one of ten "segments" or buckets based on its admission_date.

  • NTILE(10): This specifies that we want to divide the data into 10 groups.

  • OVER (ORDER BY admission_date ASC): This is the crucial part. Before creating the groups, the function must sort all 23 rows by the admission_date in ascending order (earliest to latest).

  • AS ten_segments: This is the name given to the new column that will hold the bucket number (1 through 10).


## How the Buckets Are Formed

This is where the logic of NTILE becomes clear.

  1. Count the Rows: We have 23 total admission records.

  2. Determine Group Size: We want to divide these 23 rows into 10 buckets.

    • 23 / 10 = 2 with a remainder of 3.

  3. Distribute the Remainder: The NTILE function handles the remainder by adding one extra row to the first few buckets.

    • The first 3 buckets (the remainder) will get the base size of 2 plus 1 extra row, for a total of 3 rows each.

    • The remaining 7 buckets (buckets 4 through 10) will get the base size of 2 rows each.

So, when you run the query, the output will look like this:



| patient_id | admission_date | ten_segments |
|---|---|---|
| 1 | 2025-01-02 | 1 |
| 2 | 2025-01-03 | 1 |
| 3 | 2025-01-04 | 1 |
| 4 | 2025-01-05 | 2 |
| 5 | 2025-01-06 | 2 |
| 6 | 2025-01-07 | 2 |
| 7 | 2025-01-08 | 3 |
| 8 | 2025-01-09 | 3 |
| 9 | 2025-01-10 | 3 |
| 10 | 2025-01-11 | 4 |
| 11 | 2025-01-12 | 4 |
| 12 | 2025-01-13 | 5 |
| 13 | 2025-01-14 | 5 |
| 14 | 2025-01-15 | 6 |
| 15 | 2025-01-16 | 6 |
| 16 | 2025-01-17 | 7 |
| 17 | 2025-01-18 | 7 |
| 18 | 2025-01-19 | 8 |
| 19 | 2025-01-20 | 8 |
| 20 | 2025-01-21 | 9 |
| 21 | 2025-01-22 | 9 |
| 22 | 2025-01-23 | 10 |
| 23 | 2025-01-24 | 10 |

As you can see, the earliest 3 admissions are assigned to bucket 1, the next 3 are in bucket 2, the next 3 are in bucket 3, and then the buckets starting from 4 onwards contain only 2 records each.
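The remainder-distribution rule can be stated in a few lines of Python (a sketch of the rule, ours, not PostgreSQL's actual implementation):

```python
def ntile_buckets(n_buckets, n_rows):
    """Bucket number for each of n_rows rows, NTILE-style."""
    # base bucket size, plus how many leading buckets get one extra row
    base, extra = divmod(n_rows, n_buckets)
    sizes = [base + 1 if b < extra else base for b in range(n_buckets)]
    out = []
    for bucket, size in enumerate(sizes, start=1):
        out.extend([bucket] * size)
    return out

print(ntile_buckets(10, 23))
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10]
```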



## 7. Conditional Aggregation


## The FILTER Clause: Conditional Aggregation

The FILTER clause lets you apply an aggregate function to a subset of rows within your window, effectively adding a WHERE condition to the aggregation itself. This is often much cleaner than using a CASE statement inside the aggregate function.

### SQL Setup and Sample Data

Let's use a table of product sales, where we want to count total sales and "large" sales (over $500) per month.

```sql
CREATE TABLE product_sales (
    sale_month DATE,
    sale_amount INT
);

INSERT INTO product_sales (sale_month, sale_amount) VALUES
('2025-01-01', 200),
('2025-01-01', 600), -- large sale
('2025-01-01', 350),
('2025-02-01', 400),
('2025-02-01', 800), -- large sale
('2025-02-01', 950); -- large sale
```


### Query

This query calculates the total sales count per month and, using FILTER, the count of large sales in that same month.

```sql
SELECT DISTINCT
    sale_month,
    COUNT(*) OVER (PARTITION BY sale_month) AS total_sales_count,
    COUNT(*) FILTER (WHERE sale_amount > 500) OVER (PARTITION BY sale_month) AS large_sales_count
FROM
    product_sales
ORDER BY
    sale_month;
```


### Result

| sale_month | total_sales_count | large_sales_count |
|---|---|---|
| 2025-01-01 | 3 | 1 |
| 2025-02-01 | 3 | 2 |


### Explanation

  • For January, the window contains three sales. The total_sales_count is correctly 3.

  • The large_sales_count aggregate operates on this same window, but the FILTER (WHERE sale_amount > 500) clause tells it to count only the rows that meet the condition. In January, only the $600 sale qualifies, so the count is 1.
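The per-month logic boils down to "count everything vs. count only matching rows". A Python sketch (ours, with the sample amounts hard-coded) shows the two counts side by side:

```python
# Sketch (ours) of COUNT(*) vs COUNT(*) FILTER (...) per partition.
sales_by_month = {
    "2025-01-01": [200, 600, 350],
    "2025-02-01": [400, 800, 950],
}

counts = {
    month: (len(amounts),                        # COUNT(*) OVER (PARTITION BY sale_month)
            sum(1 for a in amounts if a > 500))  # ... FILTER (WHERE sale_amount > 500)
    for month, amounts in sales_by_month.items()
}
print(counts)  # {'2025-01-01': (3, 1), '2025-02-01': (3, 2)}
```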


## The EXCLUDE Clause: Refining the Frame

The EXCLUDE clause lets you remove specific rows from the window frame after it has been defined by ROWS, RANGE, or GROUPS. Its most common use is to compare a row's value to the aggregate of its peers, excluding itself.

### SQL Setup and Sample Data

Let's use a table of employee salaries by department. Our goal is to see how each employee's salary compares to the average of everyone else in their department.

```sql
CREATE TABLE employee_salaries (
    department VARCHAR(50),
    employee_name VARCHAR(50),
    salary INT
);

INSERT INTO employee_salaries (department, employee_name, salary) VALUES
('Sales', 'Alice',   70000),
('Sales', 'Bob',     80000),
('Sales', 'Charlie', 90000),
('HR',    'Diana',   60000),
('HR',    'Eve',     65000);
```


### Query

This query shows the standard department average and a second average that excludes the current employee's salary.

```sql
SELECT
    department,
    employee_name,
    salary,
    -- Standard average for the whole department
    AVG(salary) OVER (PARTITION BY department) AS avg_dept_salary,
    -- Average of everyone else in the department
    AVG(salary) OVER (
        PARTITION BY department
        ORDER BY salary
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        EXCLUDE CURRENT ROW
    ) AS avg_of_others
FROM
    employee_salaries;
```


### Result

| department | employee_name | salary | avg_dept_salary | avg_of_others |
|---|---|---|---|---|
| HR | Diana | 60000 | 62500 | 65000 |
| HR | Eve | 65000 | 62500 | 60000 |
| Sales | Alice | 70000 | 80000 | 85000 |
| Sales | Bob | 80000 | 80000 | 80000 |
| Sales | Charlie | 90000 | 80000 | 75000 |


### Explanation

Let's look at Alice in the 'Sales' department:

  • avg_dept_salary: The average of all three sales salaries (70000 + 80000 + 90000) / 3 is 80000.

  • avg_of_others: The EXCLUDE CURRENT ROW clause tells the function to remove Alice's own salary from the calculation. The average is therefore calculated only on Bob's and Charlie's salaries: (80000 + 90000) / 2 = 85000. This provides a more direct comparison of an individual to their peers.
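Because the frame spans the whole partition and excludes only the current row, avg_of_others for each employee is simply (partition total − own salary) / (partition size − 1). A Python sketch (ours) makes the arithmetic explicit:

```python
def avg_of_others(salaries):
    """Mean of the partition with the current row excluded, per row."""
    total = sum(salaries)
    return [(total - s) / (len(salaries) - 1) for s in salaries]

print(avg_of_others([70000, 80000, 90000]))  # Sales: [85000.0, 80000.0, 75000.0]
print(avg_of_others([60000, 65000]))         # HR: [65000.0, 60000.0]
```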


