📈 Percentile & Distribution Functions: PERCENT_RANK(), CUME_DIST(), and NTILE()

🚀 Understanding Relative Position in SQL

"Xorthax isn’t just interested in who’s first—he wants to know how far ahead they are in the galactic rankings. Whether it’s finding percentiles, calculating distributions, or creating tiered categories, these functions do the trick."

🤔 What Are PERCENT_RANK(), CUME_DIST(), and NTILE()?

These functions calculate relative standing within a partition:

PERCENT_RANK() gives a value from 0 to 1 indicating a row’s percentile rank
CUME_DIST() gives a value from 0 to 1 indicating the cumulative distribution
NTILE(n) splits rows into n buckets (e.g. quartiles, deciles)

Each helps understand "how far along" a row is within its ordered group.

🔧 Syntax Snapshot

PERCENT_RANK() OVER (PARTITION BY group ORDER BY metric)
CUME_DIST()   OVER (PARTITION BY group ORDER BY metric)
NTILE(4)      OVER (PARTITION BY group ORDER BY metric)

📋 See It in Action

🧾 Sample Data

CustomerID	Region	OrderAmount
C1	North	100
C2	North	200
C3	North	200
C4	North	400

🔎 Query Example

SELECT CustomerID, Region, OrderAmount,
       PERCENT_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount) AS PercentRank,
       CUME_DIST()    OVER (PARTITION BY Region ORDER BY OrderAmount) AS CumeDist,
       NTILE(3)       OVER (PARTITION BY Region ORDER BY OrderAmount) AS Tier
FROM Orders;

📊 Result

CustomerID	OrderAmount	PercentRank	CumeDist	Tier
C1	100	0.0	0.25	1
C2	200	0.333	0.50	2
C3	200	0.333	0.75	2
C4	400	1.0	1.0	3

🔍 How SQL Calculates These:

PERCENT_RANK() is based on the formula: (rank - 1) / (total_rows - 1). Rows with the same value share a rank, so C2 and C3 have the same PercentRank.
CUME_DIST() reflects the proportion of rows less than or equal to the current row. For example, C3 is the 3rd of 4 rows (≤ 200), so its CUME_DIST is 0.75.
NTILE(3) divides the rows as evenly as possible into 3 buckets. With 4 rows, SQL assigns 1 row to the first and third groups, and 2 rows to the middle group—starting with the lowest values and filling tiers in order.

🧭 When Should You Use Each?

Use PERCENT_RANK() to answer: "What percentile is this row in?"
Use CUME_DIST() to answer: "What portion of the group is less than or equal to this row?"
Use NTILE(n) to bucket rows into equal groups for analysis (quartiles, deciles, etc.)

🧠 Keywords That Hint You're in the Right Place

"Percentile"
"Top 10%"
"Quartile"
"How far into the group"
"Split into tiers"

⚠️ Common Pitfalls

❌ Using PERCENT_RANK() expecting uniform distribution—it depends on row count and ties.

Example: With 4 rows, the rank values might be 1, 2, 2, and 4. Using the formula (rank - 1) / (total_rows - 1) results in PercentRanks like 0.0, 0.333, 0.333, and 1.0—not a uniform spread. Tied values share the same rank, which compresses the distribution unevenly.
❌ Assuming NTILE(n) gives exactly equal-sized groups when row counts don’t divide evenly

Example: With 4 rows and NTILE(3), SQL will assign the first row to Tier 1, the next two rows to Tier 2, and the last row to Tier 3. The groups aren't perfectly equal—some may have more rows than others depending on the row count and how evenly it divides.
❌ Forgetting to ORDER BY something meaningful in the window clause

🔄 CTEs + Distribution Filtering

Want to find all customers in the top 10% by revenue?

WITH Ranked AS (
  SELECT *,
         PERCENT_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS pr
  FROM Orders
)
SELECT * FROM Ranked WHERE pr <= 0.10;

🔍 Explanation: This query uses a Common Table Expression (CTE) to calculate PERCENT_RANK() within each region, ordering by OrderAmount from highest to lowest. Then, in the outer query, it filters the results to only include rows where the percentile rank is in the top 10% (i.e., pr <= 0.10).

A CTE is required here because you can't filter directly on a window function like PERCENT_RANK() in the same SELECT block where it's calculated—doing so would result in an error. The CTE allows you to evaluate the function first, then filter in the outer query.

➕ Combine with Aggregates

Use NTILE() or CUME_DIST() to build segments for summary:

WITH Tiered AS (
  SELECT *, NTILE(4) OVER (ORDER BY OrderAmount) AS Quartile
  FROM Orders
)
SELECT Quartile, AVG(OrderAmount) AS AvgSpend
FROM Tiered
GROUP BY Quartile;

🔍 Explanation: This query uses NTILE(4) to divide all rows into four approximate quartiles based on OrderAmount. Then it calculates the average for each quartile using GROUP BY. This makes it easy to compare how spending differs across tiers.

🧰 Debugging Tips

Watch how ties affect percentile placement — Tied values can compress the range of ranks and distort spacing. Be cautious when interpreting distributions with many duplicates.
Add ROW_NUMBER() for insight into row order
Use small datasets first to test edge behavior

🧪 Practice Problems by Concept

🔗 Related Topics

(See also: RANK, Aggregate Window Functions, Frame Clauses)