📈 Percentile & Distribution Functions: PERCENT_RANK(), CUME_DIST(), and NTILE()

This guide covers SQL distribution functions like PERCENT_RANK(), CUME_DIST(), and NTILE(). You’ll learn how to assign percentiles, calculate cumulative rank, and split data into ranked tiers using window functions.


🚀 Understanding Relative Position in SQL

"Xorthax isn’t just interested in who’s first—he wants to know how far ahead they are in the galactic rankings. Whether it’s finding percentiles, calculating distributions, or creating tiered categories, these functions do the trick."


🤔 What Are PERCENT_RANK(), CUME_DIST(), and NTILE()?

These functions calculate relative standing within a partition:

  • PERCENT_RANK() gives a value from 0 to 1 indicating a row’s percentile rank
  • CUME_DIST() gives a value from 0 to 1 indicating the cumulative distribution
  • NTILE(n) splits rows into n buckets (e.g. quartiles, deciles)

Each helps understand "how far along" a row is within its ordered group.


🔧 Syntax Snapshot

SQL example: Syntax for PERCENT_RANK(), CUME_DIST(), and NTILE() window functions used to calculate percentile ranks, cumulative distribution, and quantile buckets within a group.
PERCENT_RANK() OVER (PARTITION BY group ORDER BY metric)
CUME_DIST()   OVER (PARTITION BY group ORDER BY metric)
NTILE(4)      OVER (PARTITION BY group ORDER BY metric)

📋 See It in Action

🧾 Sample Data

CustomerID Region OrderAmount
C1 North 100
C2 North 200
C3 North 200
C4 North 400

🔎 Query Example

SQL example: Calculate the percentile rank, cumulative distribution, and quantile tier for each customer's order within their region using PERCENT_RANK(), CUME_DIST(), and NTILE().
SELECT CustomerID, Region, OrderAmount,
       PERCENT_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount) AS PercentRank,
       CUME_DIST()    OVER (PARTITION BY Region ORDER BY OrderAmount) AS CumeDist,
       NTILE(3)       OVER (PARTITION BY Region ORDER BY OrderAmount) AS Tier
FROM Orders;

📊 Result

CustomerID OrderAmount PercentRank CumeDist Tier
C1 100 0.0 0.25 1
C2 200 0.333 0.50 2
C3 200 0.333 0.75 2
C4 400 1.0 1.0 3

🔍 How SQL Calculates These:

  • PERCENT_RANK() is based on the formula: (rank - 1) / (total_rows - 1). Rows with the same value share a rank, so C2 and C3 have the same PercentRank.
  • CUME_DIST() reflects the proportion of rows less than or equal to the current row. For example, C3 is the 3rd of 4 rows (≤ 200), so its CUME_DIST is 0.75.
  • NTILE(3) divides the rows as evenly as possible into 3 buckets. With 4 rows, SQL assigns 1 row to the first and third groups, and 2 rows to the middle group—starting with the lowest values and filling tiers in order.

🧭 When Should You Use Each?

  • Use PERCENT_RANK() to answer: "What percentile is this row in?"
  • Use CUME_DIST() to answer: "What portion of the group is less than or equal to this row?"
  • Use NTILE(n) to bucket rows into equal groups for analysis (quartiles, deciles, etc.)

🧠 Keywords That Hint You're in the Right Place

  • "Percentile"
  • "Top 10%"
  • "Quartile"
  • "How far into the group"
  • "Split into tiers"

⚠️ Common Pitfalls

  • ❌ Using PERCENT_RANK() expecting uniform distribution—it depends on row count and ties.

    Example: With 4 rows, the rank values might be 1, 2, 2, and 4. Using the formula (rank - 1) / (total_rows - 1) results in PercentRanks like 0.0, 0.333, 0.333, and 1.0—not a uniform spread. Tied values share the same rank, which compresses the distribution unevenly.

  • ❌ Assuming NTILE(n) gives exactly equal-sized groups when row counts don’t divide evenly

    Example: With 4 rows and NTILE(3), SQL will assign the first row to Tier 1, the next two rows to Tier 2, and the last row to Tier 3. The groups aren't perfectly equal—some may have more rows than others depending on the row count and how evenly it divides.

  • ❌ Forgetting to ORDER BY something meaningful in the window clause

🔄 CTEs + Distribution Filtering

Want to find all customers in the top 10% by revenue?

SQL example: Use a CTE and PERCENT_RANK() to filter for the top 10% of customers by order amount within each region.
WITH Ranked AS (
  SELECT *,
         PERCENT_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS pr
  FROM Orders
)
SELECT * FROM Ranked WHERE pr <= 0.10;

🔍 Explanation: This query uses a Common Table Expression (CTE) to calculate PERCENT_RANK() within each region, ordering by OrderAmount from highest to lowest. Then, in the outer query, it filters the results to only include rows where the percentile rank is in the top 10% (i.e., pr <= 0.10).

A CTE is required here because you can't filter directly on a window function like PERCENT_RANK() in the same SELECT block where it's calculated—doing so would result in an error. The CTE allows you to evaluate the function first, then filter in the outer query.


➕ Combine with Aggregates

Use NTILE() or CUME_DIST() to build segments for summary:

SQL example: Use NTILE(4) to divide orders into quartiles, then calculate the average spend per quartile using GROUP BY and AVG().
WITH Tiered AS (
  SELECT *, NTILE(4) OVER (ORDER BY OrderAmount) AS Quartile
  FROM Orders
)
SELECT Quartile, AVG(OrderAmount) AS AvgSpend
FROM Tiered
GROUP BY Quartile;

🔍 Explanation: This query uses NTILE(4) to divide all rows into four approximate quartiles based on OrderAmount. Then it calculates the average for each quartile using GROUP BY. This makes it easy to compare how spending differs across tiers.

🧰 Debugging Tips

  • Watch how ties affect percentile placement — Tied values can compress the range of ranks and distort spacing. Be cautious when interpreting distributions with many duplicates.
  • Add ROW_NUMBER() for insight into row order
  • Use small datasets first to test edge behavior


🔗 Related Topics

(See also: RANK, Aggregate Window Functions, Frame Clauses)

Bootstrap JS is working!