Skip to main content

Window Function

The Window Function performs calculation across a set of related rows called a Partition, Group or a Window Frame.

When to use

Use window functions when you want to calculate Sum, Average, Rank, Percentile, and so on over a partition of data. Each group is considered a window and the rule is applied to every window. For instance, you can use the Window Function to find:

  1. Cumulating sum or average values across a set of rows
  2. Rank of rows in each group
  3. Difference between the previous and the current row
  4. Relationship among values within a group, amongst others.

Quickstart

Look at the following data:

Window function example

In the above example, the data is grouped in 4 groups. The rows in each group share the same entry in the column Group namely Mocktails, Starter, Main Course, and Ice-Cream.

Suppose we would like to do the following:

  1. Determine the total revenue for each group
  2. Regroup the data into 4 major revenue groups
  3. To calculate the total revenue for each of the four groups in Group, we will use the SUM, (Type= Aggregate) function on the Revenue column, partitioned by Group. Like this:

Sum over a group configuration

The data appears as shown below:

Sum over a group

You can can also calculate a running sum of Revenue over the same group.

Running total over a group

This is done by choosing the Type as Accumulate.

Running total over a group configuration

Note that grouping is NOT mandatory. You can calculate the running total of the entire data as a single window as well:

Running total without group

This is done by not adding grouping.

Running total without group configuration

note

Unlike the Group & Aggregate function, Window function does not return a single output row against each group. Instead, each row retains its separate identity and aggregated values are repeated across each of the rows.

  1. Regroup the data into 4 major revenue groups.

The data appears as shown below:

Regrouping data based on revenue

This is done by the NTile function. Note that here no group is chosen, and the only option used is sort by Revenue (Asc) column. This option decides the column based on which the data is sorted.

The configuration is shown below:

Regrouping data based on price configuration

Supported Options

Following options are provided by Window Functions:

  1. Calculate: To specify the function you want to use.
  2. Of Column: To specify the column which the function should use. This is only visible when the function selected in Calculate section requires a column.
  3. Add Group: To specify partitioning or grouping
  4. Type: To specify the type of operation over grouping that the function chosen in Calculate section will operate on. Supports two options, Accumulate over partition or Aggregate over partition
  5. N: To specify the number of buckets in the NTile function and the value of N in Nth_Value function. This option shows up only when NTILE or NTH_VALUE functions are selected.
  6. Add Sort: To specify the sort for the window. Check usage for sort in the above image example. Also see Sorting in tasks.
  7. Apply result into: With this you can apply the result into a new column or an existing one. See result documentation.
  8. Limit ranking: To limit the results to (or upto) a specified rank number. Say you set the Limit ranking to 3. You'll get rows with ranks - 1, 2 and 3 only. All other rows will be ignored. See Limit Ranking illustration here.

Types of Functions

The Window functions supported by Mammoth fall into three broad categories:

  1. Aggregation Functions
  2. Ranking Functions
  3. Relative Functions

Aggregation functions

Aggregation functions include functions like - Sum, Average, Maximum, Minimum, Standard deviation, Variance, and Count. All the calculations using the following functions can be divided by a group. In case a group is not specified, then the calculations are done considering the entire data as a single window.

Refer Quickstart for practical usage of the functions. Also get familiar with Accumulate over partition and Aggregate over partition before diving into the function definitions and usage.

Sum

This function calculates the sum for a given numeric column.

The data appears as shown below:

Calculating sum

The configuration is as shown below:

Calculating sum configuration

Check out how you can calculate running total over grouped & non-grouped data.

Average

This function calculates the average of a given numeric column.

The data appears as shown below:

Calculating running average

The configuration to calculate the average (aggregated) over grouping is shown below:

Calculating average (aggregate) configuration

You can also calculate running average by setting the Type as Accumulate.

Calculating average (accumulate) configuration

Configuration for Average (accumulate) is as follows:

Calculating running average

Maximum

This function determines the maximum value in a given numeric or date column.

The data appears as shown below:

Calculating Max value

The configuration is shown below:

Calculating Max value (aggregate) configuration

Calculate running maximum value by setting Type as Accumulate.

Calculating Max value

The configuration is shown below:

Calculating Max value (accumulate) configuration

Minimum

Similar to Maximum, calculate the minimum value in a given numeric or date column.

The data appears as shown below: Calculating Min value

The configuration is shown below:

Calculating Min value configuration

Calculate running minimum by setting Type as Accumulate.

Calculating Min value

The configuration is shown below:

Calculating Min value configuration

Standard Deviation

Standard deviation represents how far a data deviates from its average/mean, or expected value. A lower value means that most of the numbers are close to the average. A higher value means that the numbers are more spread out.

The data appears as shown below:

Calculating Standard Deviation

The configuration is shown below:

Calculating Standard Deviation configuration

Calculate the running Standard Deviation by setting Type as Accumulate.

Calculating Standard Deviation

The configuration is shown below:

Calculating Standard Deviation configuration

Variance

Variance is used for measuring the spread of data, similar to Standard Deviation. It grows very large when there is uneven spread in the distribution of data.

The data appears as shown below:

Calculating Variance

The configuration is shown below:

Calculating Variance Configuration

You can calculate running Variance by choosing Accumulate as Type.

Calculating Variance Configuration

The configuration is shown below:

Calculating Variance Configuration

Count

Count represents the number of occurrences of a value in any column.

The data apepars as shown below:

Calculating Count

The configuration is as below:

Calculating Count configuration

You can calculate running count by choosing Accumulate as setting Type.

Calculating Count configuration

The configuration is as below:

Calculating Count configuration

Accumulate over partition

When you create an operation that accumulates over grouping, it would create values in rows that successively apply the same operation over each row and accumulate that value. In the below table the Accumulated column shows marks Accumulated over the Student group as sum

StudentSubjectMarksAccumulated
AliceMath8787
AliceScience92179
AliceEnglish87266
AliceEconomics89355
MorrisMath9393
MorrisScience89182
MorrisEnglish82264
MorrisEconomics97361
Aggregate over partition

When you create an operation that aggregates over grouping, it would create values in rows that provide the result of operation over all values across all rows for a column. In the below table the Aggregated column shows marks Aggregated over the Student group as sum

StudentSubjectMarksAggregated
AliceMath87355
AliceScience92355
AliceEnglish87355
AliceEconomics89355
MorrisMath93361
MorrisScience89361
MorrisEnglish82361
MorrisEconomics97361

Ranking functions

The following ranking functions are supported By Mammoth. Ranking functions always require a column on which the data is sorted.

Row Number

Row number evaluates the number of the current row within its group.

The data appears as below:

Row number

The configuration to achieve this is shown below:

Row number Configuration

Rank

The Rank function ranks the rows based on values in a column, but with gaps. Here, gaps mean that if there are multiple rows with same values for the selected column, the subsequent number that was repeated would be skipped. For example, In a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 4.

A good use case would be to calculate the best selling product in each group as demonstrated in the example. The Add group option decides the grouping, while the sort option decides the column based on which ranks are assigned, and the sort order of the data.

The data appears as shown below:

Rank

The configuration to achieve this is shown below:

Rank Configuration

You can choose to limit results with the Limit ranking option. Like this:

Rank Configuration

The data appears as follows:

Rank Configuration

Dense Rank

Similar to rank, but without gaps. For example in a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 3.

The data appears as shown below:

Dense Rank

The configuration to achieve this is shown below:

Dense Rank Configuration

Percent Rank

This represents the relative rank of the current row. Mathematically represented as: (rank - 1) / (total rows - 1). The value of percent rank ranges from 0-1, where the first row always has '0' as the value. The best and least grossing products in the given data is computed with this function.

The data appears as shown below:

Percent Rank data

The configuration is as shown below:

Percent Rank Configuration

Cumulative Percent Rank

Relative rank of the current row: (number of rows preceding or peer with current row) / (total rows). Check the example below, from the cumulative percent rank values, it is evident that over 55% products generate a revenue more than 600.

The data appears as shown below:

Cumulative Percent Rank

The configuration is as shown below:

Cumulative Percent Rank Configuration

Ntile

The Ntile function tries to segregate the data into buckets as equally as possible. The number of buckets created is defined by the value of N. The size of each bucket is calculated by dividing the total number of rows by N, and is chosen as the nearest integer <= the quotient.

Let’s understand this with the help of a dataset. Suppose we have this dataset and we want to segregate this in 4 revenue groups. We would specify N as 4.

Example dataset

Since here the total number of rows is 13, the system will create 4 buckets each of size 13/4= 3.22, i.e. approx 3 rows in each bucket. The remaining rows will be adjusted equally in each bucket, starting from bucket 1.

The data appears as shown below:

Ntile

The configuration is shown below:

Ntile configuration

note
  1. When N>=the total number of rows, the system creates as many buckets as possible. Say you have a dataset with 20 rows and you try to create 100 buckets. The system will try to create the largest possible number of buckets, here 20. The size of each bucket will be 1 in this case.
  2. To calculate percentile, you must have a dataset with more than 100 rows in that group. Setting N to 100 will not show you percentile otherwise.

Relative functions

First Value

It displays the first value of a given numeric or date column in an ordered dataset. In the given example, we are determining the least grossing product in every group.

The data appears as shown below:

First Value

Determine the least grossing product by using sort. This option determines the value by which first value is determined, and the sort order.

The configuration is shown below:

First Value configuration

You can also calculate the running first value by setting the Type as Accumulate.

Last Value

Simlar to First Value, this function is used to determine the last value in a given numeric or date column in a group.

The data appears as shown below:

Last Value

The configuration is shown below: Last Value configuration

You can also calculate the running last value by setting the Type as Accumulate.

Nth Value

The Nth Value allows you to get a value from the Nth row in an ordered set of rows. In the example, we have set the value of 'N' to 2, applied partitioning by Group Name and sorted Revenue column in descending order to determine the second best seller.

The data appears as shown below:

Nth Value Value

The configuration is shown below:

Nth Value Configuration

You can also calculate the running Nth value by setting the Type as Accumulate.

Lead

Find the value in a given column in the next row. This will be empty for the last row in a given window frame. The Lead function can be very useful for comparing the value of the current row with the value of the following row.

The data appears as shown below:

Lead Value Value

The configuration is shown below:

Lead Value Configuration

Lag

Find the value in a given column in the previous row. This will be empty for the first row in a given window frame.

The data appears as shown below:

Lag Value Value

The configuration is shown below:

Lag Value Configuration

::: info