Window Function
The Window Function performs calculation across a set of related rows called a Partition, Group or a Window Frame.
When to use
Use window functions when you want to calculate Sum, Average, Rank, Percentile, and so on over a partition of data. Each group is considered a window and the rule is applied to every window. For instance, you can use the Window Function to find:
- Cumulating sum or average values across a set of rows
- Rank of rows in each group
- Difference between the previous and the current row
- Relationship among values within a group, amongst others.
Quickstart
Look at the following data:
In the above example, the data is grouped in 4 groups. The rows in each group share the same entry in the column Group
namely Mocktails, Starter, Main Course, and Ice-Cream.
Suppose we would like to do the following:
- Determine the total revenue for each group
- Regroup the data into 4 major revenue groups
- To calculate the total revenue for each of the four groups in
Group
, we will use theSUM
, (Type= Aggregate) function on theRevenue
column, partitioned byGroup
. Like this:
The data appears as shown below:
You can can also calculate a running sum of Revenue
over the same group.
This is done by choosing the Type
as Accumulate.
Note that grouping is NOT mandatory. You can calculate the running total of the entire data as a single window as well:
This is done by not adding grouping.
Unlike the Group & Aggregate function, Window function does not return a single output row against each group. Instead, each row retains its separate identity and aggregated values are repeated across each of the rows.
- Regroup the data into 4 major revenue groups.
The data appears as shown below:
This is done by the NTile function. Note that here no group is chosen, and the only option used is sort
by Revenue (Asc)
column. This option decides the column based on which the data is sorted.
The configuration is shown below:
Supported Options
Following options are provided by Window Functions:
Calculate
: To specify the function you want to use.Of Column
: To specify the column which the function should use. This is only visible when the function selected inCalculate
section requires a column.Add Group
: To specify partitioning or groupingType
: To specify the type of operation over grouping that the function chosen inCalculate
section will operate on. Supports two options, Accumulate over partition or Aggregate over partitionN
: To specify the number of buckets in theNTile
function and the value of N inNth_Value
function. This option shows up only whenNTILE
orNTH_VALUE
functions are selected.Add Sort
: To specify the sort for the window. Check usage for sort in the above image example. Also see Sorting in tasks.Apply result into
: With this you can apply the result into a new column or an existing one. See result documentation.Limit ranking
: To limit the results to (or upto) a specified rank number. Say you set theLimit ranking
to 3. You'll get rows with ranks - 1, 2 and 3 only. All other rows will be ignored. See Limit Ranking illustration here.
Types of Functions
The Window functions supported by Mammoth fall into three broad categories:
Aggregation functions
Aggregation functions include functions like - Sum, Average, Maximum, Minimum, Standard deviation, Variance, and Count. All the calculations using the following functions can be divided by a group. In case a group is not specified, then the calculations are done considering the entire data as a single window.
Refer Quickstart for practical usage of the functions. Also get familiar with Accumulate over partition and Aggregate over partition before diving into the function definitions and usage.
Sum
This function calculates the sum for a given numeric column.
The data appears as shown below:
The configuration is as shown below:
Check out how you can calculate running total over grouped & non-grouped data.
Average
This function calculates the average of a given numeric column.
The data appears as shown below:
The configuration to calculate the average (aggregated) over grouping is shown below:
You can also calculate running average by setting the Type
as Accumulate
.
Configuration for Average (accumulate) is as follows:
Maximum
This function determines the maximum value in a given numeric or date column.
The data appears as shown below:
The configuration is shown below:
Calculate running maximum value by setting Type
as Accumulate
.
The configuration is shown below:
Minimum
Similar to Maximum, calculate the minimum value in a given numeric or date column.
The data appears as shown below:
The configuration is shown below:
Calculate running minimum by setting Type
as Accumulate
.
The configuration is shown below:
Standard Deviation
Standard deviation represents how far a data deviates from its average/mean, or expected value. A lower value means that most of the numbers are close to the average. A higher value means that the numbers are more spread out.
The data appears as shown below:
The configuration is shown below:
Calculate the running Standard Deviation by setting Type
as Accumulate
.
The configuration is shown below:
Variance
Variance is used for measuring the spread of data, similar to Standard Deviation. It grows very large when there is uneven spread in the distribution of data.
The data appears as shown below:
The configuration is shown below:
You can calculate running Variance by choosing Accumulate
as Type
.
The configuration is shown below:
Count
Count represents the number of occurrences of a value in any column.
The data apepars as shown below:
The configuration is as below:
You can calculate running count by choosing Accumulate
as setting Type
.
The configuration is as below:
Accumulate over partition
When you create an operation that accumulates over grouping, it would create values in rows that successively apply the same operation over each row and accumulate that value. In the below table the Accumulated column shows marks Accumulated over the Student group as sum
Student | Subject | Marks | Accumulated |
---|---|---|---|
Alice | Math | 87 | 87 |
Alice | Science | 92 | 179 |
Alice | English | 87 | 266 |
Alice | Economics | 89 | 355 |
Morris | Math | 93 | 93 |
Morris | Science | 89 | 182 |
Morris | English | 82 | 264 |
Morris | Economics | 97 | 361 |
Aggregate over partition
When you create an operation that aggregates over grouping, it would create values in rows that provide the result of operation over all values across all rows for a column. In the below table the Aggregated column shows marks Aggregated over the Student group as sum
Student | Subject | Marks | Aggregated |
---|---|---|---|
Alice | Math | 87 | 355 |
Alice | Science | 92 | 355 |
Alice | English | 87 | 355 |
Alice | Economics | 89 | 355 |
Morris | Math | 93 | 361 |
Morris | Science | 89 | 361 |
Morris | English | 82 | 361 |
Morris | Economics | 97 | 361 |
Ranking functions
The following ranking functions are supported By Mammoth. Ranking functions always require a column on which the data is sorted.
Row Number
Row number evaluates the number of the current row within its group.
The data appears as below:
The configuration to achieve this is shown below:
Rank
The Rank
function ranks the rows based on values in a column, but with gaps. Here, gaps mean that if there are multiple rows with same values for the selected column, the subsequent number that was repeated would be skipped. For example, In a series [A,B,B,C,D], the rank of 2nd B
would be 2 while rank of C would be 4.
A good use case would be to calculate the best selling product in each group as demonstrated in the example. The Add group
option decides the grouping, while the sort
option decides the column based on which ranks are assigned, and the sort order of the data.
The data appears as shown below:
The configuration to achieve this is shown below:
You can choose to limit results with the Limit ranking
option. Like this:
The data appears as follows:
Dense Rank
Similar to rank, but without gaps. For example in a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 3.
The data appears as shown below:
The configuration to achieve this is shown below:
Percent Rank
This represents the relative rank of the current row. Mathematically represented as: (rank - 1) / (total rows - 1). The value of percent rank ranges from 0-1, where the first row always has '0' as the value. The best and least grossing products in the given data is computed with this function.
The data appears as shown below:
The configuration is as shown below:
Cumulative Percent Rank
Relative rank of the current row: (number of rows preceding or peer with current row) / (total rows). Check the example below, from the cumulative percent rank values, it is evident that over 55% products generate a revenue more than 600.
The data appears as shown below:
The configuration is as shown below:
Ntile
The Ntile function tries to segregate the data into buckets as equally as possible. The number of buckets created is defined by the value of N. The size of each bucket is calculated by dividing the total number of rows by N, and is chosen as the nearest integer <= the quotient.
Let’s understand this with the help of a dataset. Suppose we have this dataset and we want to segregate this in 4 revenue groups. We would specify N as 4.
Since here the total number of rows is 13, the system will create 4 buckets each of size 13/4= 3.22, i.e. approx 3 rows in each bucket. The remaining rows will be adjusted equally in each bucket, starting from bucket 1.
The data appears as shown below:
The configuration is shown below:
- When N>=the total number of rows, the system creates as many buckets as possible. Say you have a dataset with 20 rows and you try to create 100 buckets. The system will try to create the largest possible number of buckets, here 20. The size of each bucket will be 1 in this case.
- To calculate percentile, you must have a dataset with more than 100 rows in that group. Setting N to 100 will not show you percentile otherwise.
Relative functions
First Value
It displays the first value of a given numeric or date column in an ordered dataset. In the given example, we are determining the least grossing product in every group.
The data appears as shown below:
Determine the least grossing product by using sort
. This option determines the value by which first value is determined, and the sort order.
The configuration is shown below:
You can also calculate the running first value by setting the Type
as Accumulate
.
Last Value
Simlar to First Value, this function is used to determine the last value in a given numeric or date column in a group.
The data appears as shown below:
The configuration is shown below:
You can also calculate the running last value by setting the Type
as Accumulate
.
Nth Value
The Nth Value allows you to get a value from the Nth row in an ordered set of rows. In the example, we have set the value of 'N' to 2, applied partitioning by Group Name
and sorted Revenue
column in descending order to determine the second best seller.
The data appears as shown below:
The configuration is shown below:
You can also calculate the running Nth value by setting the Type
as Accumulate
.
Lead
Find the value in a given column in the next
row. This will be empty for the last
row in a given window frame. The Lead
function can be very useful for comparing the value of the current row with the value of the following row.
The data appears as shown below:
The configuration is shown below:
Lag
Find the value in a given column in the previous
row. This will be empty for the first
row in a given window frame.
The data appears as shown below:
The configuration is shown below:
::: info