# Window Function

The Window Function performs calculation across a set of related rows called a **Partition**, **Group** or a **Window Frame**.

## When to use

Use window functions when you want to calculate **Sum, Average, Rank, Percentile**, and so on over a partition of data. Each group is considered a window and the rule is applied to every window.
For instance, you can use the Window Function to find:

- Cumulating sum or average values across a set of rows
- Rank of rows in each group
- Difference between the previous and the current row
- Relationship among values within a group, amongst others.

## Quickstart

Look at the following data:

In the above example, the data is grouped in 4 groups. The rows in each group share the same entry in the column `Group`

namely *Mocktails, Starter, Main Course*, and *Ice-Cream*.

Suppose we would like to do the following:

- Determine the total revenue for each group
- Regroup the data into 4 major revenue groups
- To calculate the total revenue for each of the four groups in
`Group`

, we will use the`SUM`

, (Type= Aggregate) function on the`Revenue`

column, partitioned by`Group`

. Like this:

The data appears as shown below:

You can can also calculate a running sum of `Revenue`

over the same group.

This is done by choosing the `Type`

as Accumulate.

Note that grouping is NOT mandatory. You can calculate the running total of the entire data as a single window as well:

This is done by not adding grouping.

Unlike the Group & Aggregate function, Window function does not return a single output row against each group. Instead, each row retains its separate identity and aggregated values are repeated across each of the rows.

- Regroup the data into 4 major revenue groups.

The data appears as shown below:

This is done by the NTile function. Note that here no group is chosen, and the only option used is `sort`

by `Revenue (Asc)`

column. This option decides the column based on which the data is sorted.

The configuration is shown below:

## Supported Options

Following options are provided by Window Functions:

`Calculate`

: To specify the function you want to use.`Of Column`

: To specify the column which the function should use. This is only visible when the function selected in`Calculate`

section requires a column.`Add Group`

: To specify partitioning or grouping`Type`

: To specify the type of operation over grouping that the function chosen in`Calculate`

section will operate on. Supports two options, Accumulate over partition or Aggregate over partition`N`

: To specify the number of buckets in the`NTile`

function and the value of N in`Nth_Value`

function. This option shows up only when`NTILE`

or`NTH_VALUE`

functions are selected.`Add Sort`

: To specify the sort for the window. Check usage for sort in the above image example. Also see Sorting in tasks.`Apply result into`

: With this you can apply the result into a new column or an existing one. See result documentation.`Limit ranking`

: To limit the results to (or upto) a specified rank number. Say you set the`Limit ranking`

to 3. You'll get rows with ranks - 1, 2 and 3 only. All other rows will be ignored. See Limit Ranking illustration here.

## Types of Functions

The Window functions supported by Mammoth fall into three broad categories:

### Aggregation functions

Aggregation functions include functions like - Sum, Average, Maximum, Minimum, Standard deviation, Variance, and Count. All the calculations using the following functions can be divided by a group. In case a group is not specified, then the calculations are done considering the entire data as a single window.

Refer Quickstart for practical usage of the functions. Also get familiar with Accumulate over partition and Aggregate over partition before diving into the function definitions and usage.

#### Sum

This function calculates the sum for a given numeric column.

The data appears as shown below:

The configuration is as shown below:

Check out how you can calculate running total over grouped & non-grouped data.

#### Average

This function calculates the average of a given numeric column.

The data appears as shown below:

The configuration to calculate the average (aggregated) over grouping is shown below:

You can also calculate running average by setting the `Type`

as `Accumulate`

.

Configuration for Average (accumulate) is as follows:

#### Maximum

This function determines the maximum value in a given numeric or date column.

The data appears as shown below:

The configuration is shown below:

Calculate running maximum value by setting `Type`

as `Accumulate`

.

The configuration is shown below:

#### Minimum

Similar to Maximum, calculate the minimum value in a given numeric or date column.

The data appears as shown below:

The configuration is shown below:

Calculate running minimum by setting `Type`

as `Accumulate`

.

The configuration is shown below:

#### Standard Deviation

Standard deviation represents how far a data deviates from its average/mean, or expected value. A lower value means that most of the numbers are close to the average. A higher value means that the numbers are more spread out.

The data appears as shown below:

The configuration is shown below:

Calculate the running Standard Deviation by setting `Type`

as `Accumulate`

.

The configuration is shown below:

#### Variance

Variance is used for measuring the spread of data, similar to Standard Deviation. It grows very large when there is uneven spread in the distribution of data.

The data appears as shown below:

The configuration is shown below:

You can calculate running Variance by choosing `Accumulate`

as `Type`

.

The configuration is shown below:

#### Count

Count represents the number of occurrences of a value in any column.

The data apepars as shown below:

The configuration is as below:

You can calculate running count by choosing `Accumulate`

as setting `Type`

.

The configuration is as below:

##### Accumulate over partition

When you create an operation that accumulates over grouping, it would create values in rows that successively apply the same operation over each row and accumulate that value. In the below table the **Accumulated** column shows marks Accumulated over the Student group as sum

Student | Subject | Marks | Accumulated |
---|---|---|---|

Alice | Math | 87 | 87 |

Alice | Science | 92 | 179 |

Alice | English | 87 | 266 |

Alice | Economics | 89 | 355 |

Morris | Math | 93 | 93 |

Morris | Science | 89 | 182 |

Morris | English | 82 | 264 |

Morris | Economics | 97 | 361 |

##### Aggregate over partition

When you create an operation that aggregates over grouping, it would create values in rows that provide the result of operation over all values across all rows for a column. In the below table the **Aggregated** column shows marks Aggregated over the Student group as sum

Student | Subject | Marks | Aggregated |
---|---|---|---|

Alice | Math | 87 | 355 |

Alice | Science | 92 | 355 |

Alice | English | 87 | 355 |

Alice | Economics | 89 | 355 |

Morris | Math | 93 | 361 |

Morris | Science | 89 | 361 |

Morris | English | 82 | 361 |

Morris | Economics | 97 | 361 |

### Ranking functions

The following ranking functions are supported By Mammoth. Ranking functions always require a column on which the data is sorted.

#### Row Number

Row number evaluates the number of the current row within its group.

The data appears as below:

The configuration to achieve this is shown below:

#### Rank

The `Rank`

function ranks the rows based on values in a column, but with gaps. Here, gaps mean that if there are multiple rows with same values for the selected column, the subsequent number that was repeated would be skipped. For example, In a series [A,B,B,C,D], the rank of 2nd `B`

would be 2 while rank of C would be 4.

A good use case would be to calculate the best selling product in each group as demonstrated in the example. The `Add group`

option decides the grouping, while the `sort`

option decides the column based on which ranks are assigned, and the sort order of the data.

The data appears as shown below:

The configuration to achieve this is shown below:

You can choose to limit results with the `Limit ranking`

option. Like this:

The data appears as follows:

#### Dense Rank

Similar to rank, but without gaps. For example in a series [A,B,B,C,D], the rank of 2nd B would be 2 while rank of C would be 3.

The data appears as shown below:

The configuration to achieve this is shown below:

#### Percent Rank

This represents the relative rank of the current row. Mathematically represented as: (rank - 1) / (total rows - 1). The value of percent rank ranges from 0-1, where the first row always has '0' as the value. The best and least grossing products in the given data is computed with this function.

The data appears as shown below:

The configuration is as shown below:

#### Cumulative Percent Rank

Relative rank of the current row: (number of rows preceding or peer with current row) / (total rows). Check the example below, from the cumulative percent rank values, it is evident that over 55% products generate a revenue more than 600.

The data appears as shown below:

The configuration is as shown below:

#### Ntile

The Ntile function tries to segregate the data into buckets as equally as possible. The number of buckets created is defined by the value of N. The size of each bucket is calculated by dividing the total number of rows by N, and is chosen as the nearest integer <= the quotient.

Let’s understand this with the help of a dataset. Suppose we have this dataset and we want to segregate this in 4 revenue groups. We would specify N as 4.

Since here the total number of rows is 13, the system will create 4 buckets each of size 13/4= 3.22, i.e. approx 3 rows in each bucket. The remaining rows will be adjusted equally in each bucket, starting from bucket 1.

The data appears as shown below:

The configuration is shown below:

- When N>=the total number of rows, the system creates as many buckets as possible. Say you have a dataset with 20 rows and you try to create 100 buckets. The system will try to create the largest possible number of buckets, here 20. The size of each bucket will be 1 in this case.
- To calculate percentile, you must have a dataset with more than 100 rows in that group. Setting N to 100 will not show you percentile otherwise.

### Relative functions

#### First Value

It displays the first value of a given numeric or date column in an ordered dataset. In the given example, we are determining the least grossing product in every group.

The data appears as shown below:

Determine the least grossing product by using `sort`

. This option determines the value by which first value is determined, and the sort order.

The configuration is shown below:

You can also calculate the running first value by setting the `Type`

as `Accumulate`

.

#### Last Value

Simlar to First Value, this function is used to determine the last value in a given numeric or date column in a group.

The data appears as shown below:

The configuration is shown below:

You can also calculate the running last value by setting the `Type`

as `Accumulate`

.

#### Nth Value

The Nth Value allows you to get a value from the Nth row in an ordered set of rows. In the example, we have set the value of 'N' to 2, applied partitioning by `Group Name`

and sorted `Revenue`

column in descending order to determine the second best seller.

The data appears as shown below:

The configuration is shown below:

You can also calculate the running Nth value by setting the `Type`

as `Accumulate`

.

#### Lead

Find the value in a given column in the `next`

row. This will be empty for the `last`

row in a given window frame. The `Lead`

function can be very useful for comparing the value of the current row with the value of the following row.

The data appears as shown below:

The configuration is shown below:

#### Lag

Find the value in a given column in the `previous`

row. This will be empty for the `first`

row in a given window frame.

The data appears as shown below:

The configuration is shown below:

::: info