Skip to main content

Troubleshooting Joins

When working with disparate Datasets, rarely does it happen that you find all the information you need at one place and also neatly structured. (Lucky, if you do.) So, what happens in an usual case?

Typically different pieces of data are stored in different tables. Depending on the needs, you may need to extract subsets of data from different sources or tables and join them together. You can use a Join rule. This is the most common and most powerful rule to get data organized.

It is, however, also quite misunderstood and sometimes, leads to unwanted results. Today we are going to unravel the two most common traps people fall into when working with Joins.

Join basics

Before we go further, let us quickly see when do you do joins through a practical example. Suppose you have the following table containing marks against student ID and names:

IDNameMarks
1Adam46
2Adam42
3John44
4Mary45

You have another table containing student ID, names and other personal details like gender, age, class, etc.

IDNameGenderAgeClass
1AdamM10V
2AdamM11VI
3JohnM11VI
4MaryF10V

You want to see them together in one table as follows:

IDNameGenderAgeClassMarks
1AdamM10V46
2AdamM11VI42
3JohnM11VI44
4MaryF10V45

To achieve the above, you would join the two tables using the Join rule.

To receive data from the second table, you would add columns to the first table. Then you will match ID in both the tables to fill in the corresponding rows.

Notice that we are matching IDs to join here. The matching columns are called Primary keys or Keys in Join. To know more about the various kinds of Joins read here.

2 common Join traps

Out of all the things that could go wrong with Join, these two problems occur more frequently than any other:

  1. Get more than expected rows
  2. Joined columns have unexpected empty values

1. More than expected rows:

A Join rule can result in a lot more rows in your data. Sometimes, this is the desired behaviour, but most often it is from a mistake in how the join is specified.

This all depends on the key you pickup for the Join.

Confused? Let me break it down even further.

Suppose you are LEFT joining. In the example above, imagine that you picked the name as the key instead of ID. You would immediately see results with bloated rows:

IDNameGenderAgeClassMarks
1AdamM10V46
1AdamM11VI42
2AdamM10V46
2AdamM11VI42
3JohnM11VI44
4MaryF10V45

Suddenly, you have 6 rows of data. This is so because you had repeating value of the name Adam in both the tables, and so for each value of it in first table, both values are picked up from the second table thus making 4 rows.

note

In Mammoth, the system blocks exceptionally huge row count surge with an error called the row count bloating error.

2. Empty rows:

This is another Join trap you can fall into. In Joins, you may encounter that for certain rows the results are unexpectedly blank. It happens when the keys in the two tables don't really have matching strings as it might seem. There may be slight differences in the two entries that skip the eye.

In our experience, we have come to see that the common reasons for mismatch in these cases are:

  1. Blank cells
  2. Whitespaces
  3. Case sensitiveness
  4. Mismatching date strings

Let's go through these one by one.

1. Empty rows due to blank cells

Number one reason for blank cells in Join results is blank cells in key columns. There could be blank entries in both the key columns or it can only be in one of them.

Look at the following picture for example, you’ll see how we got some blank Brand and Device entries here after Join:

Empty rows in join result

This is because when we tried to bring in Brand and Device columns from the source dataset with Price column as one of the keys, it didn’t match as there were blank entries for Price column in the Destination DS.

Image showing non-empty rows in Source table

This is the configuration we used here:

Join configuration

2. Empty rows due to whitespaces

Another reason for empty cells in Join could be whitespaces. Not to mention, whitespaces are hard to catch with naked eyes. But if you’re encountering empty rows in Join results it may be a good starting point. Check your text key columns for whitespaces.

Continuing with the above Dataset. This time let's add an additional space to values in one of the key columns. Say the competitor company column. We added an extra space between "Vape" & "Store" in the value “Vape Store” using the Label & insert function. Now when we try to bring in Brand and Device data from the source columns, this is what we end up getting:

Empty cells in Join results due to whitespace

I am sure you can tell why.

This is why you should hunt for whitespaces if you are using text columns as keys. Now there can be cases where your text columns are perfectly fine without any extra whitespace. In that case, jump to the next reason.

3. Empty rows due to case sensitiveness

Another common reason behind empty cells in Join has to do with casing. If you are matching two key columns down to the case, the system would try to find the exact match before it joins the two tables. In case of mismatch, it returns blank cells.

See the following example. Here, we have purposely changed values in one of the key columns to lowercase and tried the same Join with the Source DS. This is what we got:

Empty cells in Join results due to case mismatch

If you're getting similar results, the casing might be to blame. Check the text key columns for casing differences in the two tables and standardize it to get the desired results.

4. Empty rows due to date mismatch

Quite often date entries vary from table to table, even if they are essentially tied to some common element. This is because date can be formatted in different ways and some of them could hide the details of time.

For example, you can record your dates as dd/mm/yy and also as mm/dd/yy, and as yy/mm/dd, and as dd-mm-yy, 00:00:00, etc.

If your key columns are two date columns with different formatting, it may hide parts of the actual value.

Suppose you have 18 Aug 2022 12.20 pm in source view and 18 Aug 2022 10:00 am" in the foreign view as the actual value, but both source and foreign view could be formatted as just *18 Aug 2022 in the key column. You see them to be same and expect join to match, but since actual values are different, join will return blank cells.

Date entries recorded differently may appear the same when it reality it's not. This is why we encourage analysts to pay special attention to date columns formatting before using them as keys in Join.

Conclusion

Having a clear understanding of the Join function helps you anticipate the results accurately. Besides, being aware of the common traps help a lot. We hope this article shattered a few misconceptions around Joins.

Have questions? Write to us at mailto:contact@mammoth.io.