Google BigQuery
Google BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a Platform-as-a-Service (PaaS) that supports querying using ANSI SQL.
A View can be exported as a table to a BigQuery dataset on the Google Cloud Platform. For this export, you first need to configure Google Cloud. Here’s how to do this:
Enable BigQuery API
- Log in to Google Cloud Platform and select the project to use in Mammoth;
- Go to API and Services from the sidebar;
- Click on + Enable APIs and Services and look for BigQuery API;
- Open BigQuery API and click on Enable;
Create a Service Account
-
From the sidebar, select IAM & Admin and go to Service accounts;
-
Click on Create service account;
-
In Service Account Details, enter the service account name and description. Google Cloud will then choose an email for your service account randomly. Click on Create;
-
Click into Grant this service account access to project, add two roles and select BigQuery Data Editor and BigQuery Job User to give the service account permission to complete specific actions on the resources in your project. Click Continue;
-
In Grant users access to this service account, add the email address of the end user. By default, you will be the sole owner of the service account. This step is optional. Click Done;
- For the newly created service account, go to the action menu and select Manage keys;
- Select Add key and click on Create new key.
- Select the key type as JSON (Mammoth accepts JSON type keys). Click on Create and a JSON key will be downloaded into your system. This JSON key will be required while connecting Mammoth to BigQuery.
A service account can be used for multiple projects.
- Switch to the project which needs this service account;
- From the sidebar, go to IAM & Admin and click on +ADD;
- Enter the email of the service account, create the roles and save the file.
Connect Mammoth to BigQuery
- In Mammoth, go to Exports and Share > Export to database > Already existing database > Google Bigquery.
- Upload the JSON key which was downloaded in the (TODO: link (previous step)).
Make sure you have an existing Dataset in the Google Cloud project where the View can be exported as a table. To create a Dataset in BigQuery:
- Go to Google Cloud Console;
- Select Big Data > BigQuery from the sidebar;
- Select your project from Explorer tab, and click Create Dataset.
- Select the correct Dataset and enter the table name;
- Choose a mode for future exports amongts - Replace, Combine, Merge
- Validate the connection and click Apply once it turns green after the Validation. The table is then exported to the selected Dataset.
On the Google Cloud Platform, select Google BigQuery from the sidebar. Select the same project with the key and open the Dataset you had selected in Mammoth. The table exported from Mammoth should now appear.
Three Modes of Exports in BigQuery
Mammoth offers the following modes for future BigQuery data exports:
-
Replace
-
Combine
-
Merge
-
Replace: The Replace option replaces all current data in BigQuery with new data.
For instance, if you originally had 10 rows in a BigQuery table, and the update has 20 rows. Your final output will have the new 20 rows.
- Combine: The Combine option appends the new data with the current data in BigQuery.
So if you originally had 10 rows in a BigQuery table, and have 40 new incoming rows, the Combine option will just add the 40 rows to the existing. The output will thus be 10+40 rows = 50 rows.
- Merge: The Merge option uses a combination of unique pairings (key columns) to check for updates in the current data as well as to identify new rows in the incoming data.
It then updates the current data with changes and inserts new unique rows, if there are any. This makes the exported data clean and compact and ready for visualization.
Take this table as an example:
S.No. | Value |
---|---|
1 | A |
2 | B |
3 | C |
Suppose you have exported this data to BigQuery in the Merge mode and get the following data as the next export with S.No. as the key column:
S.No. | Value |
---|---|
1 | X |
4 | Y |
3 | Z |
The system will check for changes in the existing data, here rows 1 and 3 and update those. It will then look for new rows (here 4) and insert that.
This is how the final table in BigQuery will look like:
S.No. | Value |
---|---|
1 | A |
2 | B |
3 | C |
4 | Y |
When exporting to BigQuery in the Merge mode make sure there are no duplicates in the data. Use the Remove Duplicates function in Mammoth to remove duplicates from your data. The Merge mode doesn't allow for duplicated data.
Table partitioning
Google Bigquery allows you to partition your tables based on:
- data ingestion date or
- columns such as date or integer.
Table partitioning is a great way to store data in silos. It makes querying data faster and also cuts your Bigquery costs.
In Mammoth you can enable table partitioning while exporting your data to Bigquery in a new table. Mammoth allows only date based partitioning owing to its common use case.
- You can create a maximum of 4000 partitions in Bigquery.
- You cannot edit partitioning rules in Combine or Merge modes for an already existing table in Bigquery. You can however edit partitioning rules in the Replace mode as that deletes and recreates a table.
- If you delete the Bigquery export rule in Mammoth, you won't be able to edit partitioning of the previously generated table.