Parquet

Parquet is an open-source columnar storage format that is optimized for reading and writing large-scale, compressed data. It is designed to work efficiently with big data processing frameworks, such as Apache Hadoop and Apache Spark.

Parquet stores data in a columnar format, which means that it stores each column of data separately, rather than storing all of the rows of data together. This can make it much more efficient for querying and processing large data sets, as it allows for column-wise compression and encoding, as well as data skipping, which avoids reading unnecessary data.
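To make column-wise storage and data skipping concrete, here is a minimal sketch in plain Python. It is a toy model only and does not reproduce Parquet's actual on-disk layout; the sample rows, the group size, and the function names (to_columns, to_row_groups, sum_amount_from_order) are all assumptions made for illustration. Rows are split into small groups, each group is stored column by column, and per-column min/max statistics let a query skip groups that cannot contain matching rows.

```python
# Toy model of columnar storage with min/max statistics per row group.
# Illustration only: this is NOT Parquet's real file format.

# Hypothetical sample data, already sorted by order_id.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120},
    {"order_id": 2, "region": "US", "amount": 80},
    {"order_id": 3, "region": "EU", "amount": 45},
    {"order_id": 4, "region": "APAC", "amount": 200},
    {"order_id": 5, "region": "US", "amount": 15},
    {"order_id": 6, "region": "EU", "amount": 60},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_row_groups(rows, group_size):
    """Split rows into groups, storing each group column-wise with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        columns = to_columns(rows[start:start + group_size])
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def sum_amount_from_order(groups, min_order_id):
    """Sum 'amount' for rows with order_id >= min_order_id.

    Only the order_id and amount columns are touched, and any group whose
    maximum order_id is below the threshold is skipped without being read.
    """
    total = 0
    groups_read = 0
    for group in groups:
        _, highest_id = group["stats"]["order_id"]
        if highest_id < min_order_id:
            continue  # data skipping: no row in this group can match
        groups_read += 1
        ids = group["columns"]["order_id"]
        amounts = group["columns"]["amount"]
        total += sum(a for i, a in zip(ids, amounts) if i >= min_order_id)
    return total, groups_read


groups = to_row_groups(rows, group_size=2)
total, groups_read = sum_amount_from_order(groups, min_order_id=5)
print(total, groups_read, len(groups))
```

In this sketch the query never reads the region column at all, and it reads only one of the three groups, which is the kind of saving a columnar format aims for on much larger data sets.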

Connecting to Parquet

Mammoth can read data from Parquet files and import it. Currently, Mammoth only fetches Parquet files from Azure Data Lake.

  1. Select API & Databases from the ‘Add Data’ menu and click on Parquet.

  2. Click on New Connection and log in to your Azure Data Lake account.

  3. Select the account.

  4. Once your Azure Data Lake account is connected to Mammoth, you will be presented with a list of the tables and views available in it.

    • Select the desired table to get a preview.

    • Write a SQL query or run a test query and preview the result.

    • Click on Next.

    Fig. 52 Parquet table selection

After selecting the table you want to work with, you can schedule data imports, as discussed in the next section.

Scheduling your Data Pulls

You can start retrieving the data immediately or at a specific time. You can also schedule the data imports to fetch the latest data from your database just once, or daily, weekly, or monthly.

On every data pull from your database, you can also choose one of three options: Replace all data, Add new data since last pull, or Replace with new data since new pull.

If you choose Add new data since last pull or Replace with new data since new pull, you can select a unique sequence column. On each refresh, Mammoth uses this column to pick up all rows whose value in it is greater than that of the previous data pull.
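The role of the sequence column can be sketched in a few lines of Python. This is only an illustration of the general idea, assuming the sequence column holds values that increase with every new row; the function name pull_new_rows, the column name id, and the sample rows are hypothetical and are not part of Mammoth.

```python
# Sketch of an incremental pull driven by a unique sequence column.
# Assumption: values in the sequence column only grow for newly added rows.

def pull_new_rows(source_rows, sequence_column, last_seen):
    """Return the rows added since the previous pull and the new high-water mark.

    last_seen is the largest sequence value from the previous pull,
    or None if no pull has happened yet.
    """
    new_rows = [
        row for row in source_rows
        if last_seen is None or row[sequence_column] > last_seen
    ]
    if new_rows:
        last_seen = max(row[sequence_column] for row in new_rows)
    return new_rows, last_seen


# Hypothetical source table at the time of the first pull.
source = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
]
first_batch, mark = pull_new_rows(source, "id", None)

# Two rows arrive in the source before the next scheduled pull.
source.append({"id": 3, "value": "c"})
source.append({"id": 4, "value": "d"})
second_batch, mark = pull_new_rows(source, "id", mark)

print(len(first_batch), len(second_batch), mark)
```

A column that is not unique or not increasing, such as a status flag, would not work as the sequence column in this sketch, because rows added later could be missed.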