Before conducting data analysis, the first step typically taken by an analytics team is trying to understand the contents of the dataset, the table and column structures, and the relationships between those tables. Personally, I usually start by reading the table names and then looking at the table contents to understand the function and the type of data stored within them. I also verify the names of the existing columns and try to identify key fields such as document dates, creation dates, update dates, and foreign keys to other tables (for example, to determine if the salesorderline table has a relationship with the product table, I would look for a column like productid). Additionally, if any documentation is provided for the database, I will try to thoroughly understand it first. These steps are commonly referred to as data profiling
What is Data Profiling?
Data profiling is the process of examining and understanding the contents of a dataset. This process typically aims to discover existing patterns, data types, relationships between tables, and any errors or gaps, such as inconsistent data. More specifically, data profiling includes:
- Understanding the Structure: Understanding the data type of each column (numeric, text, date, date with timestamp, etc.) and identifying non-compliant data.
- Understanding Data Content: Reading and understanding the data contained within a table and the purpose for which it is stored. Here, it is crucial to know the context of the dataset source itself, such as the industry the business operates in, the company size, etc.
- Data Quality Analysis: Detecting missing values, duplicate data, and ensuring values are stored in the correct format.
- Understanding Table Relationships: Confirming the relationships between tables and determining which columns establish these relationships (primary keys and foreign keys).
Data Profiling in Metabase
How do you perform data profiling in Metabase? Metabase significantly simplifies the tasks required for data profiling by providing features that are accessible with just a single click, allowing you to visualize data distribution effectively. Here are some of the steps:
Browse Data
Through Browse Data, you can view the entire contents of the dataset that has been connected to Metabase. Simply select the dataset you want to view, and Metabase will immediately display all its tables. Click on the table you wish to analyze, and all the columns and data from that table will be displayed instantly.
Summarize
Through Summarize, we can perform basic data profiling, such as counting rows, summing values, counting distinct values, and more. An example of what can be done with this feature is counting the number of transactions based on existing dimensions or segments. If you have a sales_order table and want to count the number of orders by sales channel, you can count the transactions using “Summarize -> Count rows” and group it by the sales channel column. Metabase will immediately display the count results using an appropriate visualization, such as a bar chart.
Filter
Filters allow you to narrow down data to find specific information or ensure the displayed data is relevant. For instance, suppose you want to calculate the monthly sales volume based on the order date. To calculate the sales volume accurately, you only want to count orders with a completed transaction status. To do this, apply a Filter by searching for a column like order status and filtering for the value “Completed”. After that, use “Summarize -> Sum of sales revenue” grouped by the date ordered.
These are just a few examples of what can be done through Metabase to better understand the structure and content of your dataset. Aside from displaying statistics as shown above, the main advantage of using Metabase is its intuitive graphical data presentation that highlights trends and data directions, rather than just showing raw tables and columns. Ease of visualization is a crucial element in data analysis, and Metabase is incredibly helpful because, with just a few clicks, you can generate queries that would previously require learning SQL.
We have discussed how to perform basic data profiling, such as finding out how many sales orders are grouped by order status or looking for sales orders placed within the current month. These steps are useful for helping users understand the business or potential issues arising within it—provided they are equipped with the proper context and business knowledge. This understanding will also help users determine the metrics they need to achieve their key objectives.
Some advanced data profiling steps that can be done next include checking foreign key integrity (since not all data warehouses have explicit foreign key relationships built between tables), data binning, and finding duplicate data.
Checking Foreign Key Integrity
Why is checking foreign key integrity important? It ensures that the data obtained from the source does not contain “orphaned” or missing connections. The data you receive might just be in the form of an Excel file or a database. That Excel file or database itself might ingest data feeds from other Excel files, separate databases, or APIs. Quite often, data fed from external sources can cause a loss of data integrity. Example: If there are sales_order and sales_order_line tables, the sales_order table should ideally contain a unique identifier column, such as an ID or order number. Since the sales_order_line contains items from a sales order, it must also have the unique identifier of that sales order to know which order the item belongs to. However, when data integrity is lost, a sales_order_line record might exist without a corresponding sales_order. From a business perspective, this makes no sense, as an item cannot be ordered without an associated sales order number. Therefore, this check is vital.
The way to perform this check in Metabase is as follows:
- Create a new Question or open your
sales_order_linetable directly via Browse Data. - Perform a Left Join from the
sales_order_linetable to thesales_ordertable using the primary key and foreign key. - After joining, apply a Filter on the
sales_ordertable, select the ID or order number column, and choose Is Empty. - Click Visualize to display the data.
Ideally, this question should return no results, indicating that all your sales order lines have a matching sales order. However, if any records do appear, this should be flagged and escalated to the party responsible for providing the data.
Binning Data
Data binning is a method used to group data into specific range categories. The main purpose of data binning is to observe the distribution within the data. The most ideal use case I frequently use is analyzing the value per sales transaction in a business. The steps are as follows:
- Create a new Question or open your
sales_ordertable directly via Browse Data. - Select Summarize and choose Count. Group by the
amount/ total value column of a sales order. Next to the column name, a binning option from Metabase will appear, which defaults to “Auto binned”. Select this option. - Click Visualize to display the data.
The question should display the number of sales orders based on the transaction value ranges present in the data. From this data, you can gather many insights; for instance, if there is a high volume of orders within a low-value range, the business likely operates in retail, selling smaller/lower-value items, and the number of items per order is likely higher (more order lines). The volume of transactions will also be higher, which translates into a need for larger storage capacity to accommodate the data.
Finding Duplicate Data
Duplicate data is clearly a major problem when you want to analyze a business. For example, sales figures can become inflated due to duplicate orders, or the queue for monitoring unfulfilled orders might appear larger than it actually is. Here is how to find duplicate data within a table:
- If the
sales_ordertable has a primary key or a column that serves as a unique identifier for the record, simply Count and Group by that specific column. For example, if there is anorderidcolumn, count and group byorder id. - If the table does not have a primary key, you can Count and Group by as many columns as possible, or even all columns. This step might be easier to execute using SQL.
- After doing this, scroll down to apply a Filter to your count column, and filter for values Greater than 1.
If the question returns data, it will show records like the orderid alongside a count of 2 or more. This indicates that a specific orderid exists multiple times in the table across more than one record. Before concluding that this is definitely duplicate data, double-check that the column is indeed intended to be the unique identifier for that table, and ensure you haven’t selected the wrong table. For example, in a sales_order_line table, an orderid will naturally have more than one record because a single sales order can contain multiple items.
These are the advanced data profiling steps you can perform on your dataset. The key to understanding data and Metabase itself is to always spend time experimenting and not be afraid to break things. The more time you spend exploring, the better you will understand the data—and data-driven insights can easily surpass even the gut feelings of a business owner. Therefore, never underestimate the power of data, and happy exploring!
Interested in Getting Started?
While the steps presented above can be done directly in Metabase, they can be quite time-consuming. Simplify your data analysis process with our comprehensive services, designed to help you resolve issues and meet the data analytics needs of your company.