After discussing Extract yesterday, we will now discuss the next process in ETL: Transformation. Data transformation serves to organize an organization’s data into a more structured and meaningful format. This is a crucial step for any organization to properly analyze their data. One popular tool for performing data transformation is Pentaho Data Integration (PDI). In this article, we will explore the various functions that can be used in Pentaho to design effective data transformations.
Important Functions in Data Transformation
The transformation process involves processing data from its original format into one more suitable for analysis and storage purposes. Users can utilize the tools provided by Pentaho to design and customize data transformations. Transformations can include data cleaning, changing data types, merging data, and other steps to ensure improved data quality. Check out the essential functions we can use within Pentaho!
- Mathematical Functions: Pentaho provides various mathematical operation functions that can be used for data transformation. Users can create complex expressions or use built-in functions to manipulate data as needed. We can use the “Calculator” step to perform mathematical operations on columns.
- Expression Examples:
- Calculating total transactions:
transaction_amount + previous_total_transactions - Rounding transaction amounts:
ROUND(transaction_amount)
- Calculating total transactions:
- Expression Examples:
- Filtering: Users can define filtering criteria to include or ignore specific data during the transformation process. This allows us to focus on more relevant data subsets. We can filter data specifically for transactions over $100 by using the “Filter Rows” step to exclude rows that do not meet certain criteria.
- Expression Example:
transaction_amount > 100
- Expression Example:
- Data Validation: Pentaho allows users to incorporate validation steps into data transformations to ensure data integrity and quality. Data validation can be added using the “Filter Rows” or “Validator” steps.
- Expression Example: Ensuring that critical columns are not empty, or that document statuses are strictly set to ‘Completed’.
- Merging Data (Join): If we want to combine customer data with sales transaction data based on the customer ID, we can use the “Merge Join” or “Database Join” steps in PDI to perform this merge.
- SQL Expression Example:
SELECT customers.customer_id, customers.customer_name, sales.transaction_id, sales.transaction_amount FROM customers JOIN sales ON customers.customer_id = sales.customer_id - Column Operations: Pentaho allows us to modify data columns as needed. For example, converting date formats or combining two columns into one.
- Expression Example: Using the “Select Values” step to change the column format from number to text.

Main Functions Commonly Used in Data Transformation
Let’s discuss Transformation in more detail by looking at the expressions or functions we can use in Pentaho Data Integration (PDI).
1. String Functions
- UPPERCASE and LOWERCASE: Converts text to uppercase or lowercase letters.
- SUBSTRING: Extracts a portion of text from a string.
- REPLACE: Replaces specific values within a string.
- CONCAT: Combines multiple strings into one.
2. Numeric Functions
- ROUND and CEIL: Rounds numeric values up or down.
- ABS (Absolute Value): Returns the absolute value of a number.
3. Date Functions
- NOW: Returns the current date and time.
- DATE_DIFF: Calculates the time difference between two dates.
- Date Formatting: Formats dates into specific formats.
4. Data Type Conversion Functions
- CONVERT and CAST: Converts data types from one form to another.
5. Logical Functions
- IF-ELSE: Standard branching logic.
- CASE WHEN: Complex branching logic.
- COALESCE: Replaces null values with a default value.
6. Aggregation Functions
- SUM, AVG, MIN, MAX: Aggregation operations on numeric columns.
- GROUP_CONCAT: Combines values within a single column.
7. Additional Functions for Complex Data Transformation
- Data Analysis Functions: LAG, LEAD, RANK, DENSE_RANK, ROW_NUMBER.
- Pivot and Unpivot Functions: Alters data structures.
- Windowing Functions: Calculates accumulated totals within specific windows.
By understanding and utilizing the various functions above, Pentaho users can design flexible and efficient data transformations. However, it is important to always consider specific business needs and the type of data being processed in every transformation process.
Editor’s Note
In 2019, Matt Casters, the creator of Kettle Pentaho Data Integration, announced a new project called Apache HOP, which is a fork of Kettle. This project leans more towards open-source, and by becoming a top level project in the Apache Software Foundation, we have decided to continue with Apache HOP, which aligns better with our vision as open-source prac