December 13, 2025
How to build a data quality control system: experience with dbt and Soda
At the active development stage, a team's primary efforts are usually focused on delivering functionality, but once a project moves into the maintenance phase, it quickly becomes clear that without systematic quality control, trust in the data begins to decline. A wide range of approaches and tools is available to ensure data quality within a project. One common option is SQL or Python validation scripts that are executed manually as needed or automatically on a schedule via orchestrators (such as Airflow, NiFi, Pentaho, and others). This approach is already used successfully across several projects in our company and effectively meets the key requirements for data-quality control.
There is also an entire class of specialised tools designed to formalise and automate data-quality checks. The most popular among them are dbt and Soda. This article discusses how these tools are used in real projects.
dbt is a versatile, open-source tool for building and managing data-transformation processes, and its capabilities extend far beyond quality checks. It enables the development of data models, dependency management, logic documentation, and automated creation of data marts. Data-quality control is one of its areas of application: dbt allows tests to be embedded directly into the transformation process, with checks described as declarative SQL tests. These checks fall into two categories: built-in (standard) and custom.
Built-in tests include:
- key uniqueness checks (unique);
- column completeness checks (not_null);
- validation of permissible attribute values (accepted_values);
- referential integrity checks (relationships).
Custom tests can be created in two ways. The first is to write a SQL query that returns rows not meeting expectations. If the query returns at least one row, the test fails. The second is to create a parameterised macro that can then be invoked from any model.
Example of creating simple checks using dbt
The example will be demonstrated using the PostgreSQL DBMS on a Linux operating system.
First, create a users table. Table structure:
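A minimal DDL sketch for such a table; the schema name edu and the exact column list are assumptions inferred from the checks described later in the article:

```sql
-- Hypothetical structure of the demo table used throughout this article.
CREATE SCHEMA IF NOT EXISTS edu;

CREATE TABLE edu.users (
    user_id  integer      PRIMARY KEY,
    email    varchar(255) NOT NULL,
    status   varchar(20)  NOT NULL   -- 'active', 'inactive' or 'banned'
);
```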
For dbt to work, Python must be installed. If it is not installed, run:
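On a Debian-based distribution this might look as follows (other distributions use their own package managers):

```shell
sudo apt update
sudo apt install -y python3 python3-pip python3-venv
```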
Next, create a virtual environment for working with dbt:
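For example (dbt-env is an arbitrary environment name):

```shell
# Create and activate an isolated environment for dbt
python3 -m venv dbt-env
source dbt-env/bin/activate
```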
If dbt is not installed, install it on the machine (in our case, for the PostgreSQL database):
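The PostgreSQL adapter is distributed as a dedicated package that pulls in dbt-core as a dependency:

```shell
pip install dbt-postgres
```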
Then initialise the dbt project:
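The init command scaffolds a new project; dq_demo is an arbitrary project name used in these examples:

```shell
dbt init dq_demo
cd dq_demo
```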
In the directory ~/.dbt/, create the file profiles.yml.
You must specify the database connection parameters in it:
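A sketch of profiles.yml with placeholder credentials; the top-level key (here dq_demo) must match the profile name in the project's dbt_project.yml:

```yaml
# ~/.dbt/profiles.yml (all connection parameters below are placeholders)
dq_demo:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_user
      password: dbt_password
      dbname: demo_db
      schema: edu
      threads: 1
```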
After creating the file, you can verify its validity using the command:
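dbt reads the profile and attempts a test connection:

```shell
dbt debug
```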
If everything is correct, the output will include:
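Abridged output of a successful run (the exact wording varies between dbt versions):

```text
Connection test: [OK connection ok]

All checks passed!
```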
Everything is now ready for creating test models. Inside the project folder, create the directory structure models/staging.
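The directories can be created in one step from the project root:

```shell
mkdir -p models/staging
```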
In the staging directory, create two files:
1) stg_users.sql model file:
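A minimal sketch of the model, reading the raw table directly; in a larger project this reference would normally go through dbt's source() mechanism:

```sql
-- models/staging/stg_users.sql
select
    user_id,
    email,
    status
from edu.users
```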
This file defines the connection to the “raw” table edu.users.
In the query, you can rename fields, change data types, or filter out unnecessary values.
As a result, a staging model stg_users is created, which can then be used by other models.
2) stg_users.yml test file:
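A sketch of the accompanying schema file with the standard dbt tests:

```yaml
# models/staging/stg_users.yml
version: 2

models:
  - name: stg_users
    columns:
      - name: user_id
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'banned']
```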
The test file describes the structure of the model and the automated data-quality checks (tests).
As shown, we have limited ourselves to standard checks, namely:
- The user_id field must be populated and unique.
- The email field must be populated and unique.
- The status field must contain values from the list ['active', 'inactive', 'banned'].
In dbt, you can create models for all tables that require validation. The main rule is: one model — one entity. If multiple tables need to be combined, a new model is created based on the existing ones.
Once all models have been created, they can be executed. Below is the command for running all models:
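Assuming the commands are run from the project root: dbt run materialises the models, and dbt test executes the checks declared in the YAML files:

```shell
dbt run    # build all models
dbt test   # run all tests defined for them
```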
If you want to run a specific model, the command looks like this:
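The --select flag narrows the run to a single model; here stg_users is the model created above:

```shell
dbt run --select stg_users
dbt test --select stg_users
```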
The output will contain a report on the status of the checks:
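The tail of a successful dbt test run looks roughly like this (abridged; test names and timings vary):

```text
Finished running 5 data tests.
Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
```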
Thus, we have looked at an example of a simple check in dbt.
Creating simple tests using Soda
Soda is a specialised open-source tool originally designed for monitoring and ensuring data quality. Its core idea is to allow teams to define quality rules in a simple declarative format without diving into programming.
Checks in Soda are defined in YAML files and, just like in dbt, there are both standard and custom checks.
Let us examine how to create checks in Soda, using the same users table.
To begin, install Soda for PostgreSQL:
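For PostgreSQL, Soda Core ships as a dedicated package:

```shell
pip install soda-core-postgres
```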
Create a project folder:
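For example (soda_demo is an arbitrary folder name):

```shell
mkdir soda_demo && cd soda_demo
```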
Then, inside this project folder, create the file connection.yml, where the connection parameters are defined:
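A sketch of connection.yml with placeholder credentials; the data-source name my_postgres is arbitrary, and exact key names can differ slightly between Soda Core versions:

```yaml
# connection.yml (all connection parameters below are placeholders)
data_source my_postgres:
  type: postgres
  host: localhost
  port: 5432
  username: soda_user
  password: soda_password
  database: demo_db
  schema: edu
```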
Next, create the checks file checks.yml:
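A checks.yml expressing the same rules we declared in dbt, written in SodaCL:

```yaml
# checks.yml: quality rules for the users table
checks for users:
  - missing_count(user_id) = 0      # user_id must be populated
  - duplicate_count(user_id) = 0    # ...and unique
  - missing_count(email) = 0        # email must be populated
  - duplicate_count(email) = 0      # ...and unique
  - invalid_count(status) = 0:      # status limited to the allowed list
      valid values: [active, inactive, banned]
```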
That is all. The only step left is to run the checks:
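The scan is pointed at the data source and the two YAML files; the name after -d must match the data_source name from connection.yml:

```shell
soda scan -d my_postgres -c connection.yml checks.yml
```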
The result will look something like this:
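A successful scan ends with a summary along these lines (abridged; formatting varies between Soda versions):

```text
Scan summary:
6/6 checks PASSED
All is good. No failures. No warnings. No errors.
```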
Conclusion
As we can see, both tools are fairly simple in terms of installation and use. In essence, each of them requires only two things: a file with the database connection and a file with the checks. The tools themselves are also very similar in terms of data-quality (DQ) validation. This raises the question: which one should you choose for your project? There is no definitive answer, but there are recommendations.
If most of the data work is concentrated in the transformation layer (for example, when building data marts and analytical models), then dbt lets tests be embedded naturally into the existing pipeline and can even halt it when a test fails. It also keeps development, documentation, and quality control within a single ecosystem.
If, however, the team wants to implement an independent monitoring layer that tracks data quality across different systems and at various stages of processing, then Soda is the better choice. It allows checks to be described flexibly (including specific rules such as: “the number of orders with a negative amount must not exceed 0.1%”), run independently of ETL processes, and integrate the results with alerting systems.
Soda is also preferable when you work with multiple systems, not only dbt. Soda can connect to Postgres, Snowflake, MS SQL, BigQuery, and others, without relying on dbt.
A combined approach is also possible: basic checks can be implemented in dbt, while Soda is used for broader monitoring and data-quality analytics. This approach helps strike a balance between transparency, convenience, and the depth of control — though it does imply that more checks will need to be created across different tools.