may 13, 2026
6 key mistakes in DWH design
Data Warehouse (DWH) design lays the foundation for corporate analytics. Errors at this stage lead to slow report performance, complex maintenance, and rising costs, as well as a loss of trust in the product. In our article, we will examine typical mistakes, their consequences, and ways to avoid them as early as the data warehouse design stage.
1. Chasing data without business goals
The most expensive mistake is made even before implementation begins, when the design phase starts not with the question «What do we want to analyse?» but with the question «What data do we have?».
And the worst-case scenario is when, during the discussion of which data to load into the warehouse, the conclusion is reached: «let’s load everything».
With this approach, the warehouse turns into a «data junkyard». According to IBM’s estimate, companies use only about 12–15% of the data available to them, while the remaining 85–88% remain unclaimed for analytics. I am ready to confirm this from my personal practice. The result is disappointing: resources are spent, and the business does not get what it wanted.
Technical consequences
- The warehouse gradually expands and becomes slow due to useless data that is not needed for reports.
- It is difficult to assess the quality of information, as there are no clear criteria by which to consider it incorrect.
- Additional infrastructure costs — the more unnecessary data is loaded into the warehouse, the more expensive its maintenance will be.
Solution
Define:
- What KPIs does the business need?
- What business events form them?
- What attributes are needed to analyse these events?
Such seemingly simple recommendations are guaranteed (according to our experience) to help avoid overloading the warehouse with unnecessary information and make it faster and cheaper to maintain.
2. Choosing an unsuitable architecture
The second critical mistake lies in choosing an architecture that is determined not by rata Vault» simply because they worked with it at a previous job, or chasing trends: «everyone is moving to lakehouse, so we want it too».
At the same time, key parameters are ignored, such as:
- The nature of queries (ad-hoc analytical queries or regulated reports);
- Data latency requirements (the time it takes for data to travel from the source to the analytics);
- The volume and frequency of source changes;
- The maturity level of the team (the ability to develop and the readiness to maintain a complex model).
Technical consequences
- The architecture begins to hinder development.
- Any change requires significant development effort.
- Data readiness time for analytics increases.
- The system becomes fragile: local changes cause cascading effects.
Solution
Always choose the project architecture based on requirements. Rely not on a single approach, but on an architecture with tiered separation and a combination of models. Different tasks require different models, and an attempt to solve everything with a single paradigm almost always leads to system overcomplication. Attempting to use one model for all tasks (for example, «Data Vault» right up to the report or a «star schema» as an integration layer) almost always leads to a loss of performance and manageability.
3. Transferring row-based practices to columnar databases
The transition from row-based DBMS (PostgreSQL, Oracle) to columnar ones (ClickHouse, Greenplum, Vertica) requires a paradigm shift. Errors here are fatal for performance. Primarily, this concerns the choice of distribution keys and data types.
Technical consequences
The main problem of any MPP warehouse is data skew. When data is distributed unevenly, one cluster node works for two, killing performance. Another serious problem is using strings instead of numbers in joins, which creates colossal overhead costs.
Solution
- Before starting a project on new technologies, study all their strengths and weaknesses, and analyse how well they align with your business tasks.
- When choosing a distribution key, give preference to unique values. The ideal option is to use a technical row identifier or a composite key in large tables.
- Avoid direct joins on text fields in MPP environments. If business logic requires joining by a string identifier, create a reference dimension at the data preparation stage that will replace the text key with an integer surrogate key for use in further joins.
4. Delayed design of access control rules
Very often, security and access management are designed at the last moment, frequently only after the code has been written.
Technical consequences
- Process dependency on specific employees. Data exchange configuration is performed under a developer’s personal account. If such an employee leaves, access is lost, and production workflows come to a halt.
- Information leakage. Due to the lack of a clear rights matrix, employees may gain access to data not intended for their scope of tasks.
Solution
- Adhere to the rule of using service (technical) accounts for all automated processes, even under tight deadlines.
- Formulate an access matrix as early as the architecture design stage, rather than during implementation.
5. Ignoring CI/CD and test automation at the start
In pursuit of quick results, teams often skip the stage of setting up deployment pipelines. It seems that nothing is wrong if, at the beginning of the project, we apply scripts manually. After a while, this becomes the norm, and manual management becomes a huge problem for the project.
Technical consequences
- Fear of deploying changes to production. No one knows which dependencies will break. Releases are delayed and accumulate, and when they do happen, everyone tensely waits for the daily load to find out if something broke this time or not.
- The issue of rollbacks. If a release breaks production, rolling back without CI/CD is painful. Searching for dumps, remembering which scripts were applied, and writing rollback scripts under the cover of night.
- Lack of automated testing. Changes are checked manually or not checked at all. As a result, errors in the data are discovered only in reports, regression sets in, and there is no guarantee of correctness after changes.
Solution
- From day one, store everything related to data (SQL scripts, procedures, dbt configs) in Git.
- No «hotfixes» in the production database.
- Set up a pipeline that, upon a commit to «main», automatically applies changes to the test environment, and at the push of a button — to production.
- Implement automated quality checks.
- Do not neglect code review. Make it a rule that any change is checked by a second developer.
6. Lack of project documentation
Documentation is often perceived as a secondary task and is postponed «for later». In reality, this leads to the code and individual developers becoming the only source of knowledge about the system.
Technical consequences
- Knowledge is concentrated in people rather than in the system. The departure of a key developer leads to the loss of critical information regarding logic, sources, and dependencies.
- Complexity of making changes. Before any refinement, it is necessary to figure things out «from scratch»: which tables are involved, how a metric is calculated, and where transformations are applied.
- Long onboarding time for new employees.
- Lack of transparency for the business or the client.
Solution
- Documentation must be part of the architecture, not a side artifact.
- Try to automate documentation generation wherever possible.
- Include the verification of document relevance in the code review.
Conclusion
In our article, we have examined the main mistakes in data warehouse design. Of these six mistakes, the first two kill a project even before any code is written. If you do not understand why the business needs the data and you choose an architecture «because it is trendy», there is no point in continuing further. The remaining four are about how not to kill the project in the process.
A reliable warehouse is built not around technologies, but around principles:
- Design driven by business tasks;
- Conscious choice of architecture;
- Account for DBMS specifics;
- Access and security control;
- Automation of delivery and testing;
- Transparency through documentation.
If these principles are established at the start, the system remains manageable during growth. If not, the cost of changes begins to grow exponentially, and trust in the data is gradually lost.