Data Warehousing Essentials: A Guide To Data Warehousing
Photo by Tiger Lily
Data warehouses and data lakes play a crucial role for many businesses.
It gives businesses access to the data from all of their various systems. As well as often integrating data so that end-users can answer business critical questions.
But if we take a step back and only focus on the data warehouse, what is it anyway? And why do companies invest so much into data warehouses?
In order to answer that question let’s start by going over Kimball’s goal of Data Warehousing(and to be clear there are plenty of other authors and individuals you can dig into that discuss data warehousing and dimensional modeling).
The Goal of a Data Warehouse
Before we delve into the details of modeling and implementation, it is helpful to focus on the fundamental goals of the data warehouse. The goals can be developed by walking through the halls of any organization and listening to business management. Inevitably, these recurring themes emerge:
-“We have mountains of data in this company, but we can’t access it.”
-“We need to slice and dice the data every which way.”
-“You’ve got to make it easy for business people to get at the data directly.”
…
-“It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers.”
-“We want people to use information to support more fact-based decision-making.”
These goals have been echoed for decades. In fact, I am sure if you read a few articles written in the past few months, some of these points will come out as if they were being listed for the first time.
It’s almost as if we are in a continuous cycle in the data world, never truly escaping the same set of problems.
But let’s get past these problems for now and look at an example of a design decision you might need to make when you are focusing on using dimensional modeling.
What Is A Data Warehouse Really?
A data warehouse is a centralized repository designed to store integrated data from multiple sources. Unlike regular databases focused on handling transactions, a data warehouse is built to aggregate historical data. Data warehouses are also developed to facilitate business intelligence (BI) activities and reporting. In addition, you’ll often see data warehouses used for data science and ML uses cases. Although some companies will use data lakes for this as well.
Data Lake vs. Data Warehouse
As mentioned earlier the terms ‘data lake’ and ‘data warehouse’ are often mentioned in the same breath. Yet they serve distinct purposes and cater to different needs within an organization’s data strategy.
- Data Lake: A data lake is essentially a folder system. The data inside said system can range from unstructured, semi-structured to structured. It is highly flexible and capable of storing data from diverse sources, including IoT devices, social media, and more. Data lakes are ideal for storing data that doesn’t fit neatly into tables. It’s also great for building POCs for machine learning engineers instead of waiting for data engineers to build we
- Data Warehouse: In contrast, a data warehouse stores data that has been processed, structured, and refined for specific purposes, primarily analytics and reporting. Data warehouses are designed with performance and data integrity in mind, ensuring that data is consistently formatted and easily accessible.
The choice between a data lake and a data warehouse depends on the specific needs and strategies of an organization, including the types of data they collect and how they intend to use it.
Examples Of Cloud Data Warehouses
The evolution of cloud computing has significantly impacted data warehousing, offering scalable, cost-effective solutions that cater to the needs of modern businesses. Here are some of the leading cloud data warehouse platforms:
Redshift
Amazon Redshift, part of the Amazon Web Services (AWS) ecosystem, is a fully managed, petabyte-scale data warehouse service. Redshift is designed for high performance and scalability, using columnar storage and data compression to efficiently handle complex queries across large datasets. It integrates seamlessly with other AWS services, providing a robust solution for businesses looking to leverage their data for analytics.
BigQuery
Google BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility. It allows users to analyze petabytes of data using SQL-like queries. BigQuery’s serverless architecture means that users can focus on analyzing data to find meaningful insights using familiar SQL and not have to worry about infrastructure or server management.
Snowflake
Snowflake stands out in the cloud data warehousing space for its unique architecture and approach to data storage and analytics. Unlike traditional data warehouses that might struggle with scalability and concurrent workloads, Snowflake utilizes a multi-cluster, shared data architecture that separates storage and compute functions. This allows for almost infinite scalability and the ability to run multiple workloads without contention. Snowflake’s design supports a broad spectrum of data workloads, including data warehousing, data lakes, data engineering, data science, and more, all within a single platform. It also offers robust support for various data integration, business intelligence, and analytics tools, making it a versatile choice for organizations looking to consolidate their data analytics operations in the cloud.
Database vs. Data Warehouse
Both databases and data warehouses serve as repositories for data. But, they are optimized for different purposes, which affects their design, functionality, and use cases.
- Databases: Primarily designed for transaction processing, databases are optimized for CRUD operations (Create, Read, Update, Delete) and are structured to ensure the integrity and consistency of data in real-time. They are typically used for day-to-day operations within applications, such as recording transactions, customer information, and other operational data. Databases are built for speed and efficiency in handling transactions but are not optimized for analytical queries that involve large volumes of data.
- Data Warehouses: Conversely, data warehouses are specifically designed for query and analysis. They are optimized for read operations and are structured to make data retrieval for analytics as efficient as possible. Data warehouses support complex queries, aggregations, and joins across large datasets, making them ideal for business intelligence, reporting, and data analysis. Unlike databases that prioritize normalization to minimize data redundancy, data warehouses often denormalize data schemas to optimize query performance.
If you’re looking to set up a data warehouse in 2024, then please feel free to reach out!
Set up a free consultation today!
Diving Deeper Into What Makes Up A Data Warehouse
Data Warehouse Tables
Fact Tables
“A fact table is the primary table in a dimensional model where the numerical performance measurements of the business are stored.” -Ralph Kimball
These tables contain the quantitative data for analysis and are typically transactional. For example, these transactions often represent sales, shipments, calls made, clicks, etc. If you’re keeping to a very strict data warehouse approach(most data warehouses I see don’t) then fact tables generally have two types of columns: measures and foreign keys to dimension tables.
Measures are the numerical data (such as quantity sold, revenue, etc.) that analysts want to sum, average, or perform other calculations on. The foreign keys are the connections to the dimension tables.
Dimension Tables:
“Dimension attributes serve as the primary source of query constraints, groupings, and report labels. In a query or report request, attributes are identified as the by words. For example, when a user states that he or she wants to see dollar sales by week by brand, week and brand must be available as dimension attributes.” -Ralph Kimball
These tables are descriptive attributes related to fact data. They provide context to the data in the fact tables, such as time, location, product details, customer information, etc.
Dimension tables are often denormalized. This means they might contain redundancy and usually include a wide variety of attributes to allow for flexible analysis. For example, a product dimension table might include not just an ID and name but also category, size, color, and other attributes that could be useful for analysis.
Bridge Tables (or Link Tables)
“These are used in “many-to-many relationships” between dimensions. For instance, if you have a scenario where multiple products can be in different promotions at the same time, a bridge table would be used to manage this relationship. “ -Ralph Kimball
Another common example used in most articles and in Kimballs’ books is healthcare and a patient’s diagnosis. This is because a patient’s diagnosis can have more than one diagnosis at a time. In turn, you’ll often see a bridge table used (See the model below).
Many data modelers try to avoid implementing too many bridge table situations as they can add a lot of risk in terms of miscounting or joining across the tables.
Role-Playing Dimension Tables
“Role-playing in a data warehouse occurs when a single dimension simultaneously appears several times in the same fact table.” -Ralph Kimball
A single physical dimension can be referenced multiple times in a fact table, playing different roles. For instance, a date dimension might be used in one fact table as the order date, shipping date, and delivery date. This is likely the most common example of a role playing dimension. That being said, most people probably still just reference it as a dimension table.
Example of Data Modeling Decisions
When you’re designing your data analytics platform(there are so many types these days that it’s hard to just write “data warehouse), there are a lot of important design decisions that you’ll need to make. These decisions can lead to better or worse performance, unexpected costs, and change the general usability of your data.
One example that you’ll likely run into as a data professional is having to data model a header and line item table.
This situation happens all the time; if you’ve got invoices in your data warehouse, then you’re likely very familiar with the concept of headers and line items.
So how do you model this data? There are two methods outlined below. In the first example your data is normalized creating both a line item and header fact table.
The line item and header will need to join together to get all the information in this method. This approach is often discouraged, at least by Kimball who points out several issues caused by this modeling pattern. In particular, joining two large fact tables is computationally expensive and might not return depending on the database you’re running on.
Instead, his suggestion is to denormalize the two tables into one. With this design, you’ve now got to be wary of double counting the invoice number as it might show up multiple times for each line.
But as I said before, this is a decision that shouldn’t just be made. I have seen both work successfully.
Now just in case some of these terms are new to you, such as fact table, I did want to dedicate some of this newsletter to discussing the basics of different types of tables in a dimensional model.
Design Decisions in Data Modeling for Data Warehouses
Data modeling for data warehouses is a critical process that involves structuring data in a way that optimizes storage, retrieval, and analysis. The following sections cover important design decisions that impact the efficiency, scalability, and usability of a data warehouse.
Choosing the Right Data Modeling Approach
- Star Schema: The star schema is a popular choice for data warehousing due to its simplicity and efficiency in handling queries. The decision to use a star schema involves organizing data into a central fact table that references multiple dimension tables. This schema is particularly effective for straightforward queries and is easily understandable by non-technical users.
- Snowflake Schema: The snowflake schema is a variant of the star schema where dimension tables are normalized into multiple related tables. This approach reduces data redundancy and improves data integrity but can lead to more complex queries and potentially slower performance. The decision to use a snowflake schema may be influenced by the need to conserve storage space or maintain strict normalization standards.
- Galaxy Schema: Also known as a fact constellation schema, this approach involves multiple fact tables that share dimension tables. It is suitable for more complex data warehousing scenarios where different business processes need to be analyzed together. The decision to adopt a galaxy schema hinges on the complexity of business analytics needs and the relationships between different datasets.
Determining the Granularity of Fact Tables
- Granularity: The level of detail stored in fact tables is a critical design decision. Higher granularity (more detailed data) allows for more detailed analysis but requires more storage space and can impact query performance. Lower granularity (less detailed data) conserves space and improves performance but may limit the depth of analysis. The choice of granularity should be based on the business’s analytical needs and the trade-offs between performance and analytical depth.
Balancing Normalization and Denormalization
- Normalization: Involves organizing the data to reduce redundancy and improve data integrity. While normalization is beneficial for operational databases, excessive normalization in a data warehouse can lead to complex, slow-performing queries.
- Denormalization: The process of combining data to reduce the number of joins needed during queries. While denormalization can improve query performance in a data warehouse, it can also lead to data redundancy and potential inconsistencies. The decision to normalize or denormalize data in a data warehouse involves balancing query performance with storage considerations and data integrity.
Designing for Scalability and Performance
- Indexing and Partitioning: Decisions around indexing and partitioning fact and dimension tables can significantly impact query performance. Effective use of indexes can speed up data retrieval, while partitioning can help manage large datasets by breaking them down into more manageable pieces.
- Aggregates and Materialized Views: Creating aggregate tables or materialized views for frequently queried summaries can enhance performance. This decision depends on the specific queries that users will run most often and the need to balance storage costs against performance gains.
Integrating Data Quality and Governance
- Data Quality Measures: Incorporating data quality checks and validations into the data modeling process is crucial to ensure the reliability of analyses. Decisions around data cleansing, validation rules, and error handling mechanisms are essential to maintain high-quality data in the warehouse.
- Data Governance: Establishing a data governance framework that includes metadata management, data lineage, and access controls is vital for maintaining the integrity, security, and compliance of the data warehouse.
Things You Might Still Want To Know About Data Warehouses
Building A Data Warehouse: Key Considerations
Embarking on the journey to build a data warehouse requires careful planning and consideration of several key factors:
- Data Integration: Assessing the various data sources and planning the data integration process is crucial. This includes establishing ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines to collect, clean, and consolidate data from disparate sources into the data warehouse.
- Schema Design: Choosing between a star schema, snowflake schema, or other data modeling techniques will impact the flexibility and performance of the data warehouse. This involves structuring the data into fact and dimension tables to support the types of queries and reports the business needs.
- Storage and Compute Resources: Depending on the chosen platform (e.g., Redshift, BigQuery, Snowflake), consider how storage and compute resources are managed. Some platforms offer on-demand scaling, while others may require upfront provisioning.
- Security and Compliance: Ensuring the data warehouse meets the necessary security standards and regulatory compliance requirements is paramount. This includes data encryption, access controls, and audit logging.
- Cost Management: Understanding the pricing models of cloud data warehousing solutions and implementing best practices for cost management to avoid unexpected expenses.
Data Warehouse Best Practices
To maximize the value of a data warehouse, consider the following best practices:
- Data Quality: Implement processes to ensure the accuracy, completeness, and consistency of data entering the data warehouse. Poor data quality can lead to misleading analysis and business decisions.
- Performance Optimization: Regularly monitor and optimize queries and data structures to maintain high performance, especially as data volumes grow and query complexity increases.
- Scalability: Plan for future growth by choosing a data warehousing solution that can scale easily as data volumes
- Security: Apply stringent security measures, including encryption, role-based access control, and regular security audits, to protect sensitive data.
- Documentation and Governance: Maintain thorough documentation of the data models, ETL processes, and data lineage to support data governance and compliance efforts.
Data Warehouse Vocabulary
Navigating the world of data warehousing can be challenging, especially with the specialized vocabulary used in this domain. Understanding these terms is crucial for professionals involved in data management, analysis, and decision-making processes. Here’s a glossary of key data warehouse terms:
- Data Warehouse (DW): A centralized repository designed to store, organize, and manage large volumes of structured data from multiple sources for querying and analysis.
- Data Mart: A subset of a data warehouse focused on a specific business line or department, such as sales, finance, or marketing. Data marts are designed to meet the specific needs of a particular group of users.
- ETL (Extract, Transform, Load): A process by which data is extracted from various sources, transformed into a format suitable for analysis, and loaded into a data warehouse. The transformation process may include cleaning, aggregating, and restructuring data.
- ELT (Extract, Load, Transform): An alternative to ETL, where data is first extracted and loaded into the target system, and transformations are performed afterward. This approach leverages the processing power of modern data warehouses to transform data.
Schemas
- Star Schema: A simple database schema used in data warehousing characterized by a central fact table surrounded by dimension tables, resembling a star shape. This schema is optimized for query performance in a data warehouse.
- Snowflake Schema: An extension of the star schema where dimension tables are normalized, breaking them down into additional tables to eliminate redundancy. This schema can lead to more complex queries but saves storage space.
- OLAP (Online Analytical Processing): A category of software tools that provide analysis of data stored in a database. OLAP tools enable users to conduct multidimensional analysis, allowing for complex calculations, trend analysis, and data modeling.
- BI (Business Intelligence): The strategies and technologies used by enterprises for data analysis and business information management. BI encompasses a wide range of tools, applications, and methodologies that enable organizations to collect data from internal and external sources, prepare it for analysis, develop and run queries, and create reports, dashboards, and data visualizations.
Stay Tuned For More
Data modeling still plays an important role in today’s modern data world, and the decisions you make impact the cost of computing, the data’s usability, and more. With that, I hope this refresher was helpful.
If you want me to write more about data modeling in the future, comment below and let me know!
Thanks for reading.
Also, if you’re looking to cut your data infrastructure costs in 2024, then please feel free to reach out!
Set up a free consultation today!
Thanks for reading! If you’d like to read more about data engineering, then check out the articles below.
Normalization Vs Denormalization – Taking A Step Back
4 Alternatives to Fivetran: The Evolving Dynamics of the ETL & ELT Tool Market
What Is Change Data Capture – Understanding Data Engineering 101