Why Is Data Modeling So Challenging – How To Data Model For Analytics
Photo by Shubham Dhage on Unsplash
Learning about how to data models from basic star schemas on the internet is like learning data science using the IRIS data set.
It works great as a toy example.
But it doesn’t match real life at all.
Data modeling in real life requires you fully understand the data sources and your business use cases. Which can be difficult to replicate as each business might have its data sources set up differently.
For example, one company might have a simple hierarchy table that can be pulled from Netsuite or its internal application.
Whereas another one might have it strewn across 4 systems, with data gaps, all of which need to be fixed to report accurately. So you’ll never really understand the challenge you will face when data modeling until you have to do it. Then you’ll start to understand how you should data model and all the various trad-offs you’ll have to make.
In this article I wanted to discuss those challenges and help future data engineers and architects face them head on!
So let’s dive in.
What Is Data Modeling?
There are plenty of definitions of data modeling that exist out there. For example:
Data modeling is the process of creating a simplified diagram of a software system and the data elements it contains, using text and symbols to represent the data and how it flows. Data models provide a blueprint for designing a new database or reengineering a legacy application. – TechTarget
Another great example came from my discussion with Joe Reis(Co-Author Of Fundamentals Of Data Engineering) where he defined data modeling as:
Organizing and standardizing data to facilitate believable and useful information and knowledge for humans and machines.
Now I’d say Joe’s focus is more the overall goal where as TechTarget describes the process. But both together should provide a good picture of what data modeling is.
Challenges You Might Face in Data Modeling
Data modeling is an evolving practice with diverse challenges.
Here are some of the most common challenges in current data modeling projects.
Companies Use Diverse Data Modeling Techniques
In the past, most companies used similar data modeling strategies, often relying on a few tested models such as the star or snowflake schemas for analytics. There were of course dogmatic and religious like debates between whether one should use Kimball (bottom-up) or Inmon (top-down). But overall, data modeling initially both had to appease the end-user as well as the limitations of physical servers. Thus, many data models were forced to adopt certain techniques either due to lack of space or compute.
Today, you encounter a wide range of data models, each functioning differently.
Some are forced due to the systems they are built on top of. For example, when I worked at Facebook we relied on an HDFS like storage layer which didn’t allow us to run updates on our tables. Thus, in order to track changes in data, we’d just store each date as a snap shot. Meaning that in order to see what occurred on a specific date, you had to filter by said date.
There are also discussions about One Big Table(OBT) vs Kimball. Perhaps a more humorous example of this is Query Driven Data Modeling.
But each of these has pros and cons and they must be considered before saying one solution is best over another.
Integrating Data
Data comes from diverse sources, making integration a significant challenge. At first glance, integrating data may seem straightforward as you create diagrams that show relationships between entities like Orders, Customers, Products, and Stores. However, the challenge lies in the fact that data can have multiple sources.
Take Customers, for instance. While you might assume all customer data comes from the same source, modern sales processes rarely function that way. You may need to integrate data from platforms targeting customers in different geographic locations. Additionally, platforms used for American sales may have different data sources and formats compared to those used for European sales.
Omnichannel sales strategies further complicate data modeling. A company might sell products on its e-commerce website, Amazon, and retail store sites. These sources could provide different types of data about buyers. Even if they collect the same data parameters, the formatting may not align perfectly. One system might place the customer’s order in the second field, while another might put the customer’s address in the second field. Without reliable integration, these conflicting fields can cause problems within your dataset.
It also can cause issues when it comes to reporting. This is why many data individuals joke when they reference, “what is a customer”. That can be defined differently by company and by department.
Integrating Data Challenge Example
Another example I constantly use is when I worked in my first role I was asked to integrate project management data with hourly tracking of contracts. Both points of data were in different sources. At first, when I was explained how the data was set-up, I assumed the problem would be simple.
But it wasn’t.
Both systems were supposed to have an ID that made it easy to join data sets. But one of the systems allowed the ID to be a free text field. I don’t think I need to say more. Basically, there was no real way to join the data in its current state.
In the end, it’s up to data teams and their business counterparts to not only integrate data from a technical perspective but also from a business process perspective.
Translating Human Requirements Into a Data Model
Data models do not exist in isolation; they serve a purpose for the company and its users. However, translating human requirements into a data model presents a significant challenge.
This aspect of data modeling is especially difficult because it requires effectively communicating with people to understand their requirements and then translating those requirements into a flexible data model that can evolve as needs change.
The challenge intensifies when company leadership pushes for a specific concept. For example, they may request consolidating all data into one giant table, which may not align with your data engineering knowledge. In such cases, you must find a way to appease leadership or convince them to use a model that better serves their needs. Neither option is easy to accomplish.
Getting Better at Translating Human Requirements Into Data Models
Translating human requirements into data models requires experience. Ideally, you can find a job that allows you to build on your basic data science skills. Building a good model involves asking the right questions and seeking answers. For example, must this model include a slowly changing dimension (SCD), or is it included out of habit or something an industry leader suggested?
Seek opportunities to work with highly experienced data professionals who can provide deeper insights into the questions you should ask before and during the data modeling process. The more you practice and receive feedback, the easier it will become. Still, expect to invest significant time in developing this skill; it only improves with practice.
Modeling Data Now and Into the Future
Change is inevitable, so anyone interested in data science must continuously learn new skills. What works well today may fall short tomorrow. Therefore, technical knowledge should be combined with creativity to find solutions to the challenges of data modeling.
However, this is no easy feat. Many companies rely on data consultants to help them develop models that work well in the present while providing flexibility for future needs. An experienced data modeling consultant can handle the most challenging aspects, making it easier for your team and company to thrive.
Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.
How to build a data pipeline using Delta Lake
Intro To Databricks – What Is Databricks
analytics Consulting data engineering data modeling Data Science