How To Data Model – Real Life Examples Of How Companies Model Their Data

May 31, 2024

How companies model their data varies widely.

They might say they use Kimball dimensional modeling. However, when you look in their data warehouse, the only parts you recognize are the words "fact" and "dim".

Over nearly a decade, I have worked for and with different companies that have used various methods to capture this data. I wanted to review some of the techniques that are commonly used to model data for analytics.

This is part of my unofficial series on data modeling, so if you’d like to learn more, then you can check out some of the prior articles such as this one.

But for now, let’s dive in.

Tracking Changing Dimensions

If you've queried a data warehouse before, there is a good chance you've come across slowly changing dimensions. Perhaps you didn't realize it, but if you saw some combination of start_effective_date, effective_date, and end_date columns, that's likely what you were seeing.

Now in the wild, I have really mostly seen SCD (slowly changing dimension) Type 2 and some hybrid versions of Type 6. But mostly Type 2.

Type 2 inserts a new row for each change and captures when those changes occur.

The common example I use to explain how SCD Type 2 works is if you want to report on the number of days a person works in a specific role.

From a business standpoint you might want to answer how long it takes for different teams to promote an individual. In order to do so, you need to know when a person changed from one level or role to another.

So you might store your data like the table below:

[Image: example SCD Type 2 table tracking an employee's job title with start and end dates]

I have seen Type 2 implemented both with NULLs and with a default end date, often 9999-12-31. The nice thing about a default date is that it's simpler when writing a BETWEEN clause.

This captures when certain changes happen and how long they last. This means you can safely report on the question we posed earlier (how many days on average does it take for an employee to switch roles).
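
To make that concrete, here is a minimal sketch of what the reporting query might look like. The table name (employee_dim) and its columns (employee_id, job_title, start_date, end_date) are assumptions for illustration, and I'm assuming the 9999-12-31 default end date mentioned above. Date-difference syntax varies by warehouse.

-- Hypothetical SCD Type 2 dimension: one row per employee per job_title "version".
-- Average number of days an employee spends in a role before it changes.
SELECT
    job_title,
    AVG(DATEDIFF(day, start_date, end_date)) AS avg_days_in_role  -- or end_date - start_date, depending on the warehouse
FROM employee_dim
WHERE end_date <> '9999-12-31'  -- only closed-out rows, i.e. completed stints in a role
GROUP BY job_title;

-- Point-in-time lookup: which role did each employee hold on a given date?
SELECT employee_id, job_title
FROM employee_dim
WHERE '2024-01-15' BETWEEN start_date AND end_date;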

There is another type of SCD called Type 1, which simply updates the information by replacing the old information. The problem here is that you lose the prior information, meaning you can no longer report on it accurately. In our example above, you'd just replace John's job_title with data engineer and perhaps change an update_date column (if that exists). But you lose the fact that John was a data analyst.
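
For contrast, a Type 1 change is just an in-place overwrite. A minimal sketch, using the same hypothetical employee_dim table:

-- SCD Type 1: overwrite the old value and keep no history.
-- After this runs, there is no record that this employee was ever a data analyst.
UPDATE employee_dim
SET job_title   = 'data engineer',
    update_date = CURRENT_DATE  -- only if such a column exists
WHERE employee_id = 1;          -- hypothetical ID for John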

In turn, that’s why I’ve mostly come across SCD type 2. But it’s not the only way to capture data that changes.

Facebook And Date Partitions

When I first started working at Facebook, one of the differences I noticed was that, unlike most other companies I had worked at, they tended to take snapshots of dimensional information every day (honestly, they also used the term "dimension" loosely). In fact, I have noticed that many tech-focused organizations reference something similar, though not everyone is bought in.


For those new to the idea of daily partitions, you can imagine it looked like the image below.

[Image: example of a dimension table snapshotted daily with a ds partition column]

This does mean that you need to join tables on their IDs and ds columns. So if you wanted to know where a customer or an employee lived or what job they had when they clicked a page or bought a product, you could. But you have to make sure you have a WHERE clause that filters for the date you’re interested in.

So instead of dealing with the traditional slowly changing dimension approach, where you'd have to use a BETWEEN clause, you could just join on the ds or dt column (those were the common names for said column).
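
Here is a rough sketch of that kind of join. The table names (page_clicks, employee_daily_snapshot) and columns are assumptions for illustration:

-- Join event data to the dimension snapshot taken on the same day.
-- ds is the daily partition column; filtering on it keeps the scan to a single partition.
SELECT
    c.employee_id,
    c.page_url,
    e.job_title,
    e.city
FROM page_clicks AS c
JOIN employee_daily_snapshot AS e
    ON  e.employee_id = c.employee_id
    AND e.ds = c.ds               -- same-day snapshot instead of a BETWEEN range
WHERE c.ds = '2024-05-01';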

With this method there are challenges; in fact, we occasionally still had to create tables with a more traditional SCD Type 2 approach. But most use cases worked pretty well using the partition method.

Tracking Events In Big Tech

When it comes to handling event data, you might assume it’s not that hard.

After all, it’s a simple situation. For every new event, simply insert the data, right?

Maybe you have to delete some of the prior data and then insert, but there isn't much need for complex updating and merging.

That all works great until your data gets so large it takes too long to compute or maybe you’re just looking to improve the performance of some dashboards. Then suddenly, you need to rethink how you manage event data.

This has been discussed in several other pieces of content, but data structures such as datelists stored in a single field are an excellent way of reducing processing costs (the trade-off often being that you lose granularity).

I really enjoyed the piece from Connor Skennerton where he reviewed a talk given by Roblox engineers who were using a datelist for that purpose. Here is a quick snippet.

In his talk, Yan and William gave the example of scanning over a raw fact table that had 10 TB of data generated per day. Scanning all the historical data every time a query was run required looking at petabytes of data. In comparison, using datelist table they only needed to scan through 10TB + ~0.5TB per day. – Datelist tables at Roblox Data Engineering Meetup

But what does this look like in real life?

Data Modeling Example – Aggregating Data

Let's go over the example below. As you can see in the table below, this could be viewed as a standard page visit style table, where each visit to a URL is captured with a user_id and when the event occurred (you'd also likely capture other bits of information such as time spent on page, an event ID, etc.). For now, we want to focus on how this data can be reshaped.

[Image: example page visit fact table with user_id, url, and when the event occurred]

You can take that lower-level data and aggregate it into a table that counts page views by day. After all, many high-level dashboards don't need all the information, such as exactly which pages were viewed.
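
A quick sketch of that roll-up, assuming the raw table is called page_visits with a user_id, url, and an event timestamp:

-- Daily aggregate: drop the URL-level detail and keep page views per user per day.
SELECT
    CAST(event_ts AS DATE) AS visit_date,
    user_id,
    COUNT(*)               AS page_views
FROM page_visits
GROUP BY CAST(event_ts AS DATE), user_id;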

Data Modeling Example – Creating A Datelist

Now if you wanted to take this one step further, you could use a datelist to aggregate all of that information at the user level, so that you only have one row per user. As referenced above, this becomes very useful on very large data sets because you only need to look at the prior date partition and whatever users were active in the new date partition, and merge those. You'll never have to pull more historical rows of data than you have users. So instead of dealing with tens if not hundreds of billions of rows, or as referenced above, petabytes, you can drastically reduce the number, and in turn reduce how much data actually needs to be processed.
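
Here is a rough sketch of that incremental merge. The table names, the ds partition convention, and the array syntax (shown Presto/Trino-style here) are all assumptions and will vary by warehouse:

-- Build today's cumulative datelist partition from yesterday's partition plus today's activity.
-- user_activity_datelist: one row per user; active_dates is an array of the dates the user was active.
-- daily_active_users: assumed to hold only the new day's distinct active user_ids.
INSERT INTO user_activity_datelist
SELECT
    COALESCE(y.user_id, t.user_id)                            AS user_id,
    CASE
        WHEN t.user_id IS NULL THEN y.active_dates            -- not active today, carry history forward
        WHEN y.user_id IS NULL THEN ARRAY[DATE '2024-05-02']  -- brand-new user
        ELSE y.active_dates || ARRAY[DATE '2024-05-02']       -- append today to the existing datelist
    END                                                       AS active_dates,
    DATE '2024-05-02'                                         AS ds
FROM (SELECT * FROM user_activity_datelist WHERE ds = DATE '2024-05-01') AS y
FULL OUTER JOIN daily_active_users AS t
    ON t.user_id = y.user_id;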

[Image: example datelist table with one row per user and an array of active dates]

Zach Wilson showed a similar concept in a video he shared not that long ago with the title, Data modeling a 100 TB data lake into 5 TBs with STRUCT and Array.

In my experience, this method of modeling event data fits large tech companies rather well, both because of the metrics they focus on, such as retention and monthly active users, and because of the data sizes they are dealing with.

I haven't seen this specific data modeling technique used outside of tech organizations, or at the very least outside of reporting on retention and other engagement metrics.

Overall, it fits this specific use case rather well. I would love to know if anyone has seen it used elsewhere!

Fact Data – At Most Companies

Now taking a step back from aggregating and reporting, let’s look at just fact data itself.

Again, it should be easy: new data comes in, you insert it.

Done.


But here is the gotcha.

Not all events and fact data are page clicks. Some are bills for your visit to the hospital; others are orders where people return products they purchased.

How operational systems handle that can be very different. Although how you store the data might be straightforward, how you process it could vary.

For example, perhaps the system you're pulling from sends a correction line (e.g., if you spent $50 on a product but returned it, it sends a -$50 in a new row with the same ID). You might also have to pull this information from a completely different table that is labeled returns.

Another example could be healthcare claims, where the price of a service might be disputed. In order to correct it, you likely have to either replace the data with a new row for that claim or aggregate a correction row. This is shown below, where claim_id 5 needs a -$50 correction.
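
As a minimal sketch (the claims table and column names here are hypothetical), rolling the correction rows up might look like this:

-- Correction rows share the claim_id of the original line, so the true amount
-- is the sum of all rows for that claim (e.g. the -$50 row nets against claim_id 5's original charge).
SELECT
    claim_id,
    SUM(amount) AS net_claim_amount
FROM claims
GROUP BY claim_id;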

But in order to know how to set up your data flows, you need to understand how the data itself is created and stored. So making sure you go to the source and understand how it operates is critical when you model your data (of course, you should also talk to the business and understand that side of the data model).

Overall, there are so many ways to store and process historical data to ensure you capture information and can report on it accurately.

This Article Is History

Now as I publish this article, its creation gets recorded in a data warehouse somewhere, as do all of your clicks. I hope this helped you understand how various companies model their data.

Everything is a trade-off and you want to make sure you pick the right solution for the right job!

With that, I want to say thanks for reading!

If your team is looking for an advisor to help improve your data strategy or data infrastructure, then feel free to sign up for a consultation.

Thanks for reading! If you’d like to read more about data engineering, then check out the articles below.

How To Set Up Database Replication: Step-By-Step Guide 2024

Alternatives to SSIS(SQL Server Integration Services) – How To Migrate Away From SSIS

Migrate Data From DynamoDB to MySQL – Two Easy Methods

Is Everyone’s Data A Mess – The Truth About Working As A Data Engineer

Normalization Vs. Denormalization – Taking A Step Back

What Is Change Data Capture – Understanding Data Engineering 101

Why Everyone Cares About Snowflake