Should We Get Rid Of ETLs?
AWS has jumped on the bandwagon of removing the need for ETLs, and Snowflake has announced similar ambitions, both with its hybrid tables and with its partnership with Salesforce.
Now, I do take a little issue with the name “Zero ETL,” because at the surface the functionality described is often closer to a zero-integration future, which probably doesn’t come across as ‘sexy’ enough. This may also only be phase one of AWS’s and Snowflake’s plan to remove the need for ETLs.
Overall, I do agree with the idea of reducing the amount of duplicate logic and data that exist. So if there is a path that leads to a zero ETL world, we should make it happen.
But what would it take?
In this article, I will walk through what a Zero ETL future might look like and how we might get there.
The Problem Of Perception
When I read or hear about announcements like this, I assume there are undiscussed nuances. But I have found that when the business reads these types of announcements, they take them at face value.
They come back and tell their team: we want to move to this no-code, zero ETL, serverless future. It all sounds good from a business perspective: costs will be reduced, head counts can be slashed, and value from data can be gained immediately.
But that framing skips over all the nuances that are unavoidable.
Before diving into a Zero ETL future, let’s review some of the reasons ETLs exist.
Why We ETL
Simply duplicating data from point A to point B is not an ETL. If that were all we needed to do, we could just create replica databases and report off of those. So why create complex systems and use tools like Airflow or Prefect at all?
Why hire expensive data engineers?
Why even ETL?
Historical Data
Generally speaking, most operational databases don’t track historical data. Specifically, they don’t track historically changing entity data, like where a customer lives.
So when data gets updated or deleted, if you’re not using CDC (change data capture), you’re going to lose information. This is why the concept of slowly changing dimensions exists: to help track information so that if a customer moves states or an employee changes jobs, we can accurately reflect it over time.
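To make that concrete, here’s a minimal sketch of what a Type 2 slowly changing dimension looks like. The table and column names are hypothetical, and the exact syntax will vary by warehouse:

```sql
-- Hypothetical Type 2 dimension: each row records the window during which it was true.
CREATE TABLE dim_customer (
    customer_key BIGINT,      -- surrogate key
    customer_id  BIGINT,      -- natural key from the source system
    state        VARCHAR(2),
    valid_from   DATE,
    valid_to     DATE,        -- e.g. '9999-12-31' while the row is current
    is_current   BOOLEAN
);

-- When customer 42 moves states, close the old row and open a new one.
UPDATE dim_customer
SET valid_to = CURRENT_DATE, is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

INSERT INTO dim_customer
VALUES (1001, 42, 'WA', CURRENT_DATE, DATE '9999-12-31', TRUE);
```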
Now don’t get me wrong. Instead of a traditional SCD (slowly changing dimension), you could simply create a date partition and load data into an ever-expanding table. We did this from time to time at Facebook because it’s a simpler design.
But it can also make the data harder to query for analysts and data scientists. There were many cases where I had to build a secondary table that added the logic to track changes across date partitions.
Thus, in the end, it can be very tricky to remove the concept of slowly changing dimensions from a data model (but maybe we can automate it?).
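For comparison, here’s a rough sketch of the date-partition approach and the kind of secondary query I mentioned, assuming a made-up snapshot table and a warehouse that supports window functions:

```sql
-- Hypothetical snapshot table: customer_daily_snapshot(ds, customer_id, state),
-- where the pipeline appends a full copy of the source each day, keyed by ds.

-- The "secondary table" logic: reconstruct when each customer's state changed.
SELECT customer_id, ds AS changed_on, state
FROM (
    SELECT
        customer_id,
        ds,
        state,
        LAG(state) OVER (PARTITION BY customer_id ORDER BY ds) AS prev_state
    FROM customer_daily_snapshot
) snapshots
WHERE prev_state IS NULL OR state <> prev_state;
```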
Integrating Data
One of the major benefits of working at Facebook was how well integrated the data was at the application level. There was an entity layer that anyone could pull from while building new features, which meant it was easy to join entities together on both the application side and the analytical side.
By the end of my time at Facebook, we had gotten most of our core pipelines down to 30-50 lines of code, because they were mainly just configuration. Even the SQL part could be somewhat removed. Much of this was driven by the fact that, with a well-developed entity layer, we didn’t need to write complex queries to join data.
This is not always the case. Plenty of companies have data in all shapes and sizes, so it can be nearly impossible to integrate data across sources, or it can require an extra few hundred lines of SQL just to join across entities.
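As a toy illustration of what that extra SQL looks like, imagine joining a CRM export to a billing system that shares no clean key. Everything here, tables and cleanup rules alike, is invented:

```sql
-- Hypothetical sources with no shared ID: crm_accounts(account_name, email)
-- and billing_invoices(customer_email, amount). The join key has to be
-- manufactured by normalizing the emails on both sides.
SELECT
    a.account_name,
    SUM(i.amount) AS total_billed
FROM crm_accounts a
JOIN billing_invoices i
  ON LOWER(TRIM(a.email)) = LOWER(TRIM(i.customer_email))
GROUP BY a.account_name;
-- Multiply this by dozens of entities and their edge cases, and you get the
-- "few hundred lines of SQL" problem.
```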
Ease Of Use
Data pulled from operational databases such as MongoDB or MySQL is often structured in ways that are difficult for analysts to work with. Can you imagine handing an analyst a raw dump of MongoDB documents? It’d be deeply nested and error-prone to massage into any usable shape.
In the same way, a heavily normalized data model from a traditional MySQL database poses its own issues. Simply copying the data into Redshift or Snowflake would force analysts to write queries with multiple joins and, possibly, embedded business logic.
Again, error-prone.
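To picture the problem, here’s the kind of query a raw, normalized copy forces on an analyst. The schema is invented for illustration:

```sql
-- Hypothetical normalized OLTP schema copied straight into the warehouse.
-- Even "revenue by customer state" takes four tables:
SELECT
    addr.state,
    SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN customers c    ON c.customer_id = o.customer_id
JOIN addresses addr ON addr.address_id = c.billing_address_id
WHERE o.status = 'COMPLETE'
GROUP BY addr.state;
-- An ETL would typically pre-join this into one wide table so analysts can
-- write single-table queries instead.
```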
Much of what is done to data, whether it be standardizing naming conventions or remodeling the data altogether, is meant to make it more accessible to analysts. Not just by putting it into another database, but by treating it as a product that users interact with.
All that being said, I do believe there could be a future where the ETL, at least the one that creates the core layer of data, could be removed.
What Would Be Needed For This Zero ETL Future?
Entity Layer
If more companies could build systems that are integrated at the application layer, there likely wouldn’t be any issues in terms of integrating data.
This would also put a lot more responsibility on application teams and SaaS solutions. I do believe this would pose a significant challenge, and it’s a major reason a true zero ETL future could be hindered.
Naming And Data Type Conventions
Standardizing naming and type conventions would save a lot of tedious work. I think it’s fair to ask: why are data engineers still having to write little “t” transforms?
At the end of the day, it’s probably best we all agree to start naming fields with a standardized convention.
That would also allow us to standardize data types. Of course, software engineers would also have to agree on a date format.
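For context, here’s roughly what one of those little “t” transforms amounts to. The source columns are hypothetical, and I’m assuming dates arrive as ISO-8601 strings:

```sql
-- A little "t" transform: nothing but renames and casts so a raw source
-- table matches warehouse conventions (snake_case, consistent types).
-- Hypothetical source columns; assumes dates arrive as ISO-8601 strings.
SELECT
    CustID                           AS customer_id,
    CAST(OrderAmt AS DECIMAL(18, 2)) AS order_amount,
    CAST(CreatedDt AS DATE)          AS created_date
FROM raw_source_orders;
```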
Producer Ownership
Data warehouses or any pure-play infra layer unassisted cannot solve ZeroETL for SaaS apps. This vision can be achieved only if SaaS vendors take responsibility for moving data to their customers’ DW. – Arun Thulasidharan
On top of being responsible for moving the data, SaaS vendors and data producers will have to make sure they manage all the logic and data changes.
The hardest part of removing the ETL is the T. Not the basic little transforms such as standardizing naming conventions and data types (actually, why isn’t this already automated?). Even adding slowly changing dimensions could arguably be generalized.
No, there’d be quite a bit of further transformation after that. Any business logic or weird enumerator that calls for manual management would have to be dealt with by the producers of the data, whether that’s a company’s own application or Salesforce.
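Here’s a tiny, invented example of the kind of “weird enumerator” I mean:

```sql
-- Hypothetical example: the source system stores order status as cryptic
-- integer codes that only the producing team can authoritatively decode.
SELECT
    order_id,
    CASE status_code
        WHEN 1  THEN 'pending'
        WHEN 2  THEN 'shipped'
        WHEN 7  THEN 'shipped'   -- legacy code, same meaning as 2
        WHEN 99 THEN 'cancelled'
        ELSE 'unknown'
    END AS order_status
FROM raw_orders;
-- In a zero ETL world, this mapping has to live with the producer rather than
-- be reverse-engineered downstream.
```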
The Truth
Some might assume that because I started as a data engineer, I want to defend the ETL, a common task we take on. However, the world changes, and if the world no longer needs data engineers to build ETLs, then so be it.
Now, I don’t think this will happen in the next two years. Especially not at the enterprise level, where two years is the time required just to migrate off of current solutions (on a good day). So even if they started today, we are a long way away from going Zero ETL.
What are your thoughts? Can you see a world without ETLs? What would it take?