Why You Don’t Always Need An ETL
One of the biggest roadblocks for data scientists looking to analyze data is getting access to that data. The data itself will need to be processed and placed into a central data system like a data warehouse or data lake. This is done by using what we call an ETL.
ETLs (extract, transform, load), also called data pipelines, are automated processes that extract data from various sources, transform or remodel it, and load it into a data warehouse, data mart, or data lake.
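To make the term concrete, here is a minimal sketch of what a small ETL job often looks like in Python, assuming pandas and SQLAlchemy are available; the source URL, warehouse connection string, and table names are hypothetical.

```python
# A minimal ETL sketch. The export URL, connection string, and
# table names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine


def extract() -> pd.DataFrame:
    # Extract: pull raw order data from a source system (hypothetical CSV export).
    return pd.read_csv("https://example.com/exports/orders.csv")


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean up types and aggregate to the grain the warehouse expects.
    raw["order_date"] = pd.to_datetime(raw["order_date"])
    return raw.groupby(["order_date", "region"], as_index=False)["amount"].sum()


def load(df: pd.DataFrame) -> None:
    # Load: write the result into a warehouse table (hypothetical Postgres warehouse).
    engine = create_engine("postgresql://user:password@warehouse-host/analytics")
    df.to_sql("daily_sales", engine, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract()))
```

Even a toy job like this needs scheduling, monitoring, and maintenance once it runs in production, and real pipelines multiply quickly.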
These ETLs are the lifeblood of many organizations’ analytical teams. They provide data to data science teams, dashboards, and analysts. Data pipelines, while necessary, are time-consuming to develop, and the data engineering and BI teams are often too busy to take on every new pipeline that is required.
This slows down data scientists and analysts who need to provide their management with clear and concise analysis. This also slows down ad-hoc analysis that business users need to make fast decisions or experiment with new ideas, campaigns, etc.
This raises the question: are ETLs always required, especially when your team needs to turn around metrics and analysis quickly in order to adapt to external factors?
In this article we will discuss some of the difficulties that data scientists and analysts face while they work with data and outline why ETLs might not always be required.
Challenges With ETLs
Directors Need Data Quickly
If you have worked as a data scientist or analyst in a large corporation, then you are accustomed to the constant fire drills leadership runs for data. It seems like every day a new report is needed and a new analysis has to be done.
Oftentimes this will require new sets of data to be pulled into your data warehouse. This will block your team’s ability to answer questions quickly because they will need to wait for the BI and data engineering teams to create the pipelines and infrastructure to support the new data sets.
A great example of the sudden need for new analysis is the recent push for remote work, as well as the reduced sales many companies are facing. Companies across the world more than likely needed insights into how their businesses would be affected.
However, due to the technical complexity of data engineering, it might have taken these companies a long time to get access to the data they need.
Data Is Siloed
Even once data is loaded into data warehouses, data scientists still face problems as far as where that data lives. Many large organizations will have data warehouses that are specific to departments like finance and operations. Since many of the questions data scientists will be trying to answer will encompass multiple departments, they will need access to all of these data sets.
Getting access to all of the data required is time-consuming. Your data science teams will either need the IT team to create multiple cross-database connections or need some new form of centralized data warehouse to be built. Either option adds more complexity and demands more resources to develop.
In the end, waiting for the data engineering and BI teams to create ETLs is not always the best option.
Data Virtualization
ETLs and data pipelines aren’t the only option when it comes to ad hoc data access. Data virtualization is a methodology that allows users to access data from multiple data sources, data structures and third-party providers. It essentially creates a single layer where regardless of the technology used to store the underlying data, the end-user will be able to access it through a single point.
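As a rough illustration of what that single access layer can look like, here is a sketch using Trino, an open-source SQL engine often used for federated queries across sources. The host, catalog names (postgres_finance, hive_ops), and table names are assumptions; a commercial data virtualization product exposes a similar single point of access.

```python
# A sketch of querying two different systems through one federation layer.
# The endpoint, catalogs, and tables below are assumptions for illustration.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com",  # hypothetical federation endpoint
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One SQL statement joins a finance database and an operations data lake,
# even though the data lives in two different backends.
cur.execute("""
    SELECT o.region, SUM(o.amount) AS revenue, COUNT(s.ticket_id) AS tickets
    FROM postgres_finance.public.orders o
    JOIN hive_ops.warehouse.support_tickets s
      ON o.customer_id = s.customer_id
    GROUP BY o.region
""")
for row in cur.fetchall():
    print(row)
```

The analyst writes one query; the virtualization layer worries about where the data actually lives.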
Overall, data virtualization offers several advantages when your team needs access to data fast. Here are a few examples of how data virtualization can benefit your team.
Faster Analytics
Directors, CTOs, and decision-makers in general are no longer OK with waiting months to get a new report. As a data analyst, you used to be able to point to the other departments that were slowing you down: you had to put in data requests that would get lost in the sea of other IT requests.
But with so many self-service analytics tools available, not being able to access data and develop reports quickly can be a major disadvantage. Your competition might already be getting insights into the newest happenings in the world while you are trapped behind an archaic system.
Data virtualization looks to increase the speed at which analysts can access data by simplifying the entire process, thus improving the speed of your analytics. The goal is that when a decision-maker asks a question, they can have an answer in a few hours or by the next day, not three months from now.
Reduces Workload On Data Engineers
Data engineers and BI teams are often the bottleneck between data analysts and the data they need. It’s no fault of their own. There are so many different initiatives and projects going on that it can be difficult to manage every ad-hoc data request that comes down the pipeline.
Data virtualization lets analysts serve many of these requests themselves, which allows your data engineers to focus on larger, more impactful work rather than on a steady stream of smaller data requests.
Simplifying Data Workflows And Infrastructure
Getting data from all of a company’s various database systems and third parties is very complicated. One third-party API is SOAP-based and returns XML, another only exports CSV reports, and another is only updated every 24 hours.
This of course doesn’t even account for all the various database systems and cloud storage systems. The world of data is becoming more and more complex. This makes it hard to get all of your data into one place.
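To see why this gets expensive, compare the extraction code for just two of those sources; every format demands its own logic. The endpoints, SOAP envelope, and field names below are made up for illustration.

```python
# Two hypothetical sources, two completely different extraction paths.
import csv
import io
import xml.etree.ElementTree as ET

import requests  # assumed available for HTTP calls


def extract_from_soap_api(url: str) -> list[dict]:
    # One vendor only speaks SOAP/XML: build an envelope, then walk the response tree.
    envelope = """<?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body><GetOrders/></soap:Body>
    </soap:Envelope>"""
    resp = requests.post(url, data=envelope, headers={"Content-Type": "text/xml"})
    tree = ET.fromstring(resp.text)
    return [{"id": o.findtext("Id"), "amount": o.findtext("Amount")}
            for o in tree.iter("Order")]


def extract_from_csv_export(url: str) -> list[dict]:
    # Another vendor only offers a daily CSV dump with its own column names.
    resp = requests.get(url)
    return list(csv.DictReader(io.StringIO(resp.text)))
```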
You need lots of ETLs, data warehouses, and workflows to manage all of the various data sets, and even then, sometimes all that data just gets siloed off for each team. That makes for a very complex and difficult world for the data analyst to work in.
Data virtualization circumvents that by connecting data sources virtually rather than requiring a separate ETL for every process and data source. Overall, this simplifies your company’s data infrastructure and reduces the number of workflows required.
Data Virtualization Is Infrastructure-Agnostic
This means you can integrate all of your data with whatever your company’s current databases are, resulting in lower operational costs. You could be using Oracle, MySQL, Postgres, AWS RDS, or many other database backends, but data virtualization’s goal is to integrate all of them into one final system.
Some of this is dependent on the data virtualization provider you choose. But overall, many of them are quite capable of integrating with most databases.
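To ground the idea, here is a crude sketch of what backend-agnostic access means in practice: the analyst addresses a logical data set name and never has to know which engine serves it. The connection strings, dialects, and table names are assumptions, and a real virtualization product adds cross-source joins, query pushdown, security, and caching on top of this.

```python
# A crude sketch of backend-agnostic access: a logical name maps to
# whichever engine actually holds the data. Connection strings and
# table names are hypothetical.
from sqlalchemy import create_engine, text

# Logical name -> (engine, physical table), regardless of backend technology.
CATALOG = {
    "orders":    (create_engine("oracle+cx_oracle://user:pw@finance-db/ORCL"), "sales.orders"),
    "customers": (create_engine("mysql+pymysql://user:pw@crm-db/crm"), "crm.customers"),
    "events":    (create_engine("postgresql://user:pw@rds-host/analytics"), "web.events"),
}


def query(logical_name: str, where: str = "1=1"):
    # Simplified routing: look up the backend and run the query there.
    engine, table = CATALOG[logical_name]
    with engine.connect() as conn:
        return conn.execute(text(f"SELECT * FROM {table} WHERE {where}")).fetchall()


# The analyst's code looks the same no matter which database answers.
recent_orders = query("orders", "order_date >= DATE '2020-01-01'")
```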
What Is An Example Of A Data Virtualization Tool?
There are lots of great tools out there that can act as a data virtualization layer, everything from long-standing products to new up-and-comers. One great example of a newer data virtualization product is Promethium.
Promethium’s data virtualization product, called the Data Navigation System (DNS), acts as an augmented data management solution. The goal of this system is to give analysts the ability to discover, validate, and assemble data easily.
This gives end-users the ability to write ad-hoc queries and answer directors’ questions faster. Better yet, even non-technical business users can take advantage of Promethium’s no-code capabilities, where queries are generated automatically just by asking questions.
It also provides a unified view of all the available data sets while detecting relationships between them. As your data scientists and developers work with the various data sets, Promethium will surface which data is likely related to other data sets. This enables easy modeling without impacting the production data. With Promethium, data discovery and data prep are combined with data virtualization, so users can go from asking a question to executing a query in a matter of minutes, all in one simple workflow.
In turn, this can make writing queries substantially easier. The goal of Promethium is to let users quickly discover data across data sources, determine what’s needed to assemble the data, prep it, and lastly query it, all without moving the data. Users can quickly experiment or arrive at an answer without spending months of time and effort, and they can verify that what they want to run in the production ETL is indeed the best version. Promethium can complement an existing ETL: the ETL handles production data pipelines, while all the rapid experimentation happens in Promethium without the wait.
ETLs Aren’t Going Anywhere, But
At the end of the day, ETLs and other data pipelines aren’t going anywhere. They remain the main method for creating production tables and data products. However, from an ad hoc analysis perspective, there is value in considering data virtualization once your company has multiple data sources that are becoming difficult to manage. It is not a one-size-fits-all solution, but it is proving to be a great option for large companies that need to quickly pull data from all of their various data lakes, data marts, and data warehouses.