Improving Your Data Analytics Infrastructure In 2023 – Part 1
Data has consistently been shown to be a valuable asset for businesses of all sizes. Consulting firms like McKinsey have found that companies using AI and analytics attribute 20% of their earnings to them. As a consultant, I have personally witnessed how data can uncover new sources of revenue and cost reduction opportunities for clients.
However, in order to fully take advantage of the benefits that modern data has to offer, it is necessary to have a well-designed data infrastructure in place. It’s important to note that this does not necessarily mean investing in the most expensive data tooling – in fact, I have often recommended simpler and more cost-effective solutions to clients. In this article, we will explore what to avoid when building a data architecture and consider key questions to guide the development of a future-proof data infrastructure.
What You Should Avoid When Building Your Data Infrastructure
Developing robust yet simple data infrastructure can be challenging, as there are numerous potential pitfalls to avoid. One key issue to focus on is preventing the accumulation of technical debt: a well-functioning data system should also be easy to manage and cost-effective.
However, we have encountered numerous companies that use ineffective methods for running data pipelines and managing data, such as patchwork ETL pipelines and reliance on Excel for data transformations. In this article, we will examine these examples in more detail.
Duct Tape Data Stacks
Look, I know plenty of people who have done it.
They have tried to create their very own meta-database and orchestration tool to run their data pipelines.
These often end up being very complex data pipelines that are held together by a few overworked data engineers struggling to keep them running.
This style of architecture remains popular because it is so easy to set up. Depending on how complex your pipelines are, how many you have, and how often they run, a few scripts that kick off your pipelines once or twice a quarter really might not be the worst solution.
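To make this concrete, here is a rough sketch of what one of these patchwork pipelines often looks like: a plain Python script kicked off by a crontab entry. Every path, table, and column name below is invented purely for illustration.

```python
# Hypothetical "duct tape" pipeline, typically wired up with a crontab entry like:
#   0 2 1 */3 * /usr/bin/python3 /opt/pipelines/load_sales.py
# All paths, tables, and columns below are made up for illustration.
import csv
import sqlite3  # stand-in for whatever database your team actually uses

def load_sales(csv_path: str, db_path: str) -> None:
    """Read a CSV export and append it to a reporting table (assumed to exist)."""
    with open(csv_path, newline="") as f:
        rows = [(r["order_id"], r["amount"], r["order_date"]) for r in csv.DictReader(f)]
    conn = sqlite3.connect(db_path)
    conn.executemany(
        "INSERT INTO sales (order_id, amount, order_date) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_sales("/data/exports/sales.csv", "/data/reporting.db")
```

It works, and for a quarterly refresh it may be all you need. The trouble starts when dozens of these scripts depend on each other with nothing tracking failures or run order.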
But.
Eventually, as your teams start needing daily updates, and live data feeds, you will have to switch over to a more mature and modern solution.
There are a whole host of solutions here that we will discuss. But just know, to mature in your data strategy, you will need to step away from this patchwork style architecture.
It creates a lot of technical debt, and then you will need to hire someone like me to come in and untangle your data pipeline mess.
Running Data Pipelines And Processes Too Often
Many data pipelines these days run in the cloud. This makes them easy to work with while your teams are remote, and you don't need to mess with on-premises servers.
It can also lead to a lot of unnecessary costs because your engineers decide to run a data pipeline once every hour even though it supports a report or dashboard that is only viewed once a week. In fact, I covered some of this in a newsletter about cutting data infrastructure costs.
Now, depending on how you have your ETL set up, this could be very expensive, especially if the query reruns over all of your data every time.
This could lead to tens if not hundreds of thousands of dollars in extra costs annually.
When you develop ETLs and data pipelines, it is important to understand why you are building them and what they are for.
This helps ensure that your data engineers and developers pick the right cadence as well as the right tools for the job.
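One common culprit is a pipeline that re-reads the entire source table on every run. Below is a minimal sketch of incremental extraction using a "watermark," so each run only pulls rows that changed since the last one; the table, columns, and connection are hypothetical.

```python
# Minimal sketch of incremental extraction: pull only the rows updated since the
# last successful run (the "watermark") instead of rescanning the whole table.
# Table and column names are hypothetical.
from datetime import datetime
import sqlite3

def extract_new_orders(conn: sqlite3.Connection, last_run: datetime) -> list:
    """Fetch only rows that changed since the previous run."""
    cursor = conn.execute(
        "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_run.isoformat(),),
    )
    return cursor.fetchall()
```

Pair that with a sensible schedule: a dashboard that is viewed weekly only needs the job to run weekly, not hourly.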
The “Let’s Just Do It All In Excel And Macros” Approach
Excel is a great tool for doing ad-hoc analysis. It’s easy to create formulas, quick charts, and some quick slicing and dicing.
One thing I feel Excel doesn't do well is act as a data pipelining tool or a key component of your data infrastructure. That said, I have seen engineers develop systems that somehow have Excel orchestrating entire data pipelines through a combination of macros and scripts. I have nothing against Excel macros.
They have their place. But overall, they don’t tend to be a good choice when it comes to developing enterprise data infrastructure.
This is because macro code is often isolated inside that one Excel document and isn't built to integrate easily with other, external systems. Compare this to a language like Python, which can easily be used to create data pipelines, run models, or power an API.
Creating data infrastructure in Excel can actually end up looking very similar to the cron option. It comes across just as unruly and will eventually require you to hire a data engineering or data science consultant to fix the mess.
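For comparison, here is roughly what pulling one of those macro-driven transformations into a small Python script might look like. This is a sketch using pandas with invented file and column names, not a drop-in replacement for any specific workbook.

```python
# A rough sketch of moving a typical Excel-macro aggregation into Python/pandas.
# File names and column names are invented; reading .xlsx files requires openpyxl.
import pandas as pd

def summarize_monthly_sales(input_path: str, output_path: str) -> None:
    """Aggregate raw order rows into a monthly summary, as a macro might."""
    orders = pd.read_excel(input_path)
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    monthly = (
        orders.groupby(orders["order_date"].dt.to_period("M"))["amount"]
        .sum()
        .reset_index(name="total_amount")
    )
    monthly.to_csv(output_path, index=False)

if __name__ == "__main__":
    summarize_monthly_sales("raw_orders.xlsx", "monthly_summary.csv")
```

Unlike a macro buried in a workbook, a script like this can be version-controlled, tested, and scheduled alongside the rest of your pipelines.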
Ok, let’s step away from what you should avoid.
What To Consider When You're Building Modern Data Infrastructure
To pick the right data tools for your data strategy, you first need to have a data strategy.
For example, if your team is just starting out and you are looking to create a monthly report that pulls from 5-6 different data sources, you already know you probably won't need a streaming data solution.
So step one to picking the correct architecture is deciding what you want to do with your data.
If you don’t already have an idea of what your team wants to do, then you can check out our guide for creating a data strategy.
Once you have a data strategy you can start to assess where your team currently is and where it is going.
There is a broad range of tools to choose from when putting your data strategy into practice.
Overall, you need to consider a few key questions:
- What are your data goals?
- How will you get your data from your source systems to your data warehouse/data lake(house)?
- What data warehouse will you use?
- What tool will you use for your data visualization?
- And do you plan to do any form of machine learning or data science?
These are some of the core questions you will need to answer.
Our team is currently working on a data strategy book that will help answer these questions in detail, you can sign up here to get it as soon as it comes out. But let’s start to answer the first question.
How Will You Ingest Your Data?
How you ingest your data is one of the more important questions you will answer. Truthfully, it only becomes your first question once you have already decided what you will be doing with your data, but let's continue assuming you have done that.
Because the solution you pick here will define how much work and maintenance your future engineers will need to do.
For example, earlier in this article we discussed how using crontab to create a patchwork system that is all duct-taped together will be very challenging to work with.
But there are more modern data pipelines and ETL solutions that your team can use to avoid many of the problems that occur when you try to develop your own data infrastructure.
ETLs
ETLs are the classic, bread-and-butter data engineering method for getting data from point A to point B. This is the process of extracting, transforming, and then loading data into your data warehouse or data lake.
ETL tools range from low-code options, which we will discuss in the Python frameworks section, to drag-and-drop options like SSIS and Talend.
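As a point of reference, here is a bare-bones ETL written in plain Python. The API endpoint, field names, and target table are all assumptions made for the sake of the example.

```python
# Bare-bones extract-transform-load sketch. The endpoint, fields, and target
# table are hypothetical.
import sqlite3
import requests

def run_etl(db_path: str) -> None:
    # Extract: pull raw records from the source system.
    records = requests.get("https://example.com/api/orders", timeout=30).json()

    # Transform: clean and reshape the data before it lands in the warehouse.
    rows = [(r["id"], float(r["amount"]), r["created_at"][:10]) for r in records]

    # Load: write the cleaned rows into the warehouse table.
    conn = sqlite3.connect(db_path)
    conn.executemany(
        "INSERT INTO orders (order_id, amount, order_date) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    conn.close()
```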
ELTs
ELTs are very similar to ETLs except for the order in which the steps are done. Instead of your standard ETL, where the transform happens in the middle, an ELT focuses on just getting the data loaded first.
A transform may then be applied later if required.
There are a lot of tools trying to push this method. For example, Fivetran has gained a lot of traction (as has Portable on the long-tail connector side). Now, you can in theory also create an ELT with other tools, like the Python frameworks we will be discussing shortly.
The difference is that many of the tools such as Fivetran already have connectors to popular data sources so you don’t need to write any code to connect to things like Salesforce.
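To show the difference in ordering, here is a hedged ELT sketch: the raw payload is landed first, and the reshaping happens inside the warehouse afterwards. The tables are invented, and the SQL assumes an engine with JSON functions (SQLite's json_extract is used here just to keep the example self-contained).

```python
# Minimal ELT sketch: land raw records first, transform inside the warehouse later.
# Table names are hypothetical; json_extract assumes a SQLite build with JSON support.
import json
import sqlite3
import requests

def load_raw(conn: sqlite3.Connection) -> None:
    """E + L: dump source records as-is into a raw staging table."""
    records = requests.get("https://example.com/api/orders", timeout=30).json()
    conn.executemany(
        "INSERT INTO raw_orders (payload) VALUES (?)",
        [(json.dumps(r),) for r in records],
    )
    conn.commit()

def transform_in_warehouse(conn: sqlite3.Connection) -> None:
    """T: reshape the raw data with SQL once a report or model actually needs it."""
    conn.execute(
        """
        INSERT INTO orders (order_id, amount, order_date)
        SELECT json_extract(payload, '$.id'),
               json_extract(payload, '$.amount'),
               json_extract(payload, '$.created_at')
        FROM raw_orders
        """
    )
    conn.commit()
```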
Python Frameworks (These Could Also Fall Under ETL/ELT)
Python libraries like Airflow and Luigi (just to name two) have become very popular over the past few years. The reason for their popularity is their ability to balance customization with the pre-built metadata databases, logging, dependency management, and several other bells and whistles these frameworks provide for building an ETL.
People were already building hacky cron-and-Python data pipelines.
So once many of them were provided with a framework like Airflow, it quickly caught on.
What makes these tools unique is how easily you can mix your own Python code with the scheduling, logging, and dependency management they provide out of the box.
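Here is a minimal Airflow DAG sketch to show what that looks like in practice: the framework handles scheduling, dependencies, logging, and retries while the task bodies stay placeholder code.

```python
# Minimal Airflow DAG sketch. The task logic is placeholder; Airflow supplies the
# scheduling, dependency tracking, logging, and retries around it.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def load():
    print("write the data into the warehouse")

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # run extract first, then load
```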
Streaming Tools
Data pipelines aren't limited to batch-style tools like Fivetran or Airflow. Another popular method for getting data from point A to point B is using streaming tools.
Using streaming technology like Kafka, Kinesis, and Rockset allows you to stream events from any source and then run transforms or data science models over the data as it flows through the streams.
Streaming data will often initially be loaded into a data lake, but it could also be loaded into a data warehouse. The reason developers pick a streaming solution is usually that the business or customer needs to know the information right away. For example, let's say your company manages multiple factories around the world, and if a machine goes down in any of them, it could cause major delays or other issues.
Would you want to get the information of a likely failure the next day when the batch ETL job runs or right as it is happening?
Streaming provides the ability for data engineers to create systems that can provide up-to-date information.
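As a hedged sketch of the factory example above, here is what a simple consumer might look like with Kafka's Python client. The topic name, broker address, and event fields are all invented.

```python
# Minimal sketch of consuming a stream of machine events and flagging likely
# failures as they arrive. Topic, broker, and event schema are hypothetical.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "factory-machine-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for event in consumer:
    reading = event.value
    # React as the event arrives instead of waiting for tomorrow's batch job.
    if reading.get("temperature_c", 0) > 90:
        print(f"Possible failure on machine {reading.get('machine_id')}: {reading}")
```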
We won't go into this now. But one of the questions I get all the time is: why not have all your data live all the time?
This will require a separate article. However, the amount of technical expertise currently required to maintain, use, and migrate to streaming is substantial. So if it doesn't make sense to run your data all the time, you probably shouldn't.
Do You Need To Modernize Your Data Analytics Architecture?
I will spend more time diving into some of the other questions in the future, as well as in the data analytics strategy guide I will be putting out.
But, if you need to ask some questions about your data infrastructure today, or you want to know how you could use your data to help increase your revenue or reduce your costs, then feel free to set up some time with me.
Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.
Data Governance for Modern Organizations, Part 1
26 Data Catalogs – From Open Source To Managed
Reducing Data Analytics Costs In 2023 – Doing More With Less
Databases Vs Data Warehouses Vs Data Lakes
Onboarding For Data Teams – How to set-up a streamlined onboarding experience