How To Set Up Your Data Infrastructure In 2025 – Part 1

April 15, 2025 | Data Engineering

Planning out your data infrastructure in 2025 can feel wildly different than it did even five years ago. The ecosystem is louder, flashier, and more fragmented. Everyone is talking about AI, chatbots, LLMs, vector databases, and whether your data stack is “AI-ready.” Vendors promise magic, just plug in their tool and watch your insights appear. There’s this pressure, almost an anxiety, that if you’re not redesigning everything around artificial intelligence, you’re already behind.

But underneath all the noise, the fundamentals of good data infrastructure haven’t changed. You still need reliable pipelines. You still need clean, usable data. And you still need systems that support the people (and now the machines) who are trying to make decisions from that data.

So how do you, as a data leader, design infrastructure that prepares you for what’s next, without getting distracted by hype?

What Hasn’t Changed

In many ways, the basics of answering key business questions haven't changed. In fact, I've worked with several clients that are modern AI companies and still needed help building out their analytics infrastructure.

At the end of the day, every company still needs to ask questions about its customers and operations, and to track the answers over time to see whether the changes it makes have an impact.

The point I usually add is that we can often do it faster now and, if you're collecting the right data, in a more personalized way. Nevertheless, the key layers of data infrastructure are still important. Those layers usually are:

  • Data pipelines and ingestion
  • Data storage (a warehouse, lake, or lakehouse)
  • Data visualization and reporting
  • Data science and notebook environments

In between those layers there are plenty of other tools you can add. But as you set up your data infrastructure and pick out your data stack, these are the layers you'll need to cover.

Data Pipelines And Ingestion – Getting Data Into Your System


Data ingestion is the process of moving data from various sources into your analytical environment, typically a data warehouse or lake. You might hear this referred to as ETL, ELT, real-time pipelines, or just data pipelines in general.

There are three common approaches:

  • Custom code
  • Low-code tools
  • Frameworks

Let’s break down the differences.

Custom Code

Custom code is exactly what it sounds like. You will be the one writing your API, database, and SFTP connectors, as well as any transform workflows. How much of this you take on can vary; some teams use a combination of custom code plus a framework like Airflow to manage their data workflows.

Still others will try to create their own orchestration tool, which I don't recommend. But there can be benefits to writing your own code: you don't have to go through procurement or sign contracts, and in some cases it can be cheaper (though don't forget to consider total cost of ownership).
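
To make that trade-off concrete, here is a minimal sketch of what a hand-rolled connector can look like. The API endpoint, fields, and staging table are hypothetical stand-ins, and a production version would also need authentication, pagination, retries, and incremental state:

```python
# A minimal, hand-rolled connector: pull records from an API and land
# them in a staging table. The endpoint, fields, and table name below
# are hypothetical stand-ins; sqlite is used as a placeholder for your
# actual warehouse.
import sqlite3
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical endpoint

def extract(since: str) -> list[dict]:
    """Pull raw records updated since a given date."""
    resp = requests.get(API_URL, params={"updated_since": since}, timeout=30)
    resp.raise_for_status()
    return resp.json()

def load(records: list[dict]) -> None:
    """Land raw records in a staging table."""
    conn = sqlite3.connect("staging.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_orders ("
        "id TEXT PRIMARY KEY, payload TEXT, "
        "loaded_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO stg_orders (id, payload) VALUES (?, ?)",
        [(str(r.get("id")), str(r)) for r in records],
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(extract(since="2025-01-01"))
```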

Low-Code Tools


There are many valid reasons your team might go down the low-code/no-code path. One is that your data team is small; another is that you'd rather spend your team's time elsewhere. Does it make sense for you to write custom code? Do you gain an advantage by pulling the data yourself? If not, it might make sense to look into other platforms for your data infrastructure.

Platforms like Estuary, Fivetran, and Portable fill this gap. These tools let you set up data pipelines through a UI, with minimal (or no) code. They connect directly to many popular tools and APIs and automatically keep the data in sync.

In most cases, these tools need to be paired with a transform tool like dbt or SQLMesh, since they focus on getting data in rather than modeling it.
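
In practice, that pairing often means the low-code tool lands raw tables on a schedule and you trigger the transform layer afterwards. Here is a minimal sketch of kicking off dbt from a script, assuming the dbt CLI is installed and configured and that `tag:staging` is a selector defined in your own project:

```python
# Minimal sketch: trigger dbt after the low-code tool has landed fresh
# data. Assumes the dbt CLI is installed and configured, and that
# "tag:staging" is a selector defined in your own dbt project.
import subprocess
import sys

def run_dbt_models(select: str = "tag:staging") -> None:
    result = subprocess.run(
        ["dbt", "run", "--select", select],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
        raise RuntimeError("dbt run failed")

if __name__ == "__main__":
    run_dbt_models()
```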

Frameworks

Tools like Airflow or Dagster give you a lot of control. They help orchestrate complex workflows, support a wide range of data sources, and offer out-of-the-box operators (like BigQuery hooks, file sensors, and HTTP operators). But they also require real engineering effort, some DevOps experience, an understanding of Python, and time to manage the system.

These frameworks are best for teams that have dedicated engineers and more complex data pipelines.
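
As a rough illustration of what that engineering effort buys you, here is a minimal Airflow DAG sketch using the TaskFlow API (Airflow 2.x). The extract and load bodies are placeholders you would swap for your own connectors or operators:

```python
# A minimal Airflow DAG sketch using the TaskFlow API (Airflow 2.x).
# The extract/load bodies are placeholders for your own connectors or
# out-of-the-box operators (BigQuery hooks, HTTP operators, etc.).
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def daily_orders_pipeline():
    @task
    def extract() -> list[dict]:
        # call an API or database here
        return [{"id": 1, "amount": 42.0}]

    @task
    def load(records: list[dict]) -> None:
        # write to your warehouse staging schema here
        print(f"loaded {len(records)} records")

    load(extract())

daily_orders_pipeline()
```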

Data Storage – Building a Central Source of Truth

Once your data is ingested, it needs a place to live. That’s where storage systems come in, most often in the form of a data warehouse, data lake, or lakehouse.

This layer acts as the foundation for analysis, machine learning, and reporting. It unifies disparate data sources into one environment and enables cross-functional access.

Common Storage Solutions:

  • Cloud data warehouses such as Snowflake, BigQuery, and Redshift
  • Data lakes built on cloud object storage (e.g., S3 or GCS)
  • Lakehouse platforms such as Databricks

Each of these platforms has different performance, pricing, and tooling considerations. Warehouses are often better for structured analytics and dashboards; lakes and lakehouses offer more flexibility for raw and semi-structured data, especially for ML workloads. At least, that's what most people will say. I have seen a mix of everything, and in my experience the right choice is highly dependent on how big your team is, your use cases, and so on. For every team using a lakehouse for their ML workloads, there is another using SQL Server.

Nevertheless, many goals don’t change. For example, your goal might be to create a centralized data layer where teams can work from the same definitions, metrics, and logic.

Instead of pulling separate reports from Workday, your application database, and Facebook Ads, your team can write one query to join and analyze everything in one place.
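
As a sketch of what that looks like in practice, here is one query across sources that now sit side by side in the warehouse. The table and column names are hypothetical, and the engine URL is a placeholder for whatever warehouse and driver you actually use:

```python
# One query across sources that now live side by side in the warehouse.
# Table and column names are hypothetical; the engine URL is a
# placeholder for your warehouse connection.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("snowflake://<user>:<password>@<account>/<database>")

SIGNUPS_VS_SPEND = text("""
    select
        s.signup_date,
        count(*)     as signups,
        sum(f.spend) as ad_spend
    from app_db_signups s
    left join facebook_ads_daily f on f.date = s.signup_date
    group by 1
    order by 1
""")

with engine.connect() as conn:
    df = pd.read_sql(SIGNUPS_VS_SPEND, conn)

print(df.head())
```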

A good storage layer also unlocks:

  • Standardized data models
  • Governance and access control
  • Data quality and consistency
  • Faster time to insight

Without this layer, teams end up manually pulling data into spreadsheets, merging CSVs, and duplicating logic, a recipe for confusion and inconsistency. That’s where building a central data analytics platform is key.

Data Visualization – Turning Data Into Decisions


Ingestion and storage are necessary, but not sufficient. To drive impact, you need to help people actually use the data. After all, a bunch of tables in Snowflake don’t really provide much value.

That’s where data visualization and reporting come in.

Dashboards, KPIs, self-service tools, and even Excel reports are all ways teams consume data and make decisions.

When done right, reporting tools:

  • Provide executives with clarity
  • Enable teams to track initiatives
  • Help analysts test hypotheses

Of course, the key phrase here is "when done right." It is difficult to create dashboards and reports that executives can actually act on. If you don't believe me, just ask anyone about their dashboard graveyard.

Common BI Tools:

  • Looker
  • Tableau
  • Power BI
  • Metabase

Each tool comes with trade-offs. Looker, for example, has a steeper learning curve but offers a semantic modeling layer that ensures consistent logic. Tableau is highly visual and user-friendly but can become siloed if not governed.

Choose your tool based on the needs of your team and the complexity of your reporting environment, not just on popularity.

The ultimate goal? Build reports that are trusted, actionable, and aligned with how your business operates.

Data Science & Notebooks

While dashboards are built for monitoring and decision-making, data science tools support exploration, experimentation, and advanced analysis. This is where notebooks and model development environments come in.

Not every company needs machine learning from day one. But as your data maturity grows, you'll likely want to go beyond just looking at the past: you'll want to forecast, segment, classify, or personalize based on that data.

That’s where tools like Jupyter notebooks, Databricks, Hex, or Deepnote play a role.

What Makes Notebooks Valuable

Notebooks are ideal for:

  • Prototyping models quickly
  • Exploring data visually with Python, R, or SQL
  • Combining code, charts, and documentation in one place
  • Sharing insights across teams or stakeholders

They’re especially useful for cross-functional work, where a data scientist wants to show not just the results, but how they got there.
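
To make that concrete, here is the kind of quick exploratory cell a notebook is built for; the file and column names are hypothetical stand-ins for one of your curated tables:

```python
# A quick exploratory cell: load a curated table, summarize it, and
# chart it inline. The file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

orders = pd.read_csv("orders_sample.csv", parse_dates=["order_date"])

monthly = (
    orders
    .assign(month=orders["order_date"].dt.to_period("M").dt.to_timestamp())
    .groupby("month", as_index=False)["amount"]
    .sum()
)

monthly.plot(x="month", y="amount", kind="line", title="Monthly order volume")
plt.show()
```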

Some platforms (like Hex) even allow you to publish notebooks as apps or dashboards, blurring the line between exploration and production.

Use Cases for This Layer

  • Customer segmentation (see the sketch after this list)
  • Predictive modeling (churn, sales, etc.)
  • Experiment analysis (A/B tests)
  • Time series forecasting
  • NLP for internal documents or chat data
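
Here is a minimal sketch of the segmentation use case, clustering customers on a few behavioral features. The feature names are hypothetical, and real work would involve feature engineering and validating the clusters:

```python
# Minimal sketch of customer segmentation: cluster customers on a few
# behavioral features. Feature names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customer_features.csv")  # one row per customer
feature_cols = ["orders_last_90d", "avg_order_value", "days_since_last_order"]

scaled = StandardScaler().fit_transform(customers[feature_cols])
customers["segment"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)

print(customers.groupby("segment")[feature_cols].mean())
```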

Collaboration and Governance

One of the common issues with notebooks is that they can become siloed, undocumented, and hard to reproduce. To avoid this:

  • Store notebooks in version control (many platforms now have this built in)
  • Connect them to the same curated data models
  • Establish clear handoff processes from notebooks to production pipelines

Think of notebooks as the R&D lab of your data stack. They don't replace pipelines or BI; they complement them by enabling deeper questions and experimentation.

As your team matures, this layer becomes more important not just for data scientists, but for analysts and product teams who want to explore without committing to production changes right away.

This will get you started. Of course, you'll eventually want to include some layer of data quality, but let's begin with the initial setup and build from there.

What You Should Prioritize

With so many tools and trends flying around, it’s easy to get distracted. But if you want your infrastructure to serve your business now and scale in the future, you need to focus on the right foundations.

Here’s what should top your priority list:

Treat Data Like Software

Much of your data infrastructure should follow the same rigor as software engineering. Not everything maps one-to-one (testing and data quality, in particular, can look quite different), but there are plenty of practices we can borrow from software:

  • Version control for pipelines and data models
  • Code review and CI/CD
  • Automated testing and data quality checks
  • Observability and monitoring
  • Clear ownership and documentation
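
As a small example of what borrowing from software can look like, here is a minimal pytest-style data check. The table and columns are hypothetical, and tools like dbt tests or Great Expectations formalize the same idea at scale:

```python
# A minimal, pytest-style data check on a curated table. The table and
# column names are hypothetical stand-ins.
import pandas as pd

def load_dim_customers() -> pd.DataFrame:
    # swap in a warehouse query in a real setup
    return pd.read_csv("dim_customers.csv")

def test_customer_ids_are_unique_and_not_null():
    df = load_dim_customers()
    assert df["customer_id"].notna().all(), "null customer_id found"
    assert df["customer_id"].is_unique, "duplicate customer_id found"

def test_table_is_not_empty():
    assert len(load_dim_customers()) > 0, "dim_customers is empty"
```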

The more your data ecosystem grows, the more important it becomes to introduce these engineering best practices. Observability, testing, and ownership aren't luxuries; they're guardrails.

Build for Flexibility, Things Will Change

There’s a lot of debate around data warehouses vs. lakehouses vs. vector stores. The answer isn’t picking the “winner.” It’s picking the right abstraction for your use case, and leaving room to adapt.

Architect your system so you can swap in new tools without rebuilding everything from scratch. Composability matters more than chasing the newest acronym.

Balance Innovation With Maintainability

Innovative tools feel exciting in the moment. But if they lack community, documentation, or long-term support, they become technical debt in disguise.

Before adopting something new, ask:

  • How many other companies like ours use this tool?
  • Can we find help online when it breaks?
  • Will a new hire in six months be able to understand what we’ve built?

The best stacks aren’t just modern, they’re maintainable by teams that didn’t build them.

How to Future-Proof Your Stack

No one can predict where the data world will be five years from now. But you can make smart bets today that prepare your team to grow, adapt, and keep delivering value.

Here’s how:

Create Space for Experimentation

If everything in your data stack has to go through a rigid governance process, innovation dies. But if everything is experimental, nothing scales.

Solve this with sandbox environments, feature flags, or dedicated innovation tracks. Give your team room to try new things, without putting core infrastructure at risk.

Invest in Documentation and Data Literacy

It’s tempting to deprioritize documentation, but it’s one of the highest-leverage investments you can make.

  • Good documentation reduces onboarding time.
  • It prevents silent knowledge from walking out the door.
  • And it empowers analysts, engineers, and even non-technical teams to use data responsibly.

But documentation alone isn't enough; you need a culture of data literacy. Run internal training sessions. Encourage questions. Build dashboards with context. Your infrastructure is only as powerful as the people who know how to use it.

Choose Tools With Strong Ecosystems

Don’t just evaluate features, evaluate the community around a tool:

  • Is the documentation good?
  • Are new releases stable?
  • Is there a robust Slack or GitHub presence?

Open-source tools are only as good as their maintainers. Proprietary tools are only as good as their support and roadmap.

The best way to de-risk your stack? Choose tools that others have bet their business on, and that will still be around in two years.

Final Takeaway

A lot has changed with all the fancy new tools and sales pitches. But what your team is trying to do hasn't. Your goal is still to deliver reliable data and insights that the business can act on.

Much of that requires data pipelines and reliable data models built for both humans and machines to interact with. So build your data stack using tools that fit your needs. I referenced a few reliable ones in this article, but if your team is looking for help setting up its data stack, feel free to set up a free consultation.

I’d be happy to discuss your use cases and business needs!

That's how you build data infrastructure that is both future-proof and solves the problems you actually have: go over your business needs, find what is valuable, and build from there.

Now, this is just part 1; I'll be discussing ML and other aspects of data infrastructure in a future article.

Also! Don't forget to check out the articles below.

ETLs vs ELTs: Why are ELTs Disrupting the Data Market? – Data Engineering Consulting

NetSuite to Snowflake Integration: Ultimate Guide to 2 Effective Methods

Bridging the Gap: A Data Leader’s Guide To Helping Your Data Team Create Next Level Analysis

The Data Engineer’s Guide to ETL Alternatives

Explaining Data Lakes, Data Lake Houses, Table Formats and Catalogs

 
