Essential Skills for Data Engineers in the Age of AI

research@theseattledataguy.com August 8, 2024

If you work in data, then AI is everywhere at this point.

But whether AI is hype or reality doesn’t change the fact that data engineers will play a major role in ensuring that the data sets used for these growing use cases, whether structured or unstructured, are usable by both machines and humans.

With the increasing focus on data and what it can do outside of basic analytics, data engineers will have to learn a broader array of tools and skills. Joe Reis touched on this in a recent article, where he calls out the fact that:

“People in different disciplines are starting to learn each other’s craft. I’m starting to see software engineers and analysts learning machine learning/AI. Data scientists are learning to write production-grade code so they can work better with software engineers and integrate ML models into software applications.”

In the same way, data engineers will need to learn more about what is going on in the world of machine learning and AI to be better prepared for this shift.

In this article, we’ll review the mental shifts data engineers will need to make and the new skills they will need to learn.

Dig Deeper Into Data Modeling And Data Storage

For the past decade, unstructured data has become more and more popular for analytical and machine learning use cases. After all, we now have the ability to store larger amounts of data and process it quickly.

To pull another Joe Reis quote, today “working with multimodal data (images, audio, video, text, tabular, semistructured, etc) is becoming more common. The world isn’t just about tabular datasets anymore. Anyone who works with data must be proficient in modeling data of various types, shapes, and varieties.”

That means data engineers need to understand the implications of both processing and storing data that is vastly different from traditional tabular data. It also means that there will be shifts in the storage layers we use, whether that means adopting table formats like Iceberg or, for some companies, moving away from relying on solutions like S3. After all, there are other ways to store data. We just default to S3 and GCS because they are convenient. There are also emerging players such as VAST Data, which has created its own storage and software layer to better fine-tune models and process data.

Overall, storage will play a key role as data becomes more complex and users expect systems to read not just tables but also pictures, PDFs, and more.

Learning New Concepts – The ML And AI Pipelines

Just storing the data isn’t all that needs to be done. Data engineers will have to be even more aware of how data is being used down the line and in some cases be involved with implementing it into ML and AI models.

That way they can help develop AI and ML workflows.

To do so, they’ll need to pick up a few new skills and concepts, including:

RAG

Retrieval-Augmented Generation (RAG) is an approach in natural language processing that enhances the capabilities of generative models by integrating them with retrieval mechanisms. Essentially, RAG combines the strengths of large language models with a retrieval system that accesses relevant documents or data from a vast corpus. This integration allows the model to generate more accurate, informative, and contextually appropriate responses by leveraging external information. RAG is particularly useful in applications requiring up-to-date knowledge or specific details, as it can dynamically pull in the latest information during the generation process, leading to more reliable and insightful outputs.
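To make the retrieval half of that description concrete, here is a minimal sketch in Python. It is a toy under stated assumptions: the corpus, the function names, and the prompt wording are all hypothetical, and simple word-count similarity stands in for the embedding search a real system would likely use. The final step, sending the prompt to a language model, is left out on purpose.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory corpus; a real system would query a document
# store or vector database instead.
DOCUMENTS = [
    "Apache Iceberg is an open table format for large analytic datasets.",
    "GPUs speed up workloads that can be parallelized, such as model training.",
    "Data drift happens when production data diverges from training data.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Combine the retrieved context and the question into one prompt."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is data drift?", DOCUMENTS))
```

The data engineering work sits mostly in `retrieve`: keeping the corpus fresh, clean, and searchable is what determines whether the generated answer can be trusted.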

If you’d like to learn more about RAG, then check out the content below!

Utilizing different types of compute such as GPUs

Not every data problem should be computed on a CPU. That’s why many platforms already offer the ability to use different types of hardware with just a few clicks. For example, companies like CoreWeave and Nvidia have worked to make developers’ lives easier by building AI platforms with VAST Data that many of the big players already use. Suddenly, you can be running on GPUs.

But this also means data engineers need to understand the various benefits of GPUs and the types of processing that benefit from said hardware.

If you’d like to learn more about GPUs and how to process data with them, then check out the content below!

End to End AI Pipelines

AI and ML pipelines aren’t purely linear like traditional data pipelines. They often have feedback cycles integrated into them. This means data engineers will need to understand this paradigm, both in terms of how to feed data back into the pipeline and how that data can be re-integrated into operational workflows. This is as much about learning the different components of AI and ML pipelines as it is about understanding how the business can actually benefit from these workflows.

How do companies like Riot Games deploy their machine learning models?

How can we, the data engineers, learn from that?

And can we make it better?

If you’d like to learn more about AI and ML pipelines, then check out the content below!

MLOps

Much of what goes into MLOps is covered by the point above about feedback cycles. However, there are also concepts it doesn’t cover, like managing data drift and logging model performance. This is where having strong MLOps skills can differentiate a data engineer. You can be the one who helps deploy the machine learning models or manages the feature engineering processes.
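As one illustration of what managing data drift can involve, here is a minimal sketch of the Population Stability Index (PSI), a common way to compare the distribution of a feature at training time with what shows up in production. The function name, the bin count, and the sample values are assumptions made for this example, not a recommended configuration.

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index between a baseline and a new sample.

    Bins are equal-width over the range of the baseline (expected) values.
    A value of 0 means the binned distributions match; larger values
    suggest more drift.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the baseline range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids taking log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values at training time and in production.
training = [1, 2, 3, 4, 5, 6, 7, 8]
production = [v + 4 for v in training]

print(psi(training, training))    # identical samples
print(psi(training, production))  # shifted sample scores higher
```

In practice a check like this would run on a schedule, with the score logged next to model performance metrics so that a drop in accuracy can be traced back to a shift in the inputs.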

You don’t have to be limited to building standard data pipelines.

If you’d like to learn more about MLOps, then check out the content below!

In the end, data engineers won’t just need to learn more technical skills with the increasing demand for AI and ML use cases.

Applying Data Creatively Towards Business Problems

Whether you’re a data engineer or analyst, for the past few years the challenges we often faced were technical. Many of those problems are becoming simpler as platforms remove much of the technical friction.

This also means the value data engineers and analysts will add in the future isn’t just going to be solving technical problems but also being able to look at data and find interesting use cases.

I have recently been reading about multiple ways companies are utilizing AI to improve the performance of human workers, from better detecting copyright infringement of pictures to easing workflow bottlenecks for VFX artists.

In some ways this can be scary because the path is less defined. We’ll likely have to go beyond just calculating churn or some predefined metrics.

However, those who have a good understanding of the business and can utilize some of the new tooling that is available will drive a lot more value.

What Data Engineers Will Have To Focus On Even More

Data Quality And Traceability Will Become Even More Important

It goes without saying that I am a huge proponent of data quality. AI, despite what some might say, isn’t going to be a silver bullet that can solve data quality problems. Companies need to ensure that the data they are processing is accurate. I have written several articles about basic data checks, but these won’t be enough. The future will require even more robust systems as the data increases in complexity.
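For reference, this is roughly what the basic end of the spectrum looks like: a minimal sketch of null, uniqueness, and range checks over a batch of rows. The field names (`order_id`, `amount`) and the rules are hypothetical. Recording the row index with each failure is a small nod to traceability, since it lets you point back to the offending record.

```python
def run_checks(rows):
    """Return a list of (row_index, problem) tuples for a batch of rows."""
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id is None:
            failures.append((i, "order_id is null"))
        elif order_id in seen_ids:
            failures.append((i, "order_id is duplicated"))
        else:
            seen_ids.add(order_id)

        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            failures.append((i, "amount is missing or negative"))
    return failures

# Hypothetical batch with one clean row and two bad ones.
batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 5.0},
    {"order_id": None, "amount": -2},
]

for index, problem in run_checks(batch):
    print(f"row {index}: {problem}")
```

Checks like these assume tidy, tabular rows. Images, PDFs, and free text need different validation, which is part of why basic checks alone fall short.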

Ensuring Data Security and Privacy

Security and privacy were already critically important, but they will become even more important with AI. It’s terrifying to think that some of these models are being trained on data that shouldn’t be public, or on data that could in the future start connecting dots that would eliminate people’s privacy where they once thought they had anonymity. In turn, it’ll be key to design the data pipelines and ML models of the future with this in mind, either reducing or eliminating data that could lead to breaches. And if companies need a financial reason, then GDPR and the CDPA will provide one.
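One way to reduce or eliminate risky data is to scrub records before they ever reach a training set. The sketch below drops some fields outright and replaces others with a keyed hash. The field names and the key are hypothetical, and note the limits: keyed hashing is pseudonymization, not anonymization, so anyone holding the key can still link records together.

```python
import hashlib
import hmac

# Hypothetical policy: which fields to drop and which to pseudonymize.
DROP_FIELDS = {"ssn"}
PSEUDONYMIZE_FIELDS = {"email"}

def scrub(record, secret_key):
    """Return a copy of the record that is safer to pass downstream.

    Fields in DROP_FIELDS are removed. Fields in PSEUDONYMIZE_FIELDS are
    replaced with an HMAC-SHA256 digest, so equal inputs still map to
    equal tokens (joins keep working) without exposing the raw value.
    """
    clean = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in PSEUDONYMIZE_FIELDS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            clean[field] = digest.hexdigest()
        else:
            clean[field] = value
    return clean

# Hypothetical record and key; a real key would come from a secrets manager.
record = {"email": "jane@example.com", "ssn": "000-00-0000", "plan": "pro"}
print(scrub(record, b"example-key"))
```

Doing this at ingestion, rather than right before training, means the raw values never spread through downstream tables in the first place.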

Embracing Real-time Data Processing and Stream Analytics

I am talking with more and more clients who are trying to bridge the time gap between their data and their workflows, especially when AI and ML are involved. Part of the benefit of real-time data is being able to feed it directly into an ML system, because a human will rarely be able to make decisions fast enough.

In some cases this might mean using solutions like Flink, and in others it means that the data needs to be processed by the AI as it happens. This will require systems that can reduce the friction between analytical storage and transaction processing.

In the end, all of these skills, combined with a solid understanding of the business domain you’re in and a little bit of practical ML and AI knowledge, will help take you to the next level.

How Can You Be Prepared For AI?

I don’t think AI will be taking data engineering jobs any time soon. In fact, it’ll likely amplify the need for them. But it will also force many data engineers to upskill so they can handle all the new data types and workflows that end-users will want to create.

After all, it’ll be up to the data engineers to ensure that whatever models are created use data that is reliable, safe, and understandable by both machines and humans. We also need to continue to learn more about what is going on in the AI and machine learning world. What types of data can these models use, and how can we better prepare that data for these up-and-coming use cases?

Don’t let this overwhelm you. Hopefully, some of the links above are helpful in providing you context on the new terms and words that are being thrown around.

If you’re still hoping to learn more about this change and some of these skills, then you can check out these articles.

9 Habits Of Effective Data Managers – Running A Data Team

Migrate Data From DynamoDB to MySQL – Two Easy Methods

Is Everyone’s Data A Mess – The Truth About Working As A Data Engineer

Normalization Vs. Denormalization – Taking A Step Back

What Is Change Data Capture – Understanding Data Engineering 101

Why Everyone Cares About Snowflake

Explaining Data Lakes, Data Lakehouses, Table Formats and Catalogs.