Automating File Loading Into SQL Server With Python And SQL
by Ben Rogojan
Loading files into databases are not exactly exciting work. It can be tedious and boring work. Often times it is constrained by tools like SSIS that freak out when a small change is made.
This is one of the core roles of BI and data engineering teams. They need to load files from operational databases into data warehouses. They do this by developing pipelines that process and transform data.
One problem we see is teams have multiple file types, data spec versions, and data providers. Each of which requires a different pipeline just to load into their raw database.
This forces teams to manage 10, 20 maybe even 50 smaller pipelines that are all subject to the change of the data providers. In addition, each new data provider requires an entirely new pipeline to be developed. It also requires managing infrastructure around each pipeline in order to log and track data pipelines statuses. This can quickly become very difficult to manage and tedious to build.
Another solution instead of attempting to manage multiple pipelines is to develop a central system that manages the various insert statements required to insert data into each different raw table. Kind of how we explain in the video below.
What the system does is essentially act as a lookup system that matches flat files to their insert queries. This reduces the need to redundantly have separate pipelines that would also all need an insert statement.
Instead, now the system uses one main system to load all the various tables. Theoretically, you can also put version numbers on each of the files that come in. This could be set up by tracking each version in the metadata table that will then allow you to tie a raw file to an insert file version.
Loading multiple file types by individual pipelines requires a lot of tedious development. Whether it be through SSIS or another system, the constant development quickly weighs heavily on data engineers. Instead of spending time developing analytical pipelines that drive clear value for your company. They will be stuck maintaining and developing operational pipelines.
This reduces its impact and reduces its overall value.
Thank you so much for reading! We hope you enjoyed the video. Please feel free to send us any questions. We would love to help you design your next automation system.
Also, if you liked this, feel free to check out our video on using bulk insert!
How Men’s Wearhouse Could Use Data Science
How To Use R To Develop Predictive Models
Web scraping With Google Sheets
What is A Decision Tree
How Algorithms Can Become Unethical and Biased
How To Develop Robust Algorithms
4 Must Have Skills For Data Scientists