The Basics of SFTP: Authentication, Encryption, and File Management

The Basics of SFTP: Authentication, Encryption, and File Management

December 23, 2024 data engineering 0
sftp how to

If you’re looking to pass hundreds of GBs of data quickly, you’re likely not going to use a REST API.

That’s why every day, companies share data sets of users, patient claims, financial transactions, and more via SFTP.

If you’ve been in the industry for a while, you’ve probably come across automated SFTP jobs that do just that. You’ve also likely had to encrypt or decrypt a CSV and had to interpret a schema file with parsing instructions that—somehow—are always a bit off the first time.

Now, sure, we all want to build real-time systems that use LLMs and other flashy new tools and solutions. Sometimes, that’s not what is called for.

After all, companies of all sizes—even tech giants like Facebook and Airbnb—still use SFTP to share critical information for analytical purposes(as well as operational). So, let’s dig into what SFTP is and how you will likely work with it.

What Is SFTP?

For those who may not know, SFTP (Secure File Transfer Protocol) is a method businesses use to securely transfer files over the internet. You can think of it as a safe digital mail system, where files like documents or CSVs are the “packages.”

SFTP is widely used to send data files—such as CSVs, TSVs, or pipe-delimited files—to a server. Once there, a scheduled job or event-triggered process typically retrieves the file and loads it into an internal or external database.

Why Do Companies Use SFTP?

Extracting data from APIs isn’t always the best option. Setting up an API takes time and resources to configure and maintain, while a basic SFTP job is considerably more straightforward to get up and running.

Here are just a few examples where I have seen SFTP jobs set up:

  • Aggregating and analyzing data and creating reporting on data such as healthcare claims.
  • Partnering with another organization to deliver a service to customers.
  • Coordinating with a third party to send physical mail (more common than you might think).
  • Sending payroll files from stores or franchises to a centralized payroll system.

In fact, I have now worked with several companies that use SFTP because they centralize similar data sets from dozens, if not hundreds, of companies to create reporting and analytics. Essentially, they act as a central hub, as pictured below.

sftp data workflows

 

From there, the various files from the companies would be processed into their similar entities. For example, insurance claims data is often broken down into patients, eligibility, claims and RX at the very least. This allows you to start to build data products that range from algorithms to dashboards.

One of the points that is implicit here is that unlike the company sending the data, the aggregator has the benefit of now having multiple companies data on the same entities. Meaning you can provide a very unique value add which is bench marking. Because you can now help your customers of your data product better understand where they stand compared to their competition(assuming they are open to their data being used that way of course).

All of which starts with getting the data via SFTP.

But what do you need to consider when setting up SFTP jobs?

Setting Up Your SFTP Job

Extracting data from files via SFTP isn’t terribly complex. In fact, some might consider it boring and not as exciting as building data systems that can manage exabytes of data. But there are plenty of companies that will need this work done, regardless of whether you like it or not.

Which is why I wanted to write about this topic.

So, to securely manage SFTP jobs, there are a few essential elements to keep in mind—starting with user authentication and encryption.

Authentication

Just like with APIs, user authentication is essential when working with SFTP. Here are two common methods (along with whitelisting):

  • SSH Keys: SFTP typically relies on SSH keys for user authentication. Instead of using a password, you set up a public-private key pair: the server holds the public key, while the client keeps the private key securely. When you need to connect to the server you use the private key to authenticate.
  • Password Authentication: This is exactly what you’d think. A username and a password.

File Encryption

Beyond authentication, many files transferred via SFTP are also encrypted using PGP (Pretty Good Privacy) Encryption. This is especially true when data has any form of PII, PHI or other forms of data that needs to remain secure.

There are a few ways a file might be encrypted, so it’s best to establish a standard approach with your external partners. Otherwise, you may find yourself juggling multiple encryption methods, which can be tedious when you write out your SFTP jobs (as I learned firsthand at Facebook).

Some companies might use a key pair approach, others might use a password and still others might use a combination. Here are a few explanations of each.

PGP Key Pair Encryption

 

sftp data engineering

This is the most common PGP method I’ve encountered. I’ve used it when working with insurance claims data, HR data, telecommunications data, and more. Each of these had sensitive information that benefited from being encrypted on transfer.

PGP key pair encryption does this by using a public and private key for secure communication: the sender encrypts the file or message using the recipient’s public key, and the recipient decrypts it with their private keyThis also means you will have to learn how to create a public key pair.

PGP Signatures (Signing Process)

PGP can also verify the authenticity and integrity of a file or message through digital signatures. This means the file will be checked to ensure you actually know who sent it. In this process, the sender “signs” a message or file using their private key, creating proof of origin. The recipient then uses the sender’s public key to verify the signature, ensuring that the message or file hasn’t been altered and confirming the sender’s identity. I’ve used this for companies that were very concerned about possibly having a bad actor gain the ssh key or password and username information and send a file over that could contain malware or other malicious scripts or injections. Especially since the processes to extract the data are automated, most people likely don’t go back and check the hundreds of files they are receiving to see if they are malicious.

PGP Encryption + Signature (Combined Process)

Now, not finally in terms of methods that exist, but for the last example I’ll give, PGP can combine encryption and signing in a single process for added security. This means the message or file is both encrypted for confidentiality and signed for authenticity—a practice adopted by several teams I’ve worked with.

Here’s how it works: first, the file or message is signed with the sender’s private key and then encrypted with the recipient’s public key. The recipient decrypts the message with their private key and verifies the signature using the sender’s public key.

Once you’ve got these two basics down, you can start considering the overall process and how you can ensure the data being processed is accurate.

Schema, Header, and Aggregate Files

When setting up an SFTP job, one of the first requirements is often a schema file. This file outlines the expected fields and their order, ensuring both sides have a common structure. Typically, companies prefer to define the schema to streamline integration on their end.

Especially if you are a data analytics company, the last thing you want to deal with is 100 different schemas that you now have to map. It doesn’t mean it doesn’t happen, of course. You should also set up a process where clear communication happens before said schema changes(cries in data engineering).

Once the schema file is agreed upon, there are a few additional files involved that might not have to do with the data, such as the header and aggregate files.

  • Header: Sometimes, a header will exist in the data file, though this isn’t always the case. When present, the header allows the receiving party to check the validity of the schema. Because, guess what? On more than one occasion, another company changed the schema of the file they were sending me months after we had agreed to the schema.
  • Aggregate File Checks: Less common but useful, an aggregate file is sometimes provided to double-check the integrity of the data in your main file. The file often includes an aggregate figure that describes the main file—like the total number of rows, total dollars spent, or unique users. This kind of acts like a Checksum, although it’s usually a separate query that was run for the raw file.

Will We Ever Replace SFTP With Data Sharing

There are likely tens of thousands of SFTP jobs that are still running today. Who am I kidding, there are probably even more than that.

However, a concept like data sharing(which tools like Snowflake and Databricks allow for) could help nullify the need to create so many data copies—making both APIs and SFTP less necessary.

With data sharing, companies could access shared datasets using just a few SQL commands, eliminating the heavy lifting involved in SFTP or API setups. That said, this shift is likely still 5-10 years away from becoming standard.

How To Send Data Via SFTP

Until then, let’s look at what a basic script would look like to interact with an SFTP.

import os
import paramiko
import gnupg

# Setup for SFTP and PGP
SFTP_HOST = 'your_sftp_host'
SFTP_USERNAME = 'your_sftp_username'
SFTP_PASSWORD = 'your_sftp_password'  # Use None if using SSH key authentication
SFTP_PRIVATE_KEY = 'path/to/your/private/key'  # Set to None if using password
SFTP_REMOTE_PATH = '/path/to/your/encrypted/file.gpg'
LOCAL_FILE_PATH = 'downloaded_file.gpg'
DECRYPTED_FILE_PATH = 'decrypted_file.txt'

PGP_PASSPHRASE = 'your_private_key_passphrase'
PGP_PRIVATE_KEY = 'path/to/your/private/key'  # If not already imported into GPG

gpg = gnupg.GPG()

# If the private key is not imported, import it
with open(PGP_PRIVATE_KEY, 'r') as key_file:
    gpg.import_keys(key_file.read())

Connect to the SFTP server and download the file using paramiko
def download_file_from_sftp():
    transport = paramiko.Transport((SFTP_HOST, 22))

    # Authentication: Use either password or SSH private key
    if SFTP_PASSWORD:
        print("Authenticating with password...")
        transport.connect(username=SFTP_USERNAME, password=SFTP_PASSWORD)
    else:
        print("Authenticating with SSH private key...")
        private_key = paramiko.RSAKey.from_private_key_file(SFTP_PRIVATE_KEY)
        transport.connect(username=SFTP_USERNAME, pkey=private_key)

    # Start the SFTP session
    sftp = paramiko.SFTPClient.from_transport(transport)
    print(f"Connected to SFTP server: {SFTP_HOST}")

    # Download the remote file
    sftp.get(SFTP_REMOTE_PATH, LOCAL_FILE_PATH)
    print(f"Downloaded file: {SFTP_REMOTE_PATH} to {LOCAL_FILE_PATH}")

    # Close the SFTP session and connection
    sftp.close()
    transport.close()

#Decrypt the file using PGP key pair
def decrypt_pgp_file():
    with open(LOCAL_FILE_PATH, 'rb') as encrypted_file:
        decrypted_data = gpg.decrypt_file(encrypted_file, passphrase=PGP_PASSPHRASE)

    if decrypted_data.ok:
        with open(DECRYPTED_FILE_PATH, 'w') as decrypted_file:
            decrypted_file.write(str(decrypted_data))
        print(f"File successfully decrypted to {DECRYPTED_FILE_PATH}")
    else:
        print(f"Failed to decrypt file. Status: {decrypted_data.status}")

# Main flow
if __name__ == '__main__':
    # Download the encrypted file from SFTP
    download_file_from_sftp()

    # Decrypt the downloaded file
    decrypt_pgp_file()

 

SFTP Isn’t Going Anywhere

The first time I worked with SFTP wasn’t in school but at my second job. Before I knew it, I was managing similar workflows at Facebook and with 3-4 different clients.

While it may not be the most exciting work, chances are you’ll run into SFTP at some point, and you might even need to create a generic script that can handle the different authentication and decryption methods. Sure, there are some out-of-the-box options, but having a custom script gives you more control.

Also! Don’t forget to check the articles below.

Alternatives to Azure Document Intelligence Studio: Exploring Powerful Document Analysis Tools

Measuring WAL Throughput in PostgreSQL: Step-by-Step Guide

How to cut exact scoring moments from Euro 2024 videos with SQL

Build A Data Stack That Lasts – How To Ensure Your Data Infrastructure Is Maintainable

Explaining Data Lakes, Data Lake Houses, Table Formats and Catalogs

How To Modernize Your Data Strategy And Infrastructure For 2025

How to Load Data into a Data Warehouse: Methods & Challenges

Leave a Reply

Your email address will not be published. Required fields are marked *