What Is BigQuery And How Do You Load Data Into It?

If you work in data, you’ve likely used BigQuery, and you’ve probably used it without really thinking about how it operates under the hood. On the surface, BigQuery is Google Cloud’s fully managed, serverless data warehouse.
It’s the Redshift of GCP, except we like it a little more.
The question becomes: how does it work? There’s a lot going on behind the scenes. In this article we’ll discuss how BigQuery works, how you can load data into it, and more.
What Is BigQuery?
Before diving into how BigQuery works, let’s talk about what it is. BigQuery is a cloud-based data warehouse. It helps you store, manage, and analyze large amounts of data using SQL.
If you’re trying to help your boss or the business understand what it does, you can tell them that you can use BigQuery to:
- Analyze customer behavior
- Power dashboards and reports
- Join data from different sources
It’s used by data teams, analysts, engineers, and businesses of all sizes, especially those already working in Google Cloud.
How BigQuery Works: Under the Hood
BigQuery is built on top of Dremel, Google’s proprietary query execution engine designed for fast, dynamic analysis of massive datasets. Understanding the basics of how it works helps explain why BigQuery behaves the way it does, particularly when it comes to performance, scalability, and cost.
How does it work?
Here is a section from GCP’s documentation.
“Dremel turns your SQL query into an execution tree. The leaves of the tree it calls ‘slots’, and do the heavy lifting of reading the data from Colossus and doing any computation necessary. In the example from the last post, the slots are reading 100 Billion rows and doing a regular expression check on each one.” – GCP
Breaking it down further.
The leaves of the tree are called slots, which:
- Read data from Colossus (which we’ll get to).
- Perform the main computation (e.g., regex on 100B rows).
Then:
- The branches of the tree are called mixers, which perform data aggregation.
- The shuffle layer in between rapidly transfers data using Google’s Jupiter network.
- The entire execution (slots and mixers) is managed by Borg, which allocates hardware resources.
That’s the high level, but let’s dig deeper into the storage.
Storage
BigQuery relies on Colossus, Google’s distributed file system, as its storage backbone. Each Google data center operates its own Colossus cluster, each with enough disks to give BigQuery users access to thousands of dedicated disks at a time. Colossus handles critical tasks like replication, recovery from disk failures, and distributed management, ensuring there’s no single point of failure. Despite using disk-based storage, Colossus is fast enough to deliver performance comparable to many in-memory systems, thanks to its massive parallelism and optimization. This enables BigQuery to operate at scale without relying on expensive compute-bound architectures.
On top of Colossus, BigQuery stores data using the ColumnIO format, a highly optimized columnar storage layout designed specifically for large-scale analytics. Unlike traditional row-based databases, columnar storage organizes data by column rather than by row. This allows BigQuery to scan only the columns needed for a query, reducing the amount of data read and significantly improving performance. Compression is also applied at the column level, further enhancing efficiency.
Data Processing
When a query is submitted, BigQuery uses a massively parallel processing (MPP) engine to break the work into stages and distribute them across many worker nodes. Each worker is responsible for scanning a subset of the data and processing it in parallel. The intermediate results flow through a tree-like execution structure inspired by Dremel. At the leaves of this tree are “slots,” which do the heavy lifting of reading and computing on data. Higher up are “mixers,” which aggregate intermediate results. These components are connected by a fast shuffle layer powered by Google’s Jupiter network, and everything is orchestrated by Borg, Google’s cluster management system.
This architecture enables BigQuery to scale to dozens of petabytes of data seamlessly, offering high performance and flexibility without requiring users to over-provision compute resources. It’s designed from the ground up to make querying massive datasets feel as fast and simple as querying a small table on your laptop.
Separation of Storage and Compute
BigQuery separates storage from compute. Data is stored in Google’s distributed storage system, and compute resources are allocated dynamically when queries run. This design allows BigQuery to scale effortlessly: users don’t need to manage clusters, provision servers, or worry about over- or under-allocating resources.
Query Optimization and Caching
BigQuery includes a built-in query optimizer that determines the most efficient way to execute a query across the distributed system. Additionally, it can cache the results of previous queries. If the same query is run again on the same data, BigQuery will return cached results at no cost.
Streaming and Batch Ingest
BigQuery supports both batch and streaming ingestion. Streaming inserts make it possible to analyze data seconds after it arrives, while batch loading supports higher throughput for large-scale data movement.
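As a quick illustration, here’s a minimal streaming-insert sketch using the google-cloud-bigquery Python client; the table name and row payloads are placeholders, and the destination table is assumed to already exist with a matching schema:

from google.cloud import bigquery

# Assumes credentials are already configured (e.g., GOOGLE_APPLICATION_CREDENTIALS).
client = bigquery.Client()

table_id = "my_project.my_dataset.events"  # placeholder destination table

rows_to_insert = [
    {"event_id": "abc-123", "event_type": "page_view", "user_id": 42},
    {"event_id": "def-456", "event_type": "click", "user_id": 42},
]

# Legacy streaming inserts (the insertAll API); rows become queryable within seconds.
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Errors while streaming rows: {errors}")

For large, periodic loads, the batch path (a load job from a local file or GCS, shown later in this article) generally offers higher throughput.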
BigQuery Pricing
BigQuery pricing is where a lot of folks get nervous. OK, a lot of cloud pricing makes people nervous. Just look at BigQuery’s pricing page.
You Only Pay for What You Use…Mostly
BigQuery’s pricing is based on two core areas: compute (when you run queries) and storage (when you store data). That’s it. No virtual machines to manage. No complex cluster configurations. You just load your data and query it.
But how you’re charged for those two things depends on which pricing model you’re using.
Compute Pricing: On-Demand vs. Capacity
The Default: On-Demand Pricing
Many teams start with on-demand pricing, which means you’re charged based on the amount of data your query scans.
As of 2024, the cost is $6.25 per TiB of data scanned. The first 1 TiB per month is free, which is great for teams getting started.
And here’s the catch: BigQuery charges based on the columns you SELECT and the tables you touch, regardless of how small the result set is. Even if you use LIMIT 1, BigQuery will still scan the full table unless you’ve properly partitioned and clustered it.
There’s also a minimum charge: every query processes at least 10 MB per table, so very small queries might still cost something.
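To put that in perspective: a query that scans 50 GiB costs roughly 50 / 1,024 × $6.25 ≈ $0.31, while a query that scans a full 10 TiB table runs about $62.50, no matter how few rows it returns.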
For Predictable Workloads: Capacity Pricing
If you’re running lots of queries or you want more cost predictability, BigQuery offers capacity-based pricing, where you buy compute power in the form of slots.
A slot is basically a virtual CPU that BigQuery uses to run your queries. You can choose how many you want, and BigQuery will allocate them to your jobs.
Pricing starts at $0.04 per slot-hour in the Standard Edition. For higher performance and advanced features, Enterprise and Enterprise Plus editions cost more but offer volume discounts:
- Enterprise: $0.06/slot-hour (down to $0.036 for 3-year commits)
- Enterprise Plus: $0.10/slot-hour (down to $0.06 with commitment)
Capacity pricing is especially useful when you want guaranteed performance or you’re running concurrent workloads like dashboards, ML models, or heavy daily pipelines.
Storage Pricing: Where Your Data Lives
Storage in BigQuery is separate from compute, and it’s pretty affordable.
There are two types:
- Active storage – tables that have been modified in the last 90 days. Priced at $0.02 per GiB per month.
- Long-term storage – tables that haven’t been changed in 90 days. Priced at $0.01 per GiB per month, a 50% discount that’s applied automatically.
The best part? There’s no difference in performance between active and long-term storage. So if your data is stable, you get a discount without any drawbacks.
The first 10 GiB per month is free, and pricing is prorated by the second, so you only pay for what you actually store.
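As a quick example: keeping 500 GiB of frequently updated tables in active storage costs about 500 × $0.02 = $10 per month, and if those tables then sit untouched for 90 days, the same data drops to roughly $5 per month.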
What Else Might Cost You?
BigQuery has other services that might come into play depending on how you’re using the platform:
Streaming Inserts
If you’re streaming real-time data into BigQuery (rather than batch loading), you’ll pay:
- $0.01 per 200 MB with the older insertAll API
- $0.025 per GiB with the newer Storage Write API (first 2 TiB/month are free)
Streaming Reads (Storage Read API)
For reading data via BigQuery’s high-throughput read API (often used for real-time dashboards):
- $1.10 per TiB read
- First 300 TiB/month is free
BigQuery ML
If you’re using BigQuery to train machine learning models:
- Most training operations cost $312.50 per TiB scanned
- Predictions, evaluations, and inspections are billed like regular queries at $6.25/TiB
- AutoML models and DNNs may also incur Vertex AI costs, depending on how they’re configured
Omni (Multi-cloud BigQuery)
Running BigQuery on AWS or Azure? Expect slightly higher on-demand query costs, ranging from $7.82 to $10.55 per TiB, depending on region, plus additional fees for cross-cloud data transfers.
Built-in Safeguards (Use Them!)
Worried about runaway costs? BigQuery gives you guardrails:
- Set maximum bytes billed on a per-query basis
- Apply user-level or project-level custom cost controls
- Monitor spend in the Cloud Billing dashboard
You can also preview how much a query would cost using a dry run, which is a lifesaver.
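Here’s a minimal sketch of both guardrails using the google-cloud-bigquery Python client; the query, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

query = "SELECT user_id, event_type FROM my_dataset.events WHERE event_date = '2024-01-01'"

# Dry run: estimate bytes processed without actually running (or paying for) the query.
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_run_job = client.query(query, job_config=dry_run_config)
print(f"This query would process {dry_run_job.total_bytes_processed / 1024**3:.2f} GiB.")

# Guardrail: refuse to run the query if it would bill more than ~1 GiB.
capped_config = bigquery.QueryJobConfig(maximum_bytes_billed=1 * 1024**3)
rows = client.query(query, job_config=capped_config).result()  # errors out if the cap is exceeded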
Understanding BigQuery Slots
What Is a Slot?
A slot is a unit of computational capacity in BigQuery. Every query you run gets broken down into smaller tasks, and those tasks are executed using slots. Each slot represents the ability to process a portion of your query for a fixed period of time (roughly equivalent to one virtual CPU’s worth of work).
If you’re on on-demand pricing, BigQuery manages slot allocation behind the scenes, and you don’t see or think about slots directly. But if you’re on flat-rate or reservation-based pricing, slots become a critical part of performance management.
Why Slots Matter
- Performance: More slots = more parallelism. If you have a large query and limited slots, it may take longer to run because tasks are queued or executed sequentially.
- Concurrency: If many users or scheduled jobs are running at once, more slots allow those queries to run simultaneously without slowing each other down.
- Cost Control: With flat-rate pricing, you’re buying a fixed number of slots for a predictable monthly fee. This is often more cost-effective at scale.
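If you want a rough read on how many slots your queries actually consume, one common approach is to look at the INFORMATION_SCHEMA jobs views. Here’s a sketch via the Python client; the region qualifier and one-day window are assumptions you’d adjust for your project:

from google.cloud import bigquery

client = bigquery.Client()

# Approximate average slot usage per query over the last day (US multi-region assumed).
slot_usage_sql = """
SELECT
  job_id,
  SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY avg_slots DESC
LIMIT 10
"""

for row in client.query(slot_usage_sql).result():
    print(row.job_id, row.avg_slots, row.total_bytes_processed)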
How to Load Data into BigQuery
BigQuery is great for querying data. But first, you need to get that data in. The good news: loading data into BigQuery is flexible. Whether you’re working with CSV files, streaming real-time events, or syncing data from other systems, there’s a workflow that fits.
Of course, it also depends on where your data lives. Is it in Google Cloud Storage, on an SFTP server, in HubSpot, etc.? But for now I’m just going to show a few options.
Here’s a breakdown of common ways to load data into BigQuery.
1) Upload a File (Fastest Way to Get Started)
If you’re just getting started with BigQuery and want to try it out with your own data, the easiest way to begin is by uploading a local file directly through the web interface. No code. No infrastructure. Just drag, drop, and query.
Let’s walk through how it works.
Step-by-Step: Uploading a File
- Open the BigQuery Console
Head to console.cloud.google.com/bigquery and select your project.
- Choose or Create a Dataset
BigQuery organizes data into datasets, which are like folders for your tables. If you don’t already have one, click “Create Dataset,” give it a name, and choose a data location (e.g., us or europe-west1).
- Click “Create Table”
You’ll see options for where your data is coming from. For a local upload:
  - Set the Source to “Upload”
  - Click “Browse” and select your file (CSV and JSON are the most common)
- Configure Destination Table
  - Choose a table name
  - Make sure the destination dataset is selected
  - Decide whether to auto-create the table or append/overwrite an existing one
- Set Schema
  - You can let BigQuery auto-detect the schema (works well for CSVs with headers and structured JSON)
  - Or define it manually if you need control over data types (especially helpful for larger or more complex datasets)
- Click “Create Table”
Within seconds (OK, in theory it could take longer depending on the size of your data, your bandwidth, etc.), your data is loaded and ready to query.
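If you’d rather script that same upload instead of clicking through the console, here’s a rough equivalent using the google-cloud-bigquery Python client; the file path and table name are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

table_id = "my_project.my_dataset.customers"  # placeholder destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # let BigQuery infer the schema, like the console flow
)

with open("customers.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

load_job.result()  # wait for the load job to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}.")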
2) Load from Google Cloud Storage (GCS)
For larger datasets or recurring loads, Google Cloud Storage (GCS) is the standard choice.
You upload your file to a GCS bucket, then tell BigQuery where to find it. You can do this through the UI or via SQL:
LOAD DATA INTO my_dataset.my_table
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/myfile.csv']
);
GCS supports all the usual file formats (CSV, JSON, Avro, Parquet, ORC) and works well when paired with tools like gsutil, Dataflow, or external data sources.
You could also use Python; I’m making a video and will include code there.
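In the meantime, here’s a rough sketch of the Python equivalent of the LOAD DATA statement above, using the google-cloud-bigquery client (bucket and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# Point BigQuery at the file sitting in your GCS bucket.
load_job = client.load_table_from_uri(
    "gs://my-bucket/myfile.csv",
    "my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete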
3) Use an ETL Tool or Scheduler
If your team wants to automate data warehouse ingestion without having to program it all yourselves, ETL tools are a great approach. You can use them to load data into BigQuery easily.
These tools handle retries, schema mapping, schema changes, and scheduling, so you can keep data fresh without manual effort.
For example, the tools our team has used to help set up our clients’ data infrastructure include options such as:
- Estuary – Ideal for both batch and real-time data loading. Estuary supports both streaming and batch workloads and is well-suited for syncing systems like PostgreSQL, MySQL, and APIs (think NetSuite, Salesforce, Marketo, etc.) into BigQuery with minimal latency.
- Portable – Portable is focused on long-tail data integrations, the obscure SaaS tools and platforms that most ELT tools don’t cover. While the big players have 100–200 connectors, Portable offers hundreds more, often building new ones on-demand for customers.
- dlt – dlt is a Python-based open-source ELT framework that’s ideal for teams who want more control over their data pipelines but don’t want to build everything from scratch. Unlike GUI-based tools, dlt lets you define pipelines as code, version them in Git, and deploy them as part of your existing CI/CD workflows (see the sketch below).
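To illustrate the “pipelines as code” idea, here’s a minimal dlt sketch; the resource, pipeline, and dataset names are made up, and you’d swap the toy generator for a real source:

import dlt

@dlt.resource(table_name="orders", write_disposition="append")
def orders():
    # Placeholder source: in practice this would pull from an API or a database.
    yield [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 13.5}]

pipeline = dlt.pipeline(
    pipeline_name="orders_to_bigquery",
    destination="bigquery",  # reads your GCP credentials from dlt's config/secrets
    dataset_name="raw_orders",
)

load_info = pipeline.run(orders())
print(load_info)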
Final Thoughts
BigQuery can seem intimidating at first: serverless infrastructure, columnar storage, slots, streaming, and, if you dig under the hood, terms like Dremel flying around. But much of that is abstracted away. You can focus more on your data workflows instead of the data warehouse.
And as you scale, BigQuery scales with you, offering deep performance tuning, advanced modeling tools, and tight integrations with the rest of Google Cloud.
Most importantly, it’s okay not to know everything from day one. Everyone has had that “wait, what exactly is a slot?” moment. This post was meant to give you a solid foundation, and clear up some of those questions you might not have wanted to ask out loud.
So go ahead: load some data, run some queries, and don’t be afraid to break things (or at least hit the 1 TiB free tier first). BigQuery was built to help you move fast and query fast.
Also! Don’t forget to check the articles below.
ETLs vs ELTs: Why are ELTs Disrupting the Data Market? – Data Engineering Consulting
NetSuite to Snowflake Integration: Ultimate Guide to 2 Effective Methods
Bridging the Gap: A Data Leader’s Guide To Helping Your Data Team Create Next Level Analysis
The Data Engineer’s Guide to ETL Alternatives
What Is Snowflake – Breaking Down What Snowflake Is, How Snowflake Credits Work And More
Explaining Data Lakes, Data Lake Houses, Table Formats and Catalogs
How to cut exact scoring moments from Euro 2024 videos with SQL
How To Modernize Your Data Strategy And Infrastructure For 2025