Google Cloud Storage (GCS)

Availability: Core, Standard, Plus, Pro, Enterprise Flex, Self-Managed Enterprise, PyAirbyte
Support Level: Airbyte
Connector Version: 0.10.7 (last updated 16 days ago)
CDK Version: 7.10.1
Definition ID: 2a8c41ae-8c23-4be0-a73f-2ab10ca1a820

This page contains the setup guide and reference information for the Google Cloud Storage (GCS) source connector.

Note: Cloud storage may incur egress costs. Egress refers to data that is transferred out of the cloud storage system, such as when you download files or access them from a different location. For more information, see the Google Cloud Storage pricing guide.

Prerequisites

A Google Cloud account with a service account that has read access to the target GCS bucket, or a Google account that can authenticate via OAuth
The service account must have the Storage Object Viewer (roles/storage.objectViewer) role on the bucket, or equivalent permissions (storage.objects.get and storage.objects.list). For more details, see the GCS IAM documentation.
A GCS bucket containing the files you want to replicate

Setup guide

Create a service account

First, select an existing project or create a new project in the Google Cloud Console:

Sign in to your Google account.
Go to the Service Accounts page.
Click Create service account.
Create a JSON key file for the service account. The contents of this file will be provided as the Service Account Information in the connector configuration.

Grant permission to GCS

Grant the service account read access to your target bucket. At minimum, assign the Storage Object Viewer (roles/storage.objectViewer) role. See Using IAM permissions for details.

Set up the connector in Airbyte

Log in to your Airbyte account.
Click Sources and then click + New source.
On the Set up the source page, select Google Cloud Storage (GCS) from the Source type dropdown.
Enter a name for the connector.
Select an authorization type:
- Authenticate via Google (OAuth): Click Sign in with Google and complete the authentication workflow.
- Service Account Information: Paste the service account JSON key into the Service Account Information field.
Enter your GCS bucket name in the Bucket field.
Add a stream:
1. Enter a Name for the stream.
2. In the Format box, use the dropdown menu to select the format of the files you'd like to replicate. Toggle the Optional fields button within the Format box to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the File Format Settings section.
3. Optionally, enter the Globs which dictate which files to sync. Globs use glob-style pattern matching to select specific files for replication. If you are replicating all the files within your bucket, use ** as the pattern. For more precise pattern matching options, refer to the Path Patterns section.
4. Optionally, enter an Input schema to enforce a specific schema. By default, this value is set to {} and will automatically infer the schema from the files you are replicating. For details on providing a custom schema, refer to the User Schema section.
Optionally, configure the Start Date parameter that marks a starting date and time in UTC for data replication. Any files that have not been modified since this specified date/time will not be replicated. Use the provided datepicker or enter the desired date programmatically in the format YYYY-MM-DDTHH:mm:ssZ. Leaving this field blank replicates data from all files that have not been excluded by the Path Pattern and Path Prefix.
Click Set up source and wait for the tests to complete.
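The same settings can also be written out as a configuration object. The following is a minimal sketch, not the connector's definitive spec: the top-level keys follow the property names in the config fields reference, while the bucket name, stream name, and the keys nested under `credentials` and `streams` are assumptions chosen for illustration. The helper `format_start_date` is hypothetical and only shows how to produce the `YYYY-MM-DDTHH:mm:ssZ` form described for Start Date.

```python
import json
from datetime import datetime, timezone


def format_start_date(moment: datetime) -> str:
    """Render a timezone-aware datetime as UTC in the YYYY-MM-DDTHH:mm:ssZ form."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Top-level keys match the property names in the config fields reference.
# Everything nested under "credentials" and "streams" is an assumption made
# for illustration; check the connector spec for the exact key names.
config = {
    "bucket": "my-example-bucket",  # hypothetical bucket name
    "credentials": {"service_account": "<contents of the JSON key file>"},
    "streams": [
        {
            "name": "orders",  # hypothetical stream name
            "format": {"filetype": "csv"},
            "globs": ["**/*.csv"],
        }
    ],
    "start_date": format_start_date(datetime(2024, 1, 1, tzinfo=timezone.utc)),
}

print(json.dumps(config, indent=2))
```

Files last modified before the `start_date` value would be skipped; omitting the key corresponds to leaving the field blank in the UI.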

File URLs

The GCS source connector uses signed URLs to work with files when authenticated with a service account, and gs:// URIs when authenticated via Google OAuth.

File URLs are stored in the connection state. If you change the authorization type and use incremental sync, the next sync will not use the old state and will re-read all files in full refresh mode. Subsequent syncs will be incremental as expected.

Path Patterns

Tip: Path patterns use wcmatch.glob syntax with the GLOBSTAR and SPLIT flags enabled.

This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:

Referencing many files with just one pattern, e.g. ** would indicate every file in the bucket.
Referencing future files that don't exist yet and therefore don't have a specific path.

You must provide a path pattern. You can also provide multiple patterns split with | for more complex directory layouts.

Each path pattern is a reference from the root of the bucket, so don't include the bucket name itself in the patterns.

Some example patterns:

** : match everything.
**/*.csv : match all files with a specific extension.
myFolder/**/*.csv : match all CSV files anywhere under myFolder.
*/** : match everything at least one folder deep.
*/*/*/** : match everything at least three folders deep.
**/file.*|**/file : match every file called "file" with any extension or no extension.
x/*/y/* : match all files that sit in sub-folder x, then any folder, then folder y.
**/prefix*.csv : match all CSV files with a specific prefix.
**/prefix*.parquet : match all Parquet files with a specific prefix.

Let's look at a specific example, matching the following folder layout (MyFolder is the folder specified in the connector config as the root folder, which the patterns are relative to):

MyFolder
    -> log_files
    -> some_table_files
        -> part1.csv
        -> part2.csv
    -> images
    -> more_table_files
        -> part3.csv
    -> extras
        -> misc
            -> another_part1.csv

We want to pick up part1.csv, part2.csv and part3.csv (excluding another_part1.csv for now). We could do this a few different ways:

We could pick up every CSV file called "partX" with the single pattern **/part*.csv.
To be a bit more robust, we could use the dual pattern some_table_files/*.csv|more_table_files/*.csv to pick up relevant files only from those exact folders.
We could achieve the above in a single pattern by using the pattern *table_files/*.csv. This could however cause problems in the future if new unexpected folders started being created.
We can also recursively wildcard, so adding the pattern extras/**/*.csv would pick up any CSV files nested in folders below "extras", such as "extras/misc/another_part1.csv".

As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
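To try patterns against a known layout before configuring a stream, a small matcher can help. The sketch below is a simplified approximation written with the standard library only; it is not wcmatch and not the connector's implementation. It assumes that `**` spans folders, `*` and `?` stay within one path segment, and `|` separates alternative patterns. The names `glob_to_regex`, `select`, and `FILES` are hypothetical.

```python
import re

# The CSV files from the example layout, relative to the root folder.
FILES = [
    "some_table_files/part1.csv",
    "some_table_files/part2.csv",
    "more_table_files/part3.csv",
    "extras/misc/another_part1.csv",
]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate one glob into a regex (simplified approximation)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:[^/]+/)*")  # zero or more whole folders
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # anything, across folders
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # one character within a path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def select(patterns: str, files=FILES) -> list:
    """Return the files matched by any of the |-separated patterns."""
    compiled = [glob_to_regex(p) for p in patterns.split("|")]
    return [f for f in files if any(rx.fullmatch(f) for rx in compiled)]


print(select("**/part*.csv"))
print(select("some_table_files/*.csv|more_table_files/*.csv"))
print(select("extras/**/*.csv"))
```

Under these assumptions, the first two calls select part1.csv, part2.csv, and part3.csv, and the third selects only extras/misc/another_part1.csv. Edge cases such as hidden files or bracket expressions are not modeled here.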

User Schema

When using the Avro, JSONL, CSV, or Parquet format, you can provide a schema to use for the output stream. Note that this doesn't apply to the experimental Document file type format.

Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:

You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the _ab_additional_properties map.
Your initial dataset is quite small in terms of number of records, and you think the automatic type inference from this sample might not be representative of the data in the future.
You want to purposely define types for every column.
You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the _ab_additional_properties map.

The schema must be provided as valid JSON as a map of {"column": "datatype"} where each datatype is one of:

string
number
integer
object
array
boolean
null

For example:

{"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
{"username": "string", "friends": "array", "information": "object"}

File Format Settings

Avro

Convert Double Fields to Strings: Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision because there can be a loss of precision when handling floating point numbers.
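The precision concern is a general property of binary floating point rather than something specific to this connector. The short sketch below illustrates it with an arbitrary example value: the digits survive when the value is kept as text, but not when it passes through a Python `float`.

```python
from decimal import Decimal

raw = "12345678.123456789012345"  # arbitrary high-precision example value

as_float = float(raw)  # a 64-bit double keeps roughly 15 to 17 significant digits
as_text = raw          # a string keeps every digit as written

print(repr(as_float))    # trailing digits are lost
print(Decimal(as_text))  # the full value can still be recovered from the string
```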

CSV

Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported so please ensure that this process happens consistently over time.

Header Definition: How headers will be defined. User Provided assumes the CSV does not have a header row and uses the headers you provide. Autogenerated assumes the CSV does not have a header row, and the CDK generates headers of the form f{i}, where i is the column index starting from 0. Otherwise, the default behavior is to use the header from the CSV file. To autogenerate or provide column names for a CSV that has a header row, set the "Skip rows before header" option to ignore the header row.
Delimiter: Even though CSV is an acronym for Comma Separated Values, it is used more generally as a term for flat file data that may or may not be comma separated. The delimiter field lets you specify which character acts as the separator. To use tab-delimiters, you can set this value to \t. By default, this value is set to ,.
Double Quote: This option determines whether two quotes in a quoted CSV value denote a single quote in the data. Set to True by default.
Encoding: Some data may use a different character set, typically when different alphabets are involved. See the list of allowable encodings here. By default, this is set to utf8.
Escape Character: An escape character can be used to prefix a reserved character and ensure correct parsing. A commonly used character is the backslash (\). For example, given the following data:

Product,Description,Price
Jeans,"Navy Blue, Bootcut, 34\"",49.99

The backslash (\) is used directly before the second double quote (") to indicate that it is not the closing quote for the field, but rather a literal double quote character that should be included in the value (in this example, denoting the size of the jeans in inches: 34" ).

Leaving this field blank (default option) will disallow escaping.

False Values: A set of case-sensitive strings that should be interpreted as false values.
Null Values: A set of case-sensitive strings that should be interpreted as null values. For example, if the value 'NA' should be interpreted as null, enter 'NA' in this field.
Quote Character: In some cases, data values may contain instances of reserved characters, like a comma if that's the delimiter. CSVs can handle this by wrapping a value in defined quote characters so that on read it can parse it correctly. By default, this is set to ".
Skip Rows After Header: The number of rows to skip after the header row.
Skip Rows Before Header: The number of rows to skip before the header row.
Strings Can Be Null: Whether strings can be interpreted as null values. If true, strings that match the null_values set will be interpreted as null. If false, strings that match the null_values set will be interpreted as the string itself.
True Values: A set of case-sensitive strings that should be interpreted as true values.
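To see how the delimiter, quote character, and escape character interact, the jeans example from the Escape Character setting can be parsed with Python's standard `csv` module. This is only an illustration of the settings' meaning, assuming the connector's parser behaves similarly; it is not the connector's code. The last lines show the `f{i}` naming used by the Autogenerated header option.

```python
import csv
import io

# The sample data from the Escape Character setting, with a literal backslash
# before the inner double quote.
raw = 'Product,Description,Price\nJeans,"Navy Blue, Bootcut, 34\\"",49.99\n'

reader = csv.reader(
    io.StringIO(raw),
    delimiter=",",    # Delimiter
    quotechar='"',    # Quote Character
    doublequote=True, # Double Quote
    escapechar="\\",  # Escape Character
)
rows = list(reader)
print(rows[1])

# Autogenerated headers follow the f{i} pattern, with i starting from 0.
autogenerated = [f"f{i}" for i in range(len(rows[1]))]
print(autogenerated)
```

With the escape character set, the description field keeps both its commas and the literal double quote after 34.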

JSONL

Schemaless: When enabled, syncs will not validate or structure records against the stream's schema.

Parquet

Convert Double Fields to Strings: Whether to convert double fields to strings. This is recommended if you have decimal numbers with a high degree of precision because there can be a loss of precision when handling floating point numbers.

Unstructured document format

Parsing Strategy: The strategy used to parse documents. fast extracts text directly from the document which doesn't work for all files. ocr_only is more reliable, but slower. hi_res is the most reliable, but requires an API key and a hosted instance of unstructured and can't be used with local mode. See the unstructured.io documentation for more details.
Processing: Processing configuration. Options:
- Local - Process files locally, supporting fast and ocr modes. This is the default option.
- Via API - Process files via an API, using the hi_res mode. This option is useful for increased performance and accuracy, but requires an API key and a hosted instance of unstructured.
Skip Unprocessable Files: If true, skip files that cannot be parsed and pass the error message along as the _ab_source_file_parse_error field. If false, fail the sync.
Schemaless: When enabled, syncs will not validate or structure records against the stream's schema.

Excel

Schemaless: When enabled, syncs will not validate or structure records against the stream's schema.

Supported sync modes

The Google Cloud Storage (GCS) source connector supports the following sync modes:

Feature	Supported?	Notes
Full Refresh Sync	Yes
Incremental Sync	Yes

Supported file formats

This connector supports the following file formats:

Avro
CSV
Excel
JSONL
Parquet
Unstructured document format

The connector also supports reading files compressed with gzip (.gz) or bzip2 (.bz2), and can extract and process files from ZIP (.zip) archives. ZIP archives containing files in any of the supported formats are automatically extracted during sync.
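To preview what a compressed object contains before syncing it, the standard library can open the same three container types. The sketch below chooses a decompressor by file extension; `read_members` is a hypothetical helper and makes no claim about how the connector itself detects or extracts archives.

```python
import bz2
import gzip
import io
import zipfile


def read_members(name: str, payload: bytes) -> dict:
    """Return {inner file name: decompressed bytes} for one stored object."""
    if name.endswith(".gz"):
        return {name[: -len(".gz")]: gzip.decompress(payload)}
    if name.endswith(".bz2"):
        return {name[: -len(".bz2")]: bz2.decompress(payload)}
    if name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            return {
                info.filename: archive.read(info)
                for info in archive.infolist()
                if not info.is_dir()  # skip folder entries inside the archive
            }
    return {name: payload}  # not compressed: pass through unchanged


# Build a small ZIP archive in memory and read it back.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("part1.csv", "id\n1\n")

extracted = read_members("tables.zip", buffer.getvalue())
print(extracted)
```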

Reference

Config fields reference

Type	Property name
string	bucket
object	credentials
array<object>	streams
object	delivery_method
string	start_date

Changelog


Version	Date	Pull Request	Subject
0.10.9	2026-03-19	74779	Fix ZIP file detection for files with compound extensions (e.g. `.csv.zip`)
0.10.8	2026-03-19	74781	Fix records quadratic duplication when extracting ZIP archives
0.10.7	2026-03-03	70287	Update dependencies
0.10.6	2026-02-13	73332	Fix zip file extraction failing with `DeliverRawFiles has no attribute delivery_type` error
0.10.5	2025-11-25	69913	Update dependencies
0.10.4	2025-11-18	69426	Update dependencies
0.10.3	2025-11-11	69270	Update dependencies
0.10.2	2025-11-04	69159	Update dependencies
0.10.1	2025-10-29	69054	Update dependencies
0.10.0	2025-10-27	68619	Update dependencies
0.9.2	2025-10-21	68330	Update dependencies
0.9.1	2025-10-14	68032	Update dependencies
0.9.0	2025-10-07	67340	Promoting release candidate 0.9.0-rc.1 to a main version.
0.9.0-rc.1	2025-10-06	66671	Update to latest airbyte cdk
0.8.31	2025-09-30	66303	Update dependencies
0.8.30	2025-09-09	66088	Update dependencies
0.8.29	2025-08-23	65389	Update dependencies
0.8.28	2025-08-16	64980	Update dependencies
0.8.27	2025-08-09	64627	Update dependencies
0.8.26	2025-08-02	64367	Update dependencies
0.8.25	2025-07-26	63951	Update dependencies
0.8.24	2025-07-19	63564	Update dependencies
0.8.23	2025-07-12	62985	Update dependencies
0.8.22	2025-07-05	62822	Update dependencies
0.8.21	2025-06-28	61274	Update dependencies
0.8.20	2025-05-27	60868	Update dependencies
0.8.19	2025-05-24	60392	Update dependencies
0.8.18	2025-05-10	60012	Update dependencies
0.8.17	2025-05-03	59443	Update dependencies
0.8.16	2025-04-26	58915	Update dependencies
0.8.15	2025-04-19	58312	Update dependencies
0.8.14	2025-04-12	57772	Update dependencies
0.8.13	2025-04-05	57213	Update dependencies
0.8.12	2025-03-29	56520	Update dependencies
0.8.11	2025-03-22	55956	Update dependencies
0.8.10	2025-03-08	55314	Update dependencies
0.8.9	2025-03-01	54973	Update dependencies
0.8.8	2025-02-25	54677	Fix io.UnsupportedOperation: underlying stream is not seekable
0.8.7	2025-02-22	54458	Update dependencies
0.8.6	2025-02-15	53712	Update dependencies
0.8.5	2025-02-08	53365	Update dependencies
0.8.4	2025-02-01	52379	Update dependencies
0.8.3	2025-01-18	49174	Update dependencies
0.8.2	2024-11-25	48647	Starting with this version, the Docker image is now rootless. Please note that this and future versions will not be compatible with Airbyte versions earlier than 0.64
0.8.1	2024-10-28	45923	Update logging
0.8.0	2024-10-28	45414	Add support for OAuth authentication
0.7.4	2024-10-12	46858	Update dependencies
0.7.3	2024-10-05	46458	Update dependencies
0.7.2	2024-09-28	46178	Update dependencies
0.7.1	2024-09-24	45850	Add integration tests
0.7.0	2024-09-24	45671	Add .zip files support
0.6.9	2024-09-21	45798	Update dependencies
0.6.8	2024-09-19	45092	Update CDK v5; Fix OSError not raised in stream_reader.open_file
0.6.7	2024-09-14	45492	Update dependencies
0.6.6	2024-09-07	45232	Update dependencies
0.6.5	2024-08-31	45010	Update dependencies
0.6.4	2024-08-27	44796	Fix empty list of globs when prefix empty
0.6.3	2024-08-26	44781	Set file signature URL expiration limit default to max
0.6.2	2024-08-24	44733	Update dependencies
0.6.1	2024-08-17	44285	Update dependencies
0.6.0	2024-08-15	44015	Add support for all FileBasedSpec file types
0.5.0	2024-08-14	44070	Update CDK v4 and Python 3.10 dependencies
0.4.15	2024-08-12	43733	Update dependencies
0.4.14	2024-08-10	43512	Update dependencies
0.4.13	2024-08-03	43236	Update dependencies
0.4.12	2024-07-27	42693	Update dependencies
0.4.11	2024-07-20	42312	Update dependencies
0.4.10	2024-07-13	41865	Update dependencies
0.4.9	2024-07-10	41430	Update dependencies
0.4.8	2024-07-09	41148	Update dependencies
0.4.7	2024-07-06	41015	Update dependencies
0.4.6	2024-06-26	40540	Update dependencies
0.4.5	2024-06-25	40391	Update dependencies
0.4.4	2024-06-24	40234	Update dependencies
0.4.3	2024-06-22	40089	Update dependencies
0.4.2	2024-06-06	39255	[autopull] Upgrade base image to v1.2.2
0.4.1	2024-05-29	38696	Avoid error on empty stream when running discover
0.4.0	2024-03-21	36373	Add Gzip and Bzip compression support. Manage dependencies with Poetry.
0.3.7	2024-02-06	34936	Bump CDK version to avoid missing SyncMode errors
0.3.6	2024-01-30	34681	Unpin CDK version to make compatible with the Concurrent CDK
0.3.5	2024-01-30	34661	Pin CDK version until upgrade for compatibility with the Concurrent CDK
0.3.4	2024-01-11	34158	Fix issue in stream reader for document file type parser
0.3.3	2023-12-06	33187	Bump CDK version to hide source-defined primary key
0.3.2	2023-11-16	32608	Improve document file type parser
0.3.1	2023-11-13	32357	Improve spec schema
0.3.0	2023-10-11	31212	Migrated to file based CDK
0.2.0	2023-06-26	27725	License Update: Elv2
0.1.0	2023-02-16	23186	New Source: GCS
