
Set up Google Cloud Storage

Dynamically import tasks and export annotations to Google Cloud Storage (GCS) buckets in Label Studio. For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Configure access to your Google Cloud Storage bucket

First, review the information in Cloud storage for projects and Secure access to cloud storage.

Then you will need to complete the following prerequisites:

1. Enable programmatic access to your bucket

See Cloud Storage Client Libraries in the Google Cloud Storage documentation for how to set up access to your GCS bucket.

2. Set up authentication to your bucket

Your account must have the Service Account Token Creator and Storage Object Viewer roles and storage.buckets.get access permission. See Setting up authentication and IAM permissions for Cloud Storage in the Google Cloud Storage documentation.

3. Configure CORS

Set up cross-origin resource sharing (CORS) access to your bucket, using a policy that allows GET access from the same host name as your Label Studio deployment. See Configuring cross-origin resource sharing (CORS) in the Google Cloud User Guide.

Note

Configuring CORS is only required if you are using pre-signed URLs. If you are proxying data through Label Studio, you do not have to configure CORS. For more information, see Pre-signed URLs vs Storage proxies.

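Use or modify the following example. As written, it allows GET requests from any origin ("*"); to restrict access, replace "*" with the URL of your Label Studio deployment (for example, https://label-studio.example.com):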

echo '[
   {
      "origin": ["*"],
      "method": ["GET"],
      "responseHeader": ["Content-Type","Access-Control-Allow-Origin"],
      "maxAgeSeconds": 3600
   }
]' > cors-config.json

Replace YOUR_BUCKET_NAME with your actual bucket name in the following command to update CORS for your bucket:

gsutil cors set cors-config.json gs://YOUR_BUCKET_NAME
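
To follow the stricter policy described above (GET access only from the host of your Label Studio deployment), the policy file can also be generated rather than written by hand. The following is a minimal sketch, not part of Label Studio: the function name build_cors_config and the host name label-studio.example.com are hypothetical.

```python
import json


def build_cors_config(origin: str, max_age_seconds: int = 3600) -> list:
    """Return a GCS CORS policy that allows GET requests from a single origin."""
    if not origin.startswith(("http://", "https://")):
        raise ValueError("origin must include the scheme, e.g. https://host")
    return [
        {
            # A CORS origin has no trailing slash.
            "origin": [origin.rstrip("/")],
            "method": ["GET"],
            "responseHeader": ["Content-Type", "Access-Control-Allow-Origin"],
            "maxAgeSeconds": max_age_seconds,
        }
    ]


if __name__ == "__main__":
    # Hypothetical host name; replace it with the URL of your deployment.
    print(json.dumps(build_cors_config("https://label-studio.example.com"), indent=2))
```

Redirect the output to cors-config.json and apply it with the gsutil cors set command shown above.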

Google Cloud Storage

Before you begin:

Google Application Credentials

You will need to provide Google Application Credentials: a JSON key file for a service account, which you enter when setting up your storage connection. To create one:

  1. From the Google Cloud Console, go to IAM & Admin > Service Accounts.
  2. Select the specific service account you need credentials for. If you don’t have one, create a new one.
  3. In the service account details, go to the Keys tab and click Add Key > Create new key.
  4. Select the JSON key type and click Create. The JSON file will be generated and automatically downloaded to your computer.
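
Before entering the downloaded key in Label Studio, it can help to confirm that the file is a service account key and to read the project ID from it. The sketch below assumes the standard service account key layout (type, project_id, private_key, and client_email fields); the function name check_service_account_key is hypothetical.

```python
import json

# Fields expected in a Google service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json: str) -> str:
    """Validate the key-file contents and return the project ID they name."""
    key = json.loads(raw_json)
    if not isinstance(key, dict):
        raise ValueError("key file must contain a JSON object")
    missing = [field for field in REQUIRED_KEY_FIELDS if not key.get(field)]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))
    if key["type"] != "service_account":
        raise ValueError("unexpected credential type: " + repr(key["type"]))
    return key["project_id"]


# Usage (the path is a placeholder):
#   with open("path/to/key.json", encoding="utf-8") as f:
#       print(check_service_account_key(f.read()))
```

The returned value is the same project ID that the Google Project ID field asks for during storage setup.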

See also:

Note

If you're using a service account to authorize access to the Google Cloud Platform, make sure to activate it. See gcloud auth activate-service-account.

Create a source storage connection

From Label Studio, open your project and select Settings > Cloud Storage > Add Source Storage.

Select Google Cloud Storage and click Next.

Configure Connection

Complete the following fields and then click Test connection:

Field descriptions:

  • Storage Title: Enter a name to identify the storage connection.
  • Bucket Name: Enter the name of your GCS bucket.
  • Google Application Credentials: Enter the JSON file with the GCS credentials you created to manage authentication for your bucket.
    On-prem users: Alternatively, you can use the GOOGLE_APPLICATION_CREDENTIALS environment variable and/or set up Application Default Credentials, so that users do not need to configure credentials manually. See Application Default Credentials for enhanced security below.
  • Google Project ID: Enter the ID of your Google project in which the bucket is located (for example, my-label-studio-project).
    If you're unsure, you can find this in Google Cloud Console under IAM & Admin > Settings.
  • Use pre-signed URLs (On) / Proxy through the platform (Off): This determines how data from your bucket is loaded.
      • Use pre-signed URLs: Label Studio generates time-limited HTTPS links directly to your S3/GCS/Azure objects and redirects the browser there (HTTP 303), so annotators’ browsers download media straight from cloud storage. This is usually faster and scales better, but requires correct CORS and presign permissions on the bucket. It also means traffic flows from browser to storage, not through Label Studio.
      • Proxy through the platform: The backend downloads the file from cloud storage and streams it to the browser, so all media traffic passes through the Label Studio server. This keeps data fully inside the Label Studio/network boundary, enforces task-level access checks on every request, and avoids CORS/presign setup, but uses more Label Studio worker resources and can be slightly slower.
    For more information, see Pre-signed URLs vs Storage proxies.
  • Expire pre-signed URLs (minutes): Control how long pre-signed URLs remain valid.

Import Settings & Preview

Complete the following fields and then click Load preview to ensure you are syncing the correct data:

  • Bucket Prefix: Optionally, enter the directory name within your bucket that you would like to use. For example, data-set-1 or data-set-1/subfolder-2.
  • Import Method: Select whether you want to create a task for each file in your bucket or use a JSON/JSONL/Parquet file to define the data for each task.
  • File Name Filter: Specify a regular expression to filter bucket objects. Use .* to collect all objects.
  • Scan all sub-folders: Enable this option to perform a recursive scan across subfolders within your bucket.
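
To see how a File Name Filter expression behaves before syncing, you can try it against a few sample object names. This is a minimal sketch that assumes the filter is applied like Python's re.match, that is, matched from the beginning of the full object name including any prefix. The object names and the helper filter_object_names are hypothetical.

```python
import re


def filter_object_names(names, pattern):
    """Keep the object names that the expression matches from the start."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


# Hypothetical bucket contents.
objects = [
    "data-set-1/image-001.jpg",
    "data-set-1/image-002.png",
    "data-set-1/notes.txt",
    "data-set-1/subfolder-2/tasks.json",
]

print(filter_object_names(objects, r".*"))                # every object
print(filter_object_names(objects, r".*\.(jpe?g|png)$"))  # JPEG and PNG images only
print(filter_object_names(objects, r".*\.json$"))         # JSON files only
```

Under this assumption, an expression without a leading .* (such as \.json$) selects nothing, because the match has to start at the first character of the object name.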

Review & Confirm

If everything looks correct, click Save & Sync to sync immediately, or click Save to save your settings and sync later.

Tip

You can also use the API to sync import storage.

Create a target storage connection

From Label Studio, open your project and select Settings > Cloud Storage > Add Target Storage.

Select Google Cloud Storage and click Next.

Complete the following fields:

  • Storage Title: Enter a name to identify the storage connection.
  • Bucket Name: Enter the name of your GCS bucket.
  • Bucket Prefix: Optionally, enter the directory name within your bucket that you would like to use. For example, data-set-1 or data-set-1/subfolder-2.
  • Google Application Credentials: Enter the JSON file with the GCS credentials you created to manage authentication for your bucket.
    On-prem users: Alternatively, you can use the GOOGLE_APPLICATION_CREDENTIALS environment variable and/or set up Application Default Credentials, so that users do not need to configure credentials manually. See Application Default Credentials for enhanced security below.
  • Google Project ID: Enter the ID of your Google project in which the bucket is located (for example, my-label-studio-project).
    If you're unsure, you can find this in Google Cloud Console under IAM & Admin > Settings.
  • Can delete objects from storage: Enable this option if you want to delete annotations stored in the bucket when they are deleted in Label Studio. Your credentials must include the ability to delete bucket objects.

After adding the storage, click Sync.

Tip

You can also use the API to sync export storage.

Application Default Credentials for enhanced security for GCS

If you use Label Studio on-premises with Google Cloud Storage, you can set up Application Default Credentials to provide cloud storage authentication globally for all projects, so users do not need to configure credentials manually.

The recommended way to do this is by using the GOOGLE_APPLICATION_CREDENTIALS environment variable. For example:

export GOOGLE_APPLICATION_CREDENTIALS=json-file-with-GCP-creds-23441-8f8sd99vsd115a.json
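
A relative path like the one in this example is resolved against the working directory of whichever process reads the variable, which may not be the directory you exported it from. The following sketch shows one way to check what the variable resolves to. The helper resolve_credentials_path is hypothetical and not part of Label Studio.

```python
import os


def resolve_credentials_path(environ=None):
    """Return the absolute path named by GOOGLE_APPLICATION_CREDENTIALS."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS", "").strip()
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    # Relative paths are resolved against the current working directory.
    return os.path.abspath(path)


if __name__ == "__main__":
    try:
        resolved = resolve_credentials_path()
        status = "found" if os.path.isfile(resolved) else "missing"
        print(resolved + " (" + status + ")")
    except RuntimeError as err:
        print(err)
```

Running this from the directory where Label Studio starts shows whether the key file would be found there. Exporting an absolute path avoids the ambiguity.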

Google Cloud Storage with Workload Identity Federation (WIF)

In Label Studio Enterprise, you can use Workload Identity Federation (WIF) pools with Google Cloud Storage.

Unlike with application credentials, WIF allows you to use temporary credentials. Each time you make a request to GCS, Label Studio connects to your identity pool to request temporary credentials.

For more information, see Google Cloud Storage with Workload Identity Federation (WIF) in our Enterprise documentation.

Add storage with the Label Studio API

You can also use the API to programmatically create connections. See our API documentation.
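
As a rough illustration, the sketch below builds (but does not send) a request that creates a GCS storage connection. The endpoint paths (/api/storages/gcs and /api/storages/export/gcs), the Token authorization scheme, and the payload field names are assumptions that mirror the form fields described earlier. Confirm all of them against the API documentation for your Label Studio version. The project ID, URL, and token are placeholders.

```python
import json
import urllib.request


def build_gcs_storage_request(base_url, token, payload, export=False):
    """Build a POST request that would create a GCS source or target storage."""
    # Assumed endpoint paths; confirm them in the API documentation.
    path = "/api/storages/export/gcs" if export else "/api/storages/gcs"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Token " + token,  # assumed token scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumed field names, mirroring the connection form.
source_payload = {
    "project": 1,  # hypothetical project ID
    "title": "GCS source",
    "bucket": "YOUR_BUCKET_NAME",
    "prefix": "data-set-1",
    "regex_filter": ".*",
    "google_project_id": "my-label-studio-project",
    "presign": True,
    "presign_ttl": 15,
}

request = build_gcs_storage_request(
    "http://localhost:8080", "YOUR_API_TOKEN", source_payload
)
print(request.get_method(), request.full_url)
# To send the request: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before anything reaches the server.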

IP filtering for enhanced security for GCS

Google Cloud Storage supports bucket IP filtering, which restricts access to your data based on the source IP address of each request. This helps prevent unauthorized access and gives you fine-grained control over which networks can reach your storage buckets.

Read more about Source storage behind your VPC.

Common Use Cases:

  • Restrict bucket access to only your organization’s IP ranges
  • Allow access only from specific VPC networks in your infrastructure
  • Secure sensitive data by limiting access to known IP addresses
  • Control access for third-party integrations by whitelisting their IPs

How to Set Up IP Filtering

  1. Create your GCS bucket through the console or CLI.
  2. Create a JSON configuration file to define IP filtering rules. You have two options.

     For public IP ranges, replace each placeholder with an IP address or with a range in CIDR notation (JSON does not allow comments, so keep the file free of them):

    {
      "mode": "Enabled",
      "publicNetworkSource": {
        "allowedIpCidrRanges": [
          "xxx.xxx.xxx.xxx",
          "xxx.xxx.xxx.xxx",
          "xxx.xxx.xxx.xxx/xx"
        ]
      }
    }

     For VPC network sources:

    {
      "mode": "Enabled",
      "vpcNetworkSources": [
        {
          "network": "projects/PROJECT_ID/global/networks/NETWORK_NAME",
          "allowedIpCidrRanges": [
            "RANGE_CIDR"
          ]
        }
      ]
    }

  3. Apply the IP filtering rules to your bucket using the following command:

    gcloud alpha storage buckets update gs://BUCKET_NAME --ip-filter-file=IP_FILTER_CONFIG_FILE

  4. To remove IP filtering rules when no longer needed:

    gcloud alpha storage buckets update gs://BUCKET_NAME --clear-ip-filter

Limitations to Consider

  • Maximum of 200 IP CIDR blocks across all rules
  • Maximum of 25 VPC networks in the IP filter rules
  • Not supported for dual-regional buckets
  • May affect access from certain Google Cloud services
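
Because a malformed address or an oversized list only surfaces when the configuration is applied, it can be useful to validate the public IP ranges first. The following is a minimal sketch using Python's standard ipaddress module. The helper build_public_ip_filter is hypothetical, the sample addresses come from documentation-reserved ranges, and the 200-block check only covers the ranges passed to this one function.

```python
import ipaddress
import json

MAX_CIDR_BLOCKS = 200  # limit quoted in the list above


def build_public_ip_filter(cidr_ranges):
    """Validate each range and return an IP filter config for public sources."""
    normalized = []
    for entry in cidr_ranges:
        # Raises ValueError for malformed input or for ranges with host bits set.
        network = ipaddress.ip_network(entry)
        normalized.append(str(network))
    if len(normalized) > MAX_CIDR_BLOCKS:
        raise ValueError(
            "too many CIDR blocks: %d > %d" % (len(normalized), MAX_CIDR_BLOCKS)
        )
    return {
        "mode": "Enabled",
        "publicNetworkSource": {"allowedIpCidrRanges": normalized},
    }


if __name__ == "__main__":
    # Placeholder addresses; replace them with your own.
    config = build_public_ip_filter(["203.0.113.10", "198.51.100.0/24"])
    print(json.dumps(config, indent=2))
```

A single address is written out as a /32 (or /128 for IPv6) range. Save the printed JSON to a file and pass that file to the --ip-filter-file option shown in the setup steps.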

Read more about GCS IP filtering