Use Cloud Storage for Bulk Operations

On this page Carat arrow pointing down
Warning:
Cockroach Labs will stop providing Assistance Support for v21.2 on May 16, 2023. Prior to that date, upgrade to a more recent version to continue receiving support. For more details, see the Release Support Policy.

CockroachDB constructs a secure API call to the cloud storage specified in a URL passed to one of the following statements:

Tip:

We strongly recommend using cloud/remote storage.

URL format

URLs for the files you want to import must use the format shown below. For examples, see Example file URLs.

[scheme]://[host]/[path]?[parameters]
Location Scheme Host Parameters
Amazon s3 Bucket name AUTH: optional implicit or specified (default: specified); AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

AWS_SESSION_TOKEN: (optional) For more information, see Authentication - Amazon S3.

S3_STORAGE_CLASS: (optional) Specify the Amazon S3 storage class for created objects. Note that Glacier Flexible Retrieval and Glacier Deep Archive are not compatible with incremental backups. Default: STANDARD.
Azure azure Storage container AZURE_ACCOUNT_NAME: The name of your Azure account.

AZURE_ACCOUNT_KEY: Your Azure account key. You must url encode your Azure account key before authenticating to Azure Storage. For more information, see Authentication - Azure Storage.

AZURE_ENVIRONMENT: (optional) The Azure environment that the storage account belongs to. The accepted values are: AZURECHINACLOUD, AZUREGERMANCLOUD, AZUREPUBLICCLOUD, and AZUREUSGOVERNMENTCLOUD. These are cloud environments that meet security, compliance, and data privacy requirements for the respective instance of Azure cloud. If the parameter is not specified, it will default to AZUREPUBLICCLOUD.
Google Cloud gs Bucket name AUTH: implicit, or specified (default: specified); CREDENTIALS

For more information, see Authentication - Google Cloud Storage.
HTTP http Remote host N/A

For more information, see Authentication - HTTP.
NFS/Local 1 nodelocal nodeID or self 2 (see Example file URLs) N/A
S3-compatible services s3 Bucket name
Warning:
While Cockroach Labs actively tests Amazon S3, Google Cloud Storage, and Azure Storage, we do not test S3-compatible services (e.g., MinIO, Red Hat Ceph).


AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_REGION 3 (optional), AWS_ENDPOINT

For more information, see Authentication - S3-compatible services.
Tip:

The location parameters often contain special characters that need to be URI-encoded. Use Javascript's encodeURIComponent function or Go language's url.QueryEscape function to URI-encode the parameters. Other languages provide similar functions to URI-encode special characters.

Note:

You can disable the use of implicit credentials when accessing external cloud storage services for various bulk operations by using the --external-io-disable-implicit-credentials flag.

1 The file system backup location on the NFS drive is relative to the path specified by the --external-io-dir flag set while starting the node. If the flag is set to disabled, then imports from local directories and NFS drives are disabled.

2 Using a nodeID is required and the data files will be in the extern directory of the specified node. In most cases (including single-node clusters), using nodelocal://1/<path> is sufficient. Use self if you do not want to specify a nodeID, and the individual data files will be in the extern directories of arbitrary nodes; however, to work correctly, each node must have the --external-io-dir flag point to the same NFS mount or other network-backed, shared storage.

3 The AWS_REGION parameter is optional since it is not a required parameter for most S3-compatible services. Specify the parameter only if your S3-compatible service requires it.

Example file URLs

Example URLs for BACKUP, RESTORE, changefeeds, or EXPORT given a bucket or container name of acme-co and an employees subdirectory:

Location Example
Amazon S3 s3://acme-co/employees?AWS_ACCESS_KEY_ID=123&AWS_SECRET_ACCESS_KEY=456
Azure azure://acme-co/employees?AZURE_ACCOUNT_NAME=acme-co&AZURE_ACCOUNT_KEY=url-encoded-123
Google Cloud gs://acme-co/employees?AUTH=specified&CREDENTIALS=encoded-123
NFS/Local nodelocal://1/path/employees, nodelocal://self/nfsmount/backups/employees 2
Note:

Cloud storage sinks (for changefeeds) only work with JSON and emits newline-delimited JSON files.

Example URLs for IMPORT given a bucket or container name of acme-co and a filename of employees:

Location Example
Amazon S3 s3://acme-co/employees.sql?AWS_ACCESS_KEY_ID=123&AWS_SECRET_ACCESS_KEY=456
Azure azure://acme-co/employees.sql?AZURE_ACCOUNT_NAME=acme-co&AZURE_ACCOUNT_KEY=url-encoded-123
Google Cloud gs://acme-co/employees.sql?AUTH=specified&CREDENTIALS=encoded-123
HTTP http://localhost:8080/employees.sql
NFS/Local nodelocal://1/path/employees, nodelocal://self/nfsmount/backups/employees 2
Note:

HTTP storage can only be used for IMPORT and CREATE CHANGEFEED.

Encryption

Transport Layer Security (TLS) is used for encryption in transit when transmitting data to or from Amazon S3, Google Cloud Storage, and Azure.

For encryption at rest, if your cloud provider offers transparent data encryption, you can use that to ensure that your backups are not stored on disk in cleartext.

CockroachDB also provides client-side encryption of backup data, for more information, see Take and Restore Encrypted Backups.

Authentication

When running bulk operations to and from a storage bucket, authentication setup can vary depending on the cloud provider. This section details the necessary steps to authenticate to each cloud provider.

Note:

implicit authentication cannot be used to run bulk operations from CockroachDB Cloud clusters—instead, use AUTH=specified.

The AUTH parameter passed to the file URL must be set to either specified or implicit. The following sections describe how to set up each authentication method.

Specified authentication

If the AUTH parameter is not provided, AWS connections default to specified and the access keys must be provided in the URI parameters.

As an example:

icon/buttons/copy
BACKUP DATABASE <database> INTO 's3://{bucket name}/{path in bucket}/?AWS_ACCESS_KEY_ID={access key ID}&AWS_SECRET_ACCESS_KEY={secret access key}';

Implicit authentication

If the AUTH parameter is implicit, the access keys can be omitted and the credentials will be loaded from the environment (i.e., the machines running the backup).

icon/buttons/copy
BACKUP DATABASE <database> INTO 's3://{bucket name}/{path}?AUTH=implicit';

You can associate an EC2 instance with an IAM role to provide implicit access to S3 storage within the IAM role's policy. In the following command, the instance example EC2 instance is associated with the example profile instance profile, giving the EC2 instance implicit access to any example profile S3 buckets.

icon/buttons/copy
aws ec2 associate-iam-instance-profile --iam-instance-profile Name={example profile} --region={us-east-2} --instance-id {instance example}

The AUTH parameter passed to the file URL must be set to either specified or implicit. The default behavior is specified in v21.2+. The following sections describe how to set up each authentication method.

Specified authentication

To access the storage bucket with specified credentials, it's necessary to create a service account and add the service account address to the permissions on the specific storage bucket.

The JSON credentials file for authentication can be downloaded from the Service Accounts page in the Google Cloud Console and then base64-encoded:

icon/buttons/copy
cat gcs_key.json | base64

Pass the encoded JSON object to the CREDENTIALS parameter:

icon/buttons/copy
BACKUP DATABASE <database> INTO 'gs://{bucket name}/{path}?AUTH=specified&CREDENTIALS={encoded key}';

Implicit authentication

For CockroachDB instances that are running within a Google Cloud Environment, environment data can be used from the service account to implicitly access resources within the storage bucket.

For CockroachDB clusters running in other environments, implicit authentication access can still be set up manually with the following steps:

  1. Create a service account and add the service account address to the permissions on the specific storage bucket.

  2. Download the JSON credentials file from the Service Accounts page in the Google Cloud Console to the machines that CockroachDB is running on. (Since this file will be passed as an environment variable, it does not need to be base64-encoded.) Ensure that the file is located in a path that CockroachDB can access.

  3. Create an environment variable instructing CockroachDB where the credentials file is located. The environment variable must be exported on each CockroachDB node:

    icon/buttons/copy
    export GOOGLE_APPLICATION_CREDENTIALS="/{cockroach}/gcs_key.json"
    

    Alternatively, to pass the credentials using systemd, use systemctl edit cockroach.service to add the environment variable Environment="GOOGLE_APPLICATION_CREDENTIALS=gcs-key.json" under [Service] in the cockroach.service unit file. Then, run systemctl daemon-reload to reload the systemd process. Restart the cockroach process on each of the cluster's nodes with systemctl restart cockroach, which will reload the configuration files.

    To pass the credentials using code, see Google's Authentication documentation.

  4. Run a backup (or other bulk operation) to the storage bucket with the AUTH parameter set to implicit:

    icon/buttons/copy
    BACKUP DATABASE <database> INTO 'gs://{bucket name}/{path}?AUTH=implicit';
    
Note:

If the use of implicit credentials is disabled with --external-io-disable-implicit-credentials flag, an error will be returned when accessing external cloud storage services for various bulk operations when using AUTH=implicit.

To access Azure storage containers, it is sometimes necessary to url encode the account key since it is base64-encoded and may contain +, /, = characters. For example:

icon/buttons/copy
BACKUP DATABASE <database> INTO 'azure://{container name}/{path}?AZURE_ACCOUNT_NAME={account name}&AZURE_ACCOUNT_KEY={url-encoded key}';

If your environment requires an HTTP or HTTPS proxy server for outgoing connections, you can set the standard HTTP_PROXY and HTTPS_PROXY environment variables when starting CockroachDB. You can create your own HTTP server with NGINX. A custom root CA can be appended to the system's default CAs by setting the cloudstorage.http.custom_ca cluster setting, which will be used when verifying certificates from HTTPS URLs.

If you cannot run a full proxy, you can disable external HTTP(S) access (as well as custom HTTP(S) endpoints) when importing by using the --external-io-disable-http flag.

Warning:

Unlike Amazon S3, Google Cloud Storage, and Azure Storage options, the usage of S3-compatible services is not actively tested by Cockroach Labs.

A custom root CA can be appended to the system's default CAs by setting the cloudstorage.http.custom_ca cluster setting, which will be used when verifying certificates from an S3-compatible service.

Storage permissions

This section describes the minimum permissions required to run CockroachDB bulk operations. While we outline information on the necessary permissions to access Amazon S3 and Google Cloud Storage when running these operations, the provider's documentation provides detail on the setup process and different options regarding access management.

Depending on the actions a bulk operation performs, it will require different access permissions to a cloud storage bucket.

This table outlines the actions that each operation performs against the storage bucket:

Operation Permission Description
Backup Write Backups write the backup data to the bucket/container. During a backup job, a BACKUP CHECKPOINT file will be written that tracks the progress of the backup.
Get Backups need get access after a pause to read the checkpoint files on resume.
List Backups need list access to the files already in the bucket. For example, BACKUP uses list to find previously taken backups when executing an incremental backup.
Delete Backups may delete files and progress during the job after a pause in order to resume the job from a certain point. This can include overwriting backup work that had happened before the pause.
Restore Get Restores need access to retrieve files from the backup. Restore also requires access to the LATEST file in order to read the latest available backup.
List Restores need list access to the files already in the bucket to find other backups in the backup collection. This contains metadata files that describe the backup, the LATEST file, and other versioned subdirectories and files.
Import Get Imports read the requested file(s) from the storage bucket.
Export Write Exports need write access to the storage bucket to create individual export file(s) from the exported data.
Enterprise changefeeds Write Changefeeds will write files to the storage bucket that contain row changes and resolved timestamps.

These actions are the minimum access permissions to be set in an Amazon S3 bucket policy:

Operation S3 permission
Backup s3:PutObject, s3:GetObject, s3:ListBucket, s3:DeleteObject
Restore s3:GetObject, s3:ListBucket
Import s3:GetObject
Export s3:PutObject
Enterprise Changefeeds s3:PutObject

See Policies and Permissions in Amazon S3 for detail on setting policies and permissions in Amazon S3.

An example S3 bucket policy for a backup:

{
    "Version": "2012-10-17",
    "Id": "Example_Policy",
    "Statement": [
        {
            "Sid": "ExampleStatement01",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::{ACCOUNT_ID}:user/{USER}"
            },
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::{BUCKET_NAME}",
                "arn:aws:s3:::{BUCKET_NAME}/*"
            ]
        }
    ]
}

In Google Cloud Storage, you can grant users roles that define their access level to the storage bucket. For the purposes of running CockroachDB operations to your bucket, the following table lists the permissions that represent the minimum level required for each operation. GCS provides different levels of granularity for defining the roles in which these permissions reside. You can assign roles that already have these permissions configured, or make your own custom roles that include these permissions.

For more detail about Predefined, Basic, and Custom roles, see IAM roles for Cloud Storage.

Operation GCS Permission
Backup storage.objects.create, storage.objects.get, storage.objects.list, storage.objects.delete
Restore storage.objects.get, storage.objects.list
Import storage.objects.get
Export storage.objects.create
Changefeeds storage.objects.create

For guidance on adding a user to a bucket's policy, see Add a principal to a bucket-level policy.

Additional cloud storage feature support

Amazon S3 storage classes

New in v21.2.6: When storing objects in Amazon S3 buckets during backups, exports, and changefeeds, you can specify the S3_STORAGE_CLASS={class} parameter in the URI to configure a storage class type.

The following S3 connection URI uses the INTELLIGENT_TIERING storage class:

's3://{BUCKET NAME}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}&S3_STORAGE_CLASS=INTELLIGENT_TIERING'

While Cockroach Labs supports configuring an AWS storage class, we only test against S3 Standard. We recommend implementing your own testing with other storage classes.

Note:

Incremental backups are not compatible with the S3 Glacier Flexible Retrieval or Glacier Deep Archive storage classes. Incremental backups require ad-hoc reading of previous backups, which is not possible with the Glacier Flexible Retrieval or Glacier Deep Archive storage classes as they do not allow immediate access to S3 objects without first restoring the objects. See Amazon's documentation on Restoring an archived object for more detail.

This table lists the valid CockroachDB parameters that map to an S3 storage class:

CockroachDB parameter AWS S3 storage class
STANDARD S3 Standard
REDUCED_REDUNDANCY Reduced redundancy Note: Amazon recommends against using this storage class.
STANDARD_IA Standard Infrequent Access
ONEZONE_IA One Zone Infrequent Access
INTELLIGENT_TIERING Intelligent Tiering
GLACIER Glacier Flexible Retrieval
DEEP_ARCHIVE Glacier Deep Archive
OUTPOSTS Outpost
GLACIER_IR Glacier Instant Retrieval

You can view an object's storage class in the Amazon S3 Console from the object's Properties tab. Alternatively, use the AWS CLI to list objects in a bucket, which will also display the storage class:

aws s3api list-objects-v2 --bucket {bucket-name}
{
    "Key": "2022/05/02-180752.65/metadata.sst",
    "LastModified": "2022-05-02T18:07:54+00:00",
    "ETag": "\"c0f499f21d7886e4289d55ccface7527\"",
    "Size": 7865,
    "StorageClass": "STANDARD"
},
    ...

    "Key": "2022-05-06/202205061217256387084640000000000-1b4e610c63535061-1-2-00000000-
users-7.ndjson",
    "LastModified": "2022-05-06T12:17:26+00:00",
    "ETag": "\"c60a013619439bf83c505cb6958b55e2\"",
    "Size": 94596,
    "StorageClass": "INTELLIGENT_TIERING"
},

For a specific operation, see the following examples:

See also


Yes No
On this page

Yes No