This page provides best practices for optimizing import performance in CockroachDB.
Import speed primarily depends on the amount of data that you want to import. However, there are two main factors that can have a large impact on the amount of time it will take to run an import:
- Whether the data is split into multiple files.
- The format of the import files.
If the import size is small, then you do not need to do anything to optimize performance. In this case, the import should run quickly, regardless of the settings.
Split your data into multiple files
Splitting the import data into multiple files can have a large impact on the import performance. The following formats support multi-file import using IMPORT INTO:
- CSV
- DELIMITED DATA
- AVRO
For these formats, we recommend splitting your data into as many files as there are nodes.
For example, if you have a 3-node cluster, split your data into 3 files, create your table, and import into that table:
CREATE TABLE customers (id UUID PRIMARY KEY, name TEXT, INDEX name_idx(name));

IMPORT INTO customers (id, name)
CSV DATA (
  's3://{BUCKET NAME}/{customers.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}',
  's3://{BUCKET NAME}/{customers2.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}',
  's3://{BUCKET NAME}/{customers3.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}'
);
CockroachDB imports the files that you give it, and does not further split them. For example, if you import one large file for all of your data, CockroachDB will process that file on one node—even if you have more nodes available. However, if you import two files (and your cluster has at least two nodes), each node will process a file in parallel. This is why splitting your data into as many files as you have nodes will dramatically decrease the time it takes to import data.
Splitting the data into more files than you have nodes does not have a large additional impact on performance.
File storage during import
During migration, all of the features of IMPORT that interact with external file storage assume that every node has the exact same view of that storage. In other words, in order to import from a file, every node needs to have the same access to that file.
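For illustration, the sketch below reuses the customers table from the example above; the nodelocal path and node ID are assumptions. A cloud storage URL gives every node the same view of the file, while a nodelocal URI points at one node's external I/O directory, so the file must already exist on that node.

-- Every node resolves this URL to the same object in cloud storage.
IMPORT INTO customers (id, name)
CSV DATA (
  's3://{BUCKET NAME}/{customers.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}'
);

-- This form works only if customers.csv has been placed in node 1's
-- external I/O directory (for example, with cockroach nodelocal upload).
IMPORT INTO customers (id, name)
CSV DATA ('nodelocal://1/customers.csv');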
Choose a performant import format
Import formats do not have the same performance because of the way they are processed. Below, import formats are listed from fastest to slowest:
1. CSV or DELIMITED DATA (both have about the same import performance).
2. AVRO.
We recommend formatting your import files as CSV, DELIMITED DATA, or AVRO. These formats can be processed in parallel by multiple threads, which increases performance. To import in these formats, use IMPORT INTO.
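For example, the same IMPORT INTO form accepts the other recommended formats. The sketch below assumes an Avro version of the customers data ({customers.avro}) exists in the same bucket:

IMPORT INTO customers (id, name)
AVRO DATA (
  's3://{BUCKET NAME}/{customers.avro}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}'
);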
As of v22.1, certain IMPORT TABLE statements that defined the table schema inline are not supported. See Import — Considerations for more details. To import data into a new table, use CREATE TABLE followed by IMPORT INTO. For an example, read Import into a new table from a CSV file.
Import the schema separately from the data
Split your dump data into two files:
- A SQL file containing the table schema.
- A CSV file containing the table data.
Convert the schema-only file using the Schema Conversion Tool, which automatically creates a new CockroachDB Serverless database with the converted schema.
If you are migrating to a CockroachDB Self-Hosted database, you can export the converted schema and execute the statements in cockroach sql, or use a third-party schema migration tool such as Alembic, Flyway, or Liquibase.
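For example, a minimal sketch of the schema-only file, assuming the customers table used throughout this page (the file name customers_schema.sql is hypothetical):

-- customers_schema.sql: table definition only, no data.
CREATE TABLE customers (
  id UUID PRIMARY KEY,
  name TEXT,
  INDEX name_idx (name)
);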
Then use the IMPORT INTO statement to import the CSV data into the newly created table:
IMPORT INTO customers (id, name)
CSV DATA (
  'https://s3-us-west-1.amazonaws.com/cockroachdb-movr/datasets/employees-db/pg_dump/customers.csv'
);
This method has the added benefit of surfacing potential issues with the import sooner; that is, you will not have to wait for both the schema and the data to load just to find an error in the schema.
Import into a schema with secondary indexes
When importing data into a table with secondary indexes, the import job will ingest the table data and required secondary index data concurrently. This may result in a longer import time compared to a table without secondary indexes. However, this typically adds less time to the initial import than following it with a separate pass to add the indexes. As a result, importing tables with their secondary indexes is the default workflow, suitable for most import jobs.
However, for large imports, it may be preferable to remove the secondary indexes from the schema, perform the import, and then re-create the indexes separately. This provides increased visibility into the import's progress and the ability to retry each step independently.
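A minimal sketch of that workflow, reusing the customers table from the examples above:

-- 1. Create the table without its secondary index.
CREATE TABLE customers (id UUID PRIMARY KEY, name TEXT);

-- 2. Import the data.
IMPORT INTO customers (id, name)
CSV DATA (
  's3://{BUCKET NAME}/{customers.csv}?AWS_ACCESS_KEY_ID={KEY ID}&AWS_SECRET_ACCESS_KEY={SECRET ACCESS KEY}'
);

-- 3. Re-create the secondary index after the import finishes.
CREATE INDEX name_idx ON customers (name);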
Data type sizes
Above a certain size, many data types such as STRING, DECIMAL, ARRAY, BYTES, and JSONB may run into performance issues due to write amplification. See each data type's documentation for its recommended size limits.