Transaction Retry Error Reference

On this page Carat arrow pointing down

When a transaction is unable to complete due to contention with another concurrent or recent transaction attempting to write to the same data, CockroachDB will automatically attempt to retry the failed transaction without involving the client (i.e., silently). If the automatic retry is not possible or fails, a transaction retry error is emitted to the client.

Transaction retry errors fall into two categories:

All transaction retry errors use the SQLSTATE error code 40001, and emit error messages with the string restart transaction. Further, each error includes a specific error code to assist with targeted troubleshooting.

When experiencing transaction retry errors, you should follow the guidance under Actions to take, and then consult the reference for your specific transaction retry error for guidance specific to the error message encountered.

Overview

CockroachDB always attempts to find a serializable ordering among all of the currently-executing transactions.

Whenever possible, CockroachDB will auto-retry a transaction internally without notifying the client. CockroachDB will only send a serialization error to the client when it cannot resolve the error automatically without client-side intervention.

The main reason why CockroachDB cannot auto-retry every serialization error without sending an error to the client is that the SQL language is "conversational" by design. The client can send arbitrary statements to the server during a transaction, receive some results, and then decide to issue other arbitrary statements inside the same transaction based on the server's response.

Actions to take

In most cases, the correct actions to take when encountering transaction retry errors are:

  1. Update your application to support client-side retry handling when transaction retry errors are encountered.

  2. Adjust your application logic to minimize transaction retry errors in the first place.

Client-side retry handling

Your application should include client-side retry handling when the statements are sent individually, such as:

> BEGIN;

> UPDATE products SET inventory = 0 WHERE sku = '8675309';

> INSERT INTO orders (customer, status) VALUES (1, 'new');

> COMMIT;

To indicate that a transaction must be retried, CockroachDB signals an error with the SQLSTATE error code 40001 (serialization error) and an error message that begins with the string "restart transaction".

To handle these types of errors, you have the following options:

Client-side retry handling example

For a conceptual example of application-defined retry logic, and testing that logic against your application's needs, see the client-side retry handling example.

Minimize transaction retry errors

In addition to the steps described in Client-side retry handling, which detail how to configure your application to restart a failed transaction, there are also a number of changes you can make to your application logic to increase the chance that CockroachDB can automatically retry a failed transaction, and to reduce the number of transaction retry errors that reach the client application in the first place:

  1. Limit the number of affected rows by following performance-tuning best practices (e.g., query performance tuning, index design and maintenance, etc.). Not only will transactions run faster and hold locks for a shorter duration, but the chances of read invalidation when the transaction’s timestamp is pushed due to a conflicting write is decreased due to a smaller read set (i.e., a smaller number of rows read).

  2. Break down larger transactions into smaller ones (e.g., bulk deletes) to have transactions hold locks for a shorter duration. This will also decrease the likelihood of pushed timestamps and retry errors. For instance, as the size of writes (number of rows written) decreases, the chances of the (bulk delete) transaction’s timestamp getting bumped by concurrent reads decreases.

  3. Design your applications to reduce network round trips by sending statements in transactions as a single batch (e.g., using common table expressions). Batching allows CockroachDB to automatically retry a transaction when previous reads are invalidated at a pushed timestamp. When a multi-statement transaction is not batched, and takes more than a single round trip, CockroachDB cannot automatically retry the transaction.

  4. Limit the size of the result sets of your transactions to under 16KB, so that CockroachDB is more likely to automatically retry when previous reads are invalidated at a pushed timestamp. When a transaction returns a result set over 16KB, even if that transaction has been sent as a single batch, CockroachDB cannot automatically retry the transaction.

  5. Use SELECT FOR UPDATE to aggressively lock rows that will later be updated in the transaction. Locking (blocking) earlier in the transaction will not allow other concurrent write transactions to conflict which leads to a situation where we would return out-of-date information subsequently returning a retry error (RETRY_WRITE_TOO_OLD). See When and why to use SELECT FOR UPDATE in CockroachDB for more information.

  6. Use historical reads (SELECT ... AS OF SYSTEM TIME), preferably bounded staleness reads or exact staleness with follower reads when possible to reduce conflicts with other writes. This reduces the likelihood of conflicts as fewer writes will happen at the historical timestamp. More specifically, writes’ timestamps are less likely to be pushed by historical reads as they would when the read has a higher priority level.

  7. If applicable to your workload, assign column families and separate columns that are frequently read and written into separate columns. Transactions will operate on disjoint column families and reduce the likelihood of conflicts.

  8. As a last resort, consider adjusting the closed timestamp interval using the kv.closed_timestamp.target_duration cluster setting to reduce the likelihood of long-running write transactions having their timestamps pushed. This setting should be carefully adjusted if no other mitigations are available because there can be downstream implications (e.g., Historical reads, change data capture feeds, Stats collection, handling zone configurations, etc.). For example, a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read. See the reference entry for RETRY_SERIALIZABLE for more information.

Transaction retry error reference

Note that your application's retry logic does not need to distinguish between the different types of serialization errors. They are listed here for reference during advanced troubleshooting.

Each transaction retry error listed includes an example error as it would appear from the context of the client, a description of the circumstances that cause that error, and specific guidance for addressing the error.

RETRY_WRITE_TOO_OLD

TransactionRetryWithProtoRefreshError: ... RETRY_WRITE_TOO_OLD ...

Error type: Serialization error

Description:

The RETRY_WRITE_TOO_OLD error occurs when a transaction A tries to write to a row R, but another transaction B that was supposed to be serialized after A (i.e., had been assigned a higher timestamp), has already written to that row R, and has already committed. This is a common error when you have too much contention in your workload.

Action:

  1. Retry transaction A as described in client-side retry handling.
  2. Adjust your application logic as described in minimize transaction retry errors. In particular, try to:

    1. Send all of the statements in your transaction in a single batch.
    2. Use SELECT FOR UPDATE to aggressively lock rows that will later be updated in the transaction.

See Minimize transaction retry errors for the full list of recommended remediations.

RETRY_SERIALIZABLE

TransactionRetryWithProtoRefreshError: ... RETRY_SERIALIZABLE ...

Error type: Serialization error

Description:

The RETRY_SERIALIZABLE error occurs in the following cases:

  1. When a transaction A has its timestamp moved forward (also known as A being "pushed") as CockroachDB attempts to find a serializable transaction ordering. Specifically, transaction A tried to write a key that transaction B had already read, and B was supposed to be serialized after A (i.e., B had a higher timestamp than A). CockroachDB will try to serialize A after B by changing A's timestamp, but it cannot do that when another transaction has subsequently written to some of the keys that A has read and returned to the client. When that happens, the RETRY_SERIALIZATION error is signalled. For more information about how timestamp pushes work in our transaction model, see the architecture docs on the transaction layer's timestamp cache.

  2. When a high-priority transaction A does a read that runs into a write intent from another lower-priority transaction B, and some other transaction C writes to a key that B has already read. Transaction B will get this error when it tries to commit, because A has already read some of the data touched by B and returned results to the client, and C has written data previously read by B.

  3. When a transaction A is forced to refresh (i.e., change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read. If this is the cause of the error, the solution is to increase the kv.closed_timestamp.target_duration setting to a higher value. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2 (or experiment with increasing the closed timestamp interval, if that is possible for your application - see the note below).

Action:

  1. If you encounter case 1 or 2 above, the solution is to:

    1. Retry transaction A as described in client-side retry handling.
    2. Adjust your application logic as described in minimize transaction retry errors. In particular, try to:
      1. Send all of the statements in your transaction in a single batch.
      2. Use historical reads with SELECT ... AS OF SYSTEM TIME.
  2. If you encounter case 3 above, the solution is to:

    1. Increase the kv.closed_timestamp.target_duration setting to a higher value. As described above, this will impact the freshness of data available via Follower Reads and CDC changefeeds.
    2. Retry transaction A as described in client-side retry handling.
    3. Adjust your application logic as described in minimize transaction retry errors. In particular, try to:
      1. Send all of the statements in your transaction in a single batch.
      2. Use historical reads with SELECT ... AS OF SYSTEM TIME.
Note:

If you increase the kv.closed_timestamp.target_duration setting, it means that you are increasing the amount of time by which the data available in Follower Reads and CDC changefeeds lags behind the current state of the cluster. In other words, there is a trade-off here: if you absolutely must execute long-running transactions that execute concurrently with other transactions that are writing to the same data, you may have to settle for longer delays on Follower Reads and/or CDC to avoid frequent serialization errors. The anomaly that would be exhibited if these transactions were not retried is called write skew.

See Minimize transaction retry errors for the full list of recommended remediations.

RETRY_ASYNC_WRITE_FAILURE

TransactionRetryWithProtoRefreshError: ... RETRY_ASYNC_WRITE_FAILURE ...

Error type: Internal state error

Description:

The RETRY_ASYNC_WRITE_FAILURE error occurs when some kind of problem with your cluster's operation occurs at the moment of a previous write in the transaction, causing CockroachDB to fail to replicate one of the transaction's writes. For example, this can happen if you have a networking partition that cuts off access to some nodes in your cluster.

Action:

  1. Retry the transaction as described in client-side retry handling. This is worth doing because the problem with the cluster is likely to be transient.
  2. Investigate the problems with your cluster. For cluster troubleshooting information, see Troubleshoot Cluster Setup.

See Minimize transaction retry errors for the full list of recommended remediations.

ReadWithinUncertaintyInterval

TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError:
        read at time 1591009232.376925064,0 encountered previous write with future timestamp 1591009232.493830170,0 within uncertainty interval `t <= 1591009232.587671686,0`;
        observed timestamps: [{1 1591009232.587671686,0} {5 1591009232.376925064,0}]

Error type: Serialization error

Description:

The ReadWithinUncertaintyIntervalError can occur when two transactions which start on different gateway nodes attempt to operate on the same data at close to the same time, and one of the operations is a write. The uncertainty comes from the fact that we cannot tell which one started first - the clocks on the two gateway nodes may not be perfectly in sync.

For example, if the clock on node A is ahead of the clock on node B, a transaction started on node A may be able to commit a write with a timestamp that is still in the "future" from the perspective of node B. A later transaction that starts on node B should be able to see the earlier write from node A, even if B's clock has not caught up to A. The "read within uncertainty interval" occurs if we discover this situation in the middle of a transaction, when it is too late for the database to handle it automatically. When node B's transaction retries, it will unambiguously occur after the transaction from node A.

Note:

This behavior is non-deterministic: it depends on which node is the leaseholder of the underlying data range. It’s generally a sign of contention. Uncertainty errors are always possible with near-realtime reads under contention.

Action:

The solution is to do one of the following:

  1. Be prepared to retry on uncertainty (and other) errors, as described in client-side retry handling.
  2. Adjust your application logic as described in minimize transaction retry errors. In particular, try to:
    1. Send all of the statements in your transaction in a single batch.
    2. Use historical reads with SELECT ... AS OF SYSTEM TIME.
  3. If you trust your clocks, you can try lowering the --max-offset option to cockroach start, which provides an upper limit on how long a transaction can continue to restart due to uncertainty.
Note:

Uncertainty errors are a form of transaction conflict. For more information about transaction conflicts, see Transaction conflicts.

See Minimize transaction retry errors for the full list of recommended remediations.

RETRY_COMMIT_DEADLINE_EXCEEDED

TransactionRetryWithProtoRefreshError: TransactionPushError: transaction deadline exceeded ...

Error type: Serialization error

Description:

The RETRY_COMMIT_DEADLINE_EXCEEDED error means that the transaction timed out due to being pushed by other concurrent transactions. This error is most likely to happen to long-running transactions. The conditions that trigger this error are very similar to the conditions that lead to a RETRY_SERIALIZABLE error, except that a transaction that hits this error got pushed for several minutes, but did not hit any of the conditions that trigger a RETRY_SERIALIZABLE error. In other words, the conditions that trigger this error are a subset of those that trigger RETRY_SERIALIZABLE, and that this transaction ran for too long (several minutes).

Note:

Read-only transactions do not get pushed, so they do not run into this error.

This error occurs in the cases described below.

  1. When a transaction A has its timestamp moved forward (also known as A being "pushed") as CockroachDB attempts to find a serializable transaction ordering. Specifically, transaction A tried to write a key that transaction B had already read. B was supposed to be serialized after A (i.e., B had a higher timestamp than A). CockroachDB will try to serialize A after B by changing A's timestamp.

  2. When a high-priority transaction A does a read that runs into a write intent from another lower-priority transaction B. Transaction B may get this error when it tries to commit, because A has already read some of the data touched by B and returned results to the client.

  3. When a transaction A is forced to refresh (change its timestamp) due to hitting the maximum closed timestamp interval (closed timestamps enable Follower Reads and Change Data Capture (CDC)). This can happen when transaction A is a long-running transaction, and there is a write by another transaction to data that A has already read.

Action:

  1. The RETRY_COMMIT_DEADLINE_EXCEEDED error is one case where the standard advice to add a retry loop to your application may not be advisable. A transaction that runs for long enough to get pushed beyond its deadline is quite likely to fail again on retry for the same reasons. Therefore, the best thing to do in this case is to shrink the running time of your transactions so they complete more quickly and do not hit the deadline.
  2. If you encounter case 3 above, you can increase the kv.closed_timestamp.target_duration setting to a higher value. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2 (or experiment with increasing the closed timestamp interval, if that is possible for your application - see the note below).
Note:

If you increase the kv.closed_timestamp.target_duration setting, it means that you are increasing the amount of time by which the data available in Follower Reads and CDC changefeeds lags behind the current state of the cluster. In other words, there is a trade-off here: if you absolutely must execute long-running transactions that execute concurrently with other transactions that are writing to the same data, you may have to settle for longer delays on Follower Reads and/or CDC to avoid frequent serialization errors. The anomaly that would be exhibited if these transactions were not retried is called write skew.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_ABORTED_RECORD_FOUND

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND) ...

Error type: Serialization error

Description:

The ABORT_REASON_ABORTED_RECORD_FOUND error means that the client application is trying to use a transaction that has been aborted. This happens in one of the following cases:

  • Write-write conflict: Another high-priority transaction B encountered a write intent by our transaction A, and tried to push A's timestamp.
  • Cluster overload: B thinks that A's transaction coordinator node is dead, because the coordinator node hasn't heartbeated the transaction record for a few seconds.
  • Deadlock: Some transaction B is trying to acquire conflicting locks in reverse order from transaction A.

Action:

If you are encountering deadlocks:

  • Avoid producing deadlocks in your application by making sure that transactions acquire locks in the same order.

If you are using only default transaction priorities:

  • This error means your cluster has problems. You are likely overloading it. Investigate the source of the overload, and do something about it. For more information, see Node liveness issues.

If you are using high- or low-priority transactions:

  1. Retry the transaction as described in client-side retry handling
  2. Adjust your application logic as described in minimize transaction retry errors.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_CLIENT_REJECT

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_CLIENT_REJECT) ...

Error type: Serialization error

Description:

The ABORT_REASON_CLIENT_REJECT error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_PUSHER_ABORTED

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) ...

Error type: Serialization error

Description:

The ABORT_REASON_PUSHER_ABORTED error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_ABORT_SPAN

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_ABORT_SPAN) ...

Error type: Serialization error

Description:

The ABORT_REASON_ABORT_SPAN error is caused by the same conditions as the ABORT_REASON_ABORTED_RECORD_FOUND, and requires the same actions. The errors are fundamentally the same, except that they are discovered at different points in the process.

See Minimize transaction retry errors for the full list of recommended remediations.

ABORT_REASON_NEW_LEASE_PREVENTS_TXN

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN) ...

Error type: Internal state error

Description:

The ABORT_REASON_NEW_LEASE_PREVENTS_TXN error occurs because the timestamp cache will not allow transaction A to create a transaction record. A new lease wipes the timestamp cache, so this could mean the leaseholder was moved and the duration of transaction A was unlucky enough to happen across a lease acquisition. In other words, leaseholders got shuffled out from underneath transaction A (due to no fault of the client application or schema design), and now it has to be retried.

Action:

Retry transaction A as described in client-side retry handling.

ABORT_REASON_TIMESTAMP_CACHE_REJECTED

TransactionRetryWithProtoRefreshError:TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED) ...

Error type: Internal state error

Description:

The ABORT_REASON_TIMESTAMP_CACHE_REJECTED error occurs when the timestamp cache will not allow transaction A to create a transaction record. This can happen due to a range merge happening in the background, or because the timestamp cache is an in-memory cache, and has outgrown its memory limit (about 64 MB).

Action:

Retry transaction A as described in client-side retry handling.

injected by inject_retry_errors_enabled session variable

 TransactionRetryWithProtoRefreshError: injected by `inject_retry_errors_enabled` session variable

Error type: Internal state error

Description:

When the inject_retry_errors_enabled session variable is set to true, any statement (with the exception of SET statements) executed in the session inside of an explicit transaction will return this error.

For more details, see Testing transaction retry logic.

Action:

To turn off error injection, set the inject_retry_errors_enabled session variable to false.

See also


Yes No
On this page

Yes No