75 changes: 68 additions & 7 deletions src/current/v25.4/manage-logical-data-replication.md
@@ -33,13 +33,7 @@ When a conflict cannot apply due to violating [constraints]({% link {{ page.vers

### Dead letter queue (DLQ)

When the LDR job starts, it will create a DLQ table with each replicating table so that unresolved conflicts can be tracked. The DLQ will contain the writes that LDR cannot apply after the retry period of a minute, which could occur if there is a unique index on the destination table (for more details, refer to [Unique seconday indexes]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}#unique-secondary-indexes)).

{{site.data.alerts.callout_info}}
LDR will not pause when the writes are sent to the DLQ, you must manage the DLQ manually.
{{site.data.alerts.end}}

To manage the DLQ, you can evaluate entries in the `incoming_row` column and apply the row manually to another table with SQL statements.
When the LDR job starts, it creates a DLQ table with each replicating table so that unresolved conflicts can be tracked. The DLQ contains the writes that LDR cannot apply after the retry period of a minute, which could occur if there is a unique index on the destination table (for more details, refer to [Unique secondary indexes]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}#unique-secondary-indexes)).

As an example, for an LDR stream created on the `movr.public.promo_codes` table:

@@ -80,6 +74,73 @@ CONSTRAINT dlq_113_public_promo_codes_pkey PRIMARY KEY (ingestion_job_id ASC, dl
)
~~~

#### Resolve rows in the DLQ
Contributor comment:

Suggested change
#### Resolve rows in the DLQ
#### Manage rows in the DLQ

Contributor comment:

Feels better as a verb for the header, since it's not clear from the previous section that anything is in an "unresolved" state.


LDR does not pause when writes are sent to the DLQ. You must manage the DLQ manually by examining each entry in the DLQ and either manually reinserting the row or deleting the entry from the DLQ. If you have multiple DLQ entries, resolve them in order from most recent to least recent.
Contributor comment:

Suggested change
LDR does not pause when writes are sent to the DLQ. You must manage the DLQ manually by examining each entry in the DLQ and either manually reinserting the row or deleting the entry from the DLQ. If you have multiple DLQ entries, resolve them in order from most recent to least recent.
LDR does not pause when writes are sent to the DLQ. You must manage the DLQ manually by examining each entry in the DLQ and either reinserting the row or deleting the entry from the DLQ. If you have multiple DLQ entries, resolve them in order from most recent to least recent.

Contributor comment:

2x "manually"
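
As a rough illustration of the "most recent to least recent" ordering in the paragraph under review, the following Python sketch groups DLQ entries by primary key and sorts each group newest first. It assumes the entries have already been fetched into memory as dictionaries with `id`, `dlq_timestamp`, and `incoming_row` keys. The helper name `order_dlq_entries`, the `pk_column` parameter, and the sample values are hypothetical and not part of LDR.

```python
import json
from collections import defaultdict
from datetime import datetime


def order_dlq_entries(entries, pk_column):
    """Group DLQ entries by primary key, most recent entry first.

    ASSUMPTION: `entries` is a list of dicts with `id`, `dlq_timestamp`
    (a datetime), and `incoming_row` (a JSON string) keys, and `pk_column`
    names the primary key column inside `incoming_row`.
    """
    grouped = defaultdict(list)
    # Sort newest first so each per-key list is already in resolution order.
    for entry in sorted(entries, key=lambda e: e["dlq_timestamp"], reverse=True):
        row = json.loads(entry["incoming_row"])
        grouped[row[pk_column]].append(entry)
    return dict(grouped)


# Hypothetical sample data, not real DLQ output.
sample = [
    {"id": 1, "dlq_timestamp": datetime(2025, 4, 25, 10, 0, 0),
     "incoming_row": '{"my_id": 207, "payload": "old"}'},
    {"id": 2, "dlq_timestamp": datetime(2025, 4, 25, 11, 0, 0),
     "incoming_row": '{"my_id": 207, "payload": "new"}'},
    {"id": 3, "dlq_timestamp": datetime(2025, 4, 25, 10, 30, 0),
     "incoming_row": '{"my_id": 42, "payload": "other"}'},
]

ordered = order_dlq_entries(sample, "my_id")
```

With this grouping, the first element of each list is the entry to resolve, and the remaining elements are older entries for the same row.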


To resolve a row in the DLQ:
Contributor comment:

Suggested change
To resolve a row in the DLQ:
To resolve an entry in the DLQ:

Contributor comment:

Section uses "row" and "entry" interchangeably, should stick to one


1. On the destination, find the primary key value in the `incoming_row` column.
Contributor comment:

Suggested change
1. On the destination, find the primary key value in the `incoming_row` column.
1. On the destination cluster's DLQ table, find the primary key value in the `incoming_row` column.


{% include_cached copy-clipboard.html %}
~~~ sql
SELECT id, dlq_timestamp, incoming_row FROM crdb_replication.dlq_271_foo;
~~~

In this example result, `incoming_row` contains a primary key value of `207` identified by the column `my_id`:

{% include_cached copy-clipboard.html %}
~~~ sql
        id        |         dlq_timestamp         |                                    incoming_row
------------------+-------------------------------+-------------------------------------------------------------------------------------
  106677386757203 | 2025-04-25 25:32:28.435439+00 | {"created_at": "2025-04-25:35:00.499499", "payload": "blahblahblah=", "my_id": 207}
~~~

1. Determine whether the value of the row matches on the source and the destination:
Contributor comment:

Suggested change
1. Determine whether the value of the row matches on the source and the destination:
1. Determine whether the value of the row in the DLQ matches the values on the source and destination tables respectively:


1. Check the value of the row and the replicated time:
Contributor comment:

Clarify that this is run on the DLQ table on the destination cluster

Contributor comment:

Nit: The parent step reads "source and destination" in that order but these are ordered destination then source. Suggest making those consistent


{% include_cached copy-clipboard.html %}
~~~ sql
SELECT * FROM foo WHERE my_id = 207;
SELECT replicated_time FROM [SHOW LOGICAL REPLICATION JOBS];
~~~

Contributor comment (on `SELECT * FROM foo WHERE my_id = 207;`):

We should be more clear throughout about whether these operations are taking place on the source table, the destination table, or the DLQ for the table. Can we use fully-qualified table names here with corresponding DB names in order to make it more clear? Or are the fully-qualified names going to be identical regardless of which cluster it's on?

1. On the source, check the value of the row as of the replicated time:
Contributor comment:

source cluster


{% include_cached copy-clipboard.html %}
~~~ sql
SELECT * FROM foo AS OF SYSTEM TIME {replicated time} WHERE my_id = 207;
~~~

1. Determine a course of action based on the results of the previous steps:

1. If the value of the row is the same on both the source and the destination, delete the row from the DLQ on the destination:

{% include_cached copy-clipboard.html %}
~~~ sql
DELETE FROM crdb_replication.dlq_271_foo WHERE id = 106677386757203;
~~~

1. If the row's value on the destination is different from its value on the source, but the row's value on the source equals its value in the DLQ, update the row on the destination to have the same value as on the source:
Contributor comment:

This is the same state as the following step, sounds like one of them is meant to say "but the row's value on the source is also different from its value in the DLQ" or similar.


{% include_cached copy-clipboard.html %}
~~~ sql
UPSERT INTO foo (my_id, created_at, payload) VALUES (207, '2025-04-25:35:00.499499', 'blahblahblah=');
~~~

1. If the row's value on the destination is different from its value on the source, and the row's value on the source equals its value in the DLQ, refresh the replicated time and retry the equality queries above. If the same results hold after a few retries with refreshed replicated times, there is likely a more recent entry for the row in the DLQ.

1. To find the more recent entry, find all rows in the DLQ with the matching primary key:

{% include_cached copy-clipboard.html %}
~~~ sql
-- On the destination:
SELECT id, dlq_timestamp, incoming_row FROM crdb_replication.dlq_271_foo WHERE incoming_row->>'my_id' = '207';
~~~

1. If there are more recent entries for the row, delete the less recent entries and resolve the row using the most recent entry.
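
The decision in step 3 can be sketched as a small Python function that compares the row as read from the source, the destination, and the DLQ entry. This is only an illustration: the names `Action` and `choose_action` are hypothetical. Because the second and third cases are worded with the same condition in the text, the sketch assumes the third case means the source row matches neither the destination row nor the DLQ entry.

```python
from enum import Enum


class Action(Enum):
    DELETE_DLQ_ENTRY = "delete the entry from the DLQ"
    UPSERT_ON_DESTINATION = "update the row on the destination to match the source"
    RETRY_OR_FIND_NEWER_ENTRY = "refresh the replicated time and retry, or look for a newer DLQ entry"


def choose_action(source_row, destination_row, dlq_row):
    """Pick a course of action for one DLQ entry.

    Each argument is a dict of column name to value for the same primary key.
    ASSUMPTION: the third case is taken to mean that the source row differs
    from both the destination row and the DLQ entry; the text does not state
    this explicitly.
    """
    if source_row == destination_row:
        return Action.DELETE_DLQ_ENTRY
    if source_row == dlq_row:
        return Action.UPSERT_ON_DESTINATION
    return Action.RETRY_OR_FIND_NEWER_ENTRY


# Hypothetical rows for illustration only.
source = {"my_id": 207, "payload": "blahblahblah="}
destination = {"my_id": 207, "payload": "stale"}
dlq = {"my_id": 207, "payload": "blahblahblah="}
action = choose_action(source, destination, dlq)
print(action.value)
```

Checking source against destination first means an entry is deleted as soon as the two tables agree, regardless of what the DLQ entry contains.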

## Schema changes

When you start LDR on a table, the job locks the schema, which prevents accidental [schema changes]({% link {{ page.version.version }}/online-schema-changes.md %}) that would cause issues for LDR. You can perform some [supported schema changes](#supported-schema-changes) on a replicating table; otherwise, you must stop LDR to [coordinate the schema change](#coordinate-other-schema-changes).
75 changes: 68 additions & 7 deletions src/current/v26.1/manage-logical-data-replication.md
Contributor comment:

Same comments as above

@@ -33,13 +33,7 @@ When a conflict cannot apply due to violating [constraints]({% link {{ page.vers

### Dead letter queue (DLQ)

When the LDR job starts, it will create a DLQ table with each replicating table so that unresolved conflicts can be tracked. The DLQ will contain the writes that LDR cannot apply after the retry period of a minute, which could occur if there is a unique index on the destination table (for more details, refer to [Unique seconday indexes]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}#unique-secondary-indexes)).

{{site.data.alerts.callout_info}}
LDR will not pause when the writes are sent to the DLQ, you must manage the DLQ manually.
{{site.data.alerts.end}}

To manage the DLQ, you can evaluate entries in the `incoming_row` column and apply the row manually to another table with SQL statements.
When the LDR job starts, it creates a DLQ table with each replicating table so that unresolved conflicts can be tracked. The DLQ contains the writes that LDR cannot apply after the retry period of a minute, which could occur if there is a unique index on the destination table (for more details, refer to [Unique secondary indexes]({% link {{ page.version.version }}/set-up-logical-data-replication.md %}#unique-secondary-indexes)).

As an example, for an LDR stream created on the `movr.public.promo_codes` table:

@@ -80,6 +74,73 @@ CONSTRAINT dlq_113_public_promo_codes_pkey PRIMARY KEY (ingestion_job_id ASC, dl
)
~~~

#### Resolve rows in the DLQ

LDR does not pause when writes are sent to the DLQ. You must manage the DLQ manually by examining each entry in the DLQ and either manually reinserting the row or deleting the entry from the DLQ. If you have multiple DLQ entries, resolve them in order from most recent to least recent.

To resolve a row in the DLQ:

1. On the destination, find the primary key value in the `incoming_row` column.

{% include_cached copy-clipboard.html %}
~~~ sql
SELECT id, dlq_timestamp, incoming_row FROM crdb_replication.dlq_271_foo;
~~~

In this example result, `incoming_row` contains a primary key value of `207` identified by the column `my_id`:

{% include_cached copy-clipboard.html %}
~~~ sql
        id        |         dlq_timestamp         |                                    incoming_row
------------------+-------------------------------+-------------------------------------------------------------------------------------
  106677386757203 | 2025-04-25 25:32:28.435439+00 | {"created_at": "2025-04-25:35:00.499499", "payload": "blahblahblah=", "my_id": 207}
~~~

1. Determine whether the value of the row matches on the source and the destination:

1. Check the value of the row and the replicated time:

{% include_cached copy-clipboard.html %}
~~~ sql
SELECT * FROM foo WHERE my_id = 207;
SELECT replicated_time FROM [SHOW LOGICAL REPLICATION JOBS];
~~~

1. On the source, check the value of the row as of the replicated time:

{% include_cached copy-clipboard.html %}
~~~ sql
SELECT * FROM foo AS OF SYSTEM TIME {replicated time} WHERE my_id = 207;
~~~

1. Determine a course of action based on the results of the previous steps:

1. If the value of the row is the same on both the source and the destination, delete the row from the DLQ on the destination:

{% include_cached copy-clipboard.html %}
~~~ sql
DELETE FROM crdb_replication.dlq_271_foo WHERE id = 106677386757203;
~~~

1. If the row's value on the destination is different from its value on the source, but the row's value on the source equals its value in the DLQ, update the row on the destination to have the same value as on the source:

{% include_cached copy-clipboard.html %}
~~~ sql
UPSERT INTO foo (my_id, created_at, payload) VALUES (207, '2025-04-25:35:00.499499', 'blahblahblah=');
~~~

1. If the row's value on the destination is different from its value on the source, and the row's value on the source equals its value in the DLQ, refresh the replicated time and retry the equality queries above. If the same results hold after a few retries with refreshed replicated times, there is likely a more recent entry for the row in the DLQ.

1. To find the more recent entry, find all rows in the DLQ with the matching primary key:

{% include_cached copy-clipboard.html %}
~~~ sql
-- On the destination:
SELECT id, dlq_timestamp, incoming_row FROM crdb_replication.dlq_271_foo WHERE incoming_row->>'my_id' = '207';
~~~

1. If there are more recent entries for the row, delete the less recent entries and resolve the row using the most recent entry.

## Schema changes

When you start LDR on a table, the job locks the schema, which prevents accidental [schema changes]({% link {{ page.version.version }}/online-schema-changes.md %}) that would cause issues for LDR. You can perform some [supported schema changes](#supported-schema-changes) on a replicating table; otherwise, you must stop LDR to [coordinate the schema change](#coordinate-other-schema-changes).