@lukeknep lukeknep commented Dec 23, 2025

DO NOT MERGE. Feedback requested first.

What does this PR do?

Corrects errors and clears up common confusion points around our RPO, RTO, and SLA.

  • Explain the difference between RTO and SLA, and why a 20-minute RTO can still meet a 99.99% SLA.
  • Clarify that Temporal-initiated failovers must be enabled for the RTO to apply.
  • Clarify that MRR and MCR are still protected against AZ failures and cell failures.
  • Fix the previously stated "8-hour RPO / RTO" for non-HA workloads.
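
For reviewers who want to sanity-check the first bullet, here is a rough sketch of the downtime budget implied by an availability target. It assumes availability is measured as plain wall-clock uptime over a window, which may not match how the SLA is actually calculated; the function name and the window lengths are illustrative only.

```python
def downtime_budget_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of downtime a window can absorb while still meeting the target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


# Compare the budget for a 99.99% target over two illustrative windows.
for label, days in [("30-day window", 30), ("365-day window", 365)]:
    budget = downtime_budget_minutes(99.99, days)
    print(f"{label}: {budget:.2f} minutes of allowed downtime")
```

Under this simplified model the budget is about 4.32 minutes over 30 days and about 52.56 minutes over 365 days, so whether a single 20-minute recovery fits depends on the measurement window and on how unavailability is counted. The SLA page remains the authority on the actual calculation.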

Internal Note on the previously-stated 8-hour RTO / RPO for non-HA Namespaces:

  • An "8-hour RTO" doesn't make sense when we are entirely dependent on the underlying cloud infrastructure: if the infra is down for 24 hours, there's no way we can meet an 8-hour RTO.
  • Conversely, if the infra is only down for 20 minutes, then an 8-hour RTO may be too long.
  • Additionally, the 8-hour RPO needs to be carefully explained, as it is not relevant to most outages; historically, most outages have not caused data corruption. But a customer who just reads "8-hour RPO" might erroneously think, "Oh no, if the region has an incident like the AWS us-east-1 incident, I may lose 8 hours of data."

Notes to reviewers

Todo items:

[ ] Must hear back from Eng re: what our RTO and RPO are for Same-region Replication
[ ] Must get alignment with Eng stakeholders @sergeybykov and @meiliang86 that this is an accurate framing of our RTO and RPO, especially re: the 8-hour RTO / RPO previously stated.
[ ] Determine whether we should discuss conflict resolution when talking about the RPO. Details are in Slack.

@lukeknep lukeknep requested review from a team and bechols as code owners December 23, 2025 21:37

vercel bot commented Dec 23, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| temporal-documentation | Ready | Preview, Comment | Dec 26, 2025 7:40pm |


github-actions bot commented Dec 23, 2025

📖 Docs PR preview links

@lukeknep lukeknep changed the title from "[WIP / Feedback requested] Rewriting the 'RPO and RTO' page to clear up common confusion" to "[Feedback requested DO NOT MERGE] Rewriting the 'RPO and RTO' page to clear up common confusion" on Dec 23, 2025
Temporal Cloud is designed to limit data loss after recovery when the incident triggering the failover is resolved.
- "Temporal-initiated failovers:" Also known as "automatic failovers," these failovers are initiated by Temporal's tooling and/or on-call engineers on Namespaces that have High Availability enabled. **Temporal highly recommends keeping Temporal-initiated failovers enabled,** which is the default for all Namespaces with High Availability features. Users can still trigger manual failovers on their Namespaces even if Temporal-initiated failovers are enabled. When Temporal-initiated failovers are disabled on a Namespace, Temporal's RTO for that Namespace is unbounded (it is dependent on how long the underlying outage lasts)

Temporal Cloud strives to maintain a P95 [replication lag](/cloud/high-availability/monitoring#replication-lag-metric) of less than 1 minute.
Contributor Author

@lukeknep lukeknep Dec 23, 2025

I removed this bit because

  1. I'm not sure P95 is good enough. This could be read as "up to 5% of Namespaces could be above the 1-minute RPO at any given moment."
  2. We already say we have a 1-minute RPO. I don't think we need additional standards / goals to be publicly stated. They would only add confusion. Let's state our main goal (RPO) and stand by it.
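
To illustrate point 1, here is a minimal sketch using made-up lag samples (not real Temporal Cloud data). It shows how a fleet can satisfy a "P95 under 1 minute" goal while one Namespace still sits well above the 1-minute RPO. The nearest-rank percentile helper and the sample values are assumptions chosen for illustration.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical replication lag, in seconds, for 20 Namespaces at one moment.
lag_seconds = [5.0] * 19 + [300.0]

p95 = percentile_nearest_rank(lag_seconds, 95)
over_rpo = [lag for lag in lag_seconds if lag > 60]

print(f"P95 lag: {p95} s")
print(f"Namespaces above 60 s: {len(over_rpo)} of {len(lag_seconds)}")
```

Here the P95 is 5 seconds, comfortably under a minute, yet one of the twenty hypothetical Namespaces is 300 seconds behind.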


Internally, our components are distributed across a minimum of three availability zones per region.
We implement a cell architecture.
We implement a [cell architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html).
Contributor

We've blogged about cells (https://temporal.io/blog/two-years-in#scale and https://temporal.io/blog/building-durable-cloud-control-systems-with-temporal#implementing-the-data-plane-a-cell-based-architecture), so we should have a first-class definition of what a Temporal Cloud cell is in our docs. I don't want to expand the scope here to adding a full Cloud architecture page (although I do think we should have one eventually); maybe add a section on /cloud/service-availability and link to that?

Contributor Author

Nice callout about the cells blog. I've pulled content from that and put it on the SLA page as suggested.


Recovery Point Objective (RPO) and Recovery Time Objective (RTO) define data durability and service restoration timelines, respectively.
In Temporal Cloud, these objectives vary depending on your deployment configuration and the scope of any failure.
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages.
Contributor

RPO and RTO are how we measure, and low values for RPO and RTO are what we strive for. Could tighten this phrasing.

Contributor Author

@lukeknep lukeknep Dec 26, 2025

That's not accurate (or at least, if I understand your comment correctly, that's not how the terms are used in the industry).

  • Recovery Point/Time "Objective": this is the goal we have for all outages. That's why the term has "Objective" in its name.
  • recovery time / recovery point: these are the actual observed values in a given outage. I could say "observed recovery time" or "achieved recovery time," but that gets bloated.

I wanted to make the distinction between the two terms really clear in the doc. If it's not clear, then I need to reword it.
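
The objective-versus-observed distinction above can be sketched in a few lines. The 20-minute RTO and 1-minute RPO figures are the ones discussed in this PR; the class names, field names, and the sample outage numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets set in advance for every outage (the "O" in RPO and RTO)."""
    rpo: timedelta
    rto: timedelta


@dataclass(frozen=True)
class ObservedRecovery:
    """What actually happened in one specific outage."""
    recovery_point: timedelta  # how much recent data was not recovered
    recovery_time: timedelta   # how long until service was restored

    def met(self, objectives: RecoveryObjectives) -> bool:
        """True when both observed values are within their objectives."""
        return (self.recovery_point <= objectives.rpo
                and self.recovery_time <= objectives.rto)


objectives = RecoveryObjectives(rpo=timedelta(minutes=1), rto=timedelta(minutes=20))

# Hypothetical outage: 10 seconds of data at risk, service restored in 12 minutes.
outage = ObservedRecovery(recovery_point=timedelta(seconds=10),
                          recovery_time=timedelta(minutes=12))
print(outage.met(objectives))
```

The objectives are fixed ahead of time, while each outage produces its own observed recovery point and recovery time to compare against them.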

P.S. Confirmed the industry standard with GPT 5.1:
image

Contributor

To me, the current wording boils down to "we strive for RPO", which doesn't really have any informational content without the actual numerical objective. "We strive for zero RPO" or "we strive for a sub-20-minute RTO" is informative.

Trying this wording with a similar concept: "Uptime is the objective that Temporal strives to meet for availability (service accessibility)". I think it's clearer/more informative to say something like "Temporal Cloud measures availability in terms of service uptime, and has a 99.99% availability SLO and 99.% availability SLA"

All that said: happy to merge as-is.

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) define data durability and service restoration timelines, respectively.
In Temporal Cloud, these objectives vary depending on your deployment configuration and the scope of any failure.
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages.
These objectives are high priority goals for Temporal Cloud, but are not a contractual commitment.
Contributor

This sounds rough. Can we say RPO + RTO aren't part of the availability SLA instead (and link to the SLA page)?

Contributor

Suggested change
These objectives are high priority goals for Temporal Cloud, but are not a contractual commitment.
Temporal Cloud's RPO and RTO are complementary to but separate from the [availability SLA](/cloud/sla).

In case of an outage in the active region or cell, Temporal Cloud will fail over to the replica so that existing Workflow Executions will continue to run and new Executions can be started.

## High Availability, Regional Failure
The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled:
Contributor

This breakdown is great!

Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including:

**Recovery Time Objective (RTO) - 20 minutes**
- Best-in-class data replication technology that keeps the replica up to date with the active.
Contributor

Could link to https://www.youtube.com/watch?v=mULBvv83dYM where Liang gets into more specifics


**All writes to storage are synchronously replicated across AZs**, including our writes to Elasticsearch.
Elasticsearch is eventually consistent, but this does not impact our RPO as there is no data loss.
- You can detect outages that Temporal doesn't. In the cloud, regional outages never affect every service the same way. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you monitor each service in your critical path and alert on unusual
Contributor

Can we suggest or link to specific guidance on how to do this?

lukeknep and others added 2 commits December 26, 2025 11:39
Co-authored-by: Ben Echols <benjamin.echols@temporal.io>
Co-authored-by: Ben Echols <benjamin.echols@temporal.io>
3 participants