Member

@leakonvalinka leakonvalinka commented Nov 10, 2025

This PR

  • adds the fatalStatusCodes option and a corresponding environment variable

Related Issues

resulted from this issue

Notes

I'm not too happy with how the fatal error is communicated through the different components (received at SyncStreamQueueSource -> FlagStore -> InProcessResolver -> FlagdProvider, or RpcResolver -> FlagdProvider for the RPC resolver). It "misuses" the STALE state to differentiate between normal errors and fatal errors. Unfortunately, I couldn't find a cleaner solution, so feedback on this would be highly appreciated!

Will work on the remaining failing tests once we agree on how to proceed!

Follow-up Tasks

How to test

Signed-off-by: lea konvalinka <lea.konvalinka@dynatrace.com>
@leakonvalinka leakonvalinka force-pushed the fix/flagd-infinite-connection-retries branch from f7f1d97 to f0a1db2 November 20, 2025 12:22
Signed-off-by: Konvalinka <lea.konvalinka@dynatrace.com>
@leakonvalinka leakonvalinka changed the title from "fix(flagd): no retry for certain error codes, implement test steps" to "feat(flagd): introduce fatalStatusCodes option" Dec 17, 2025
@leakonvalinka leakonvalinka marked this pull request as ready for review December 17, 2025 10:34
@leakonvalinka leakonvalinka requested a review from a team as a code owner December 17, 2025 10:34
Contributor

@chrfwow chrfwow left a comment

Since we do not want to introduce breaking changes into the API by adding a PROVIDER_FATAL type to ProviderEvent, I have two suggestions for how we might work around the "misuse" of the stale event:
We could add an isFatal flag to FlagdProviderEvent to track the type of error. I don't really like this, because the flag could also be set when the event is not an error event, and it splits information that belongs in one place across two.
Or we could create an enum ExtendedProviderEvent that is a copy of ProviderEvent (enums cannot be extended in Java) plus an additional PROVIDER_FATAL constant. We would then have to map between the two types where needed (I'm not 100% sure this will work). I don't like this either, because we would duplicate the ProviderEvent enum.
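
To make the second suggestion concrete, here is a minimal sketch of what such a mapping enum could look like. It is not code from this PR: the ProviderEvent enum below is a local stand-in for the SDK type so the example compiles on its own, and the choice to surface PROVIDER_FATAL as PROVIDER_ERROR on the public side is an assumption made for illustration.

```java
// Local stand-in for the SDK's ProviderEvent so this sketch is self-contained.
enum ProviderEvent {
    PROVIDER_READY,
    PROVIDER_CONFIGURATION_CHANGED,
    PROVIDER_ERROR,
    PROVIDER_STALE
}

// Hypothetical provider-internal enum: mirrors ProviderEvent and adds PROVIDER_FATAL.
enum ExtendedProviderEvent {
    PROVIDER_READY(ProviderEvent.PROVIDER_READY),
    PROVIDER_CONFIGURATION_CHANGED(ProviderEvent.PROVIDER_CONFIGURATION_CHANGED),
    PROVIDER_ERROR(ProviderEvent.PROVIDER_ERROR),
    PROVIDER_STALE(ProviderEvent.PROVIDER_STALE),
    // Assumption: a fatal event has no public counterpart and is reported as an error.
    PROVIDER_FATAL(ProviderEvent.PROVIDER_ERROR);

    private final ProviderEvent publicEvent;

    ExtendedProviderEvent(ProviderEvent publicEvent) {
        this.publicEvent = publicEvent;
    }

    // Maps the internal event to the event type exposed by the public API.
    ProviderEvent toPublicEvent() {
        return publicEvent;
    }

    boolean isFatal() {
        return this == PROVIDER_FATAL;
    }

    // Works because every public constant has an identically named internal constant.
    static ExtendedProviderEvent fromPublicEvent(ProviderEvent event) {
        return valueOf(event.name());
    }
}
```

The duplication concern raised above is visible here: every constant added to ProviderEvent would also have to be added to ExtendedProviderEvent, otherwise fromPublicEvent throws an IllegalArgumentException.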

private final BlockingQueue<QueuePayload> outgoingQueue = new LinkedBlockingQueue<>(QUEUE_SIZE);
private final FlagSyncServiceStub flagSyncStub;
private final FlagSyncServiceBlockingStub metadataStub;
private final List<String> fatalStatusCodes;
Contributor

Since we do lots of .contains operations on this data structure, a HashSet might be more performant. How many entries do we expect in this list?

Member Author

@leakonvalinka leakonvalinka Dec 17, 2025

That's hard for me to estimate. What do the others think? The currently defined default is an empty list.
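
For reference, a set-backed lookup could be sketched as follows. The class name FatalStatusCodes is hypothetical, and the sketch assumes the configured values are gRPC status code names such as UNAUTHENTICATED. Since gRPC defines 17 status codes in total, the collection stays small either way, so the difference between a list and a set is likely to be minor in practice.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical wrapper: copies the configured list into a set once, at construction time.
final class FatalStatusCodes {
    private final Set<String> codes;

    FatalStatusCodes(List<String> configured) {
        // Duplicates in the configuration collapse into a single entry.
        this.codes = Collections.unmodifiableSet(new HashSet<>(configured));
    }

    // Expected constant-time membership check, independent of how many codes are configured.
    boolean contains(String statusCodeName) {
        return codes.contains(statusCodeName);
    }

    int size() {
        return codes.size();
    }
}
```

The public option could stay a List<String>, with the conversion happening once where the sync source is constructed.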

.map(String::trim)
.collect(Collectors.toList()) : defaultValue;
} catch (Exception e) {
return defaultValue;
Contributor

We should log an info/warn message when the env var is invalid.

Member Author

Just for this method, or for the other ones too? I'd either leave it as is or add it in all cases to be consistent.

Contributor

Then we should add it everywhere, but in a different PR.

Member Author

Alright, sounds good. Should we create a new issue for this or is that overkill?
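
As a starting point for that follow-up, a list parser that logs instead of silently falling back might look roughly like this. It is a sketch under assumptions, not the project's helper: EnvListParser and parseListOrDefault are hypothetical names, the environment lookup is passed in as a function so the behaviour can be exercised without real environment variables, and java.util.logging is used only to keep the example free of dependencies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;
import java.util.stream.Collectors;

// Hypothetical helper, not the provider's actual implementation.
final class EnvListParser {
    private static final Logger LOG = Logger.getLogger(EnvListParser.class.getName());

    // envLookup would be System::getenv in production code.
    static List<String> parseListOrDefault(
            String varName, Function<String, String> envLookup, List<String> defaultValue) {
        try {
            String raw = envLookup.apply(varName);
            if (raw == null || raw.trim().isEmpty()) {
                // An unset or blank variable is not an error, so no warning is logged.
                return defaultValue;
            }
            return Arrays.stream(raw.split(","))
                    .map(String::trim)
                    .filter(entry -> !entry.isEmpty())
                    .collect(Collectors.toList());
        } catch (Exception e) {
            // Tell the user that the value was ignored instead of failing silently.
            LOG.warning("Could not read " + varName + ", falling back to default: " + e);
            return defaultValue;
        }
    }
}
```

A call such as EnvListParser.parseListOrDefault(name, System::getenv, List.of()) would then cover the fatalStatusCodes case, and the same pattern could be applied to the other fallback helpers.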

* Defaults to empty list
*/
@Builder.Default
private List<String> fatalStatusCodes = fallBackToEnvOrDefaultList(Config.FATAL_STATUS_CODES_ENV_VAR_NAME, List.of());
Contributor

Do we really want to retry on every error code by default? How is this handled in our other SDKs?

Member Author

No, you're right, I'll rephrase it to "for which the provider transitions into fatal mode upon first connection". The general retry policy is defined here and is the same for all SDKs, as far as I know.
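
One possible reading of "fatal mode upon first connection" is sketched below: a configured status code is treated as fatal only while no connection has succeeded yet, and every other error goes through the normal retry path. The class and method names are hypothetical, and whether the PR applies exactly this rule is an assumption.

```java
import java.util.Set;

// Hypothetical decision helper; not thread-safe, for illustration only.
final class FirstConnectionErrorPolicy {
    enum Action { ENTER_FATAL, RETRY }

    private final Set<String> fatalStatusCodes;
    private boolean connectedOnce = false;

    FirstConnectionErrorPolicy(Set<String> fatalStatusCodes) {
        this.fatalStatusCodes = Set.copyOf(fatalStatusCodes);
    }

    // Called once the sync stream has delivered its first successful response.
    void onConnected() {
        connectedOnce = true;
    }

    // Assumption: fatal codes only apply before the first successful connection.
    Action onStreamError(String statusCodeName) {
        if (!connectedOnce && fatalStatusCodes.contains(statusCodeName)) {
            return Action.ENTER_FATAL;
        }
        return Action.RETRY;
    }
}
```

With the default empty set, onStreamError always returns RETRY, which matches the current behaviour of retrying on every status code.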

if (syncResources.getPreviousEvent() != ProviderEvent.PROVIDER_ERROR) {
onError();
syncResources.setPreviousEvent(ProviderEvent.PROVIDER_ERROR);
case PROVIDER_STALE:
Contributor

Please update the Javadoc above the switch; we now use the STALE state.

Member Author

Will do once we agree on a final implementation.

Member

guidobrei commented Dec 17, 2025

I'm not too happy with how the fatal error is communicated through the different components (received at SyncStreamQueueSource -> FlagStore -> InProcessResolver -> FlagdProvider...)

This is a consequence of our provider design, and there is not really anything we can do about it (in this PR).
