-
Notifications
You must be signed in to change notification settings - Fork 874
Fix SynchronizationLockException exception when using IMDS for credentials #4211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…before the adapter plugins register.
| if (!_imdsRefreshFailed && | ||
| _lastRetrievedCredentials.IsExpiredWithin(TimeSpan.Zero)) | ||
| logger.DebugFormat("[Background Timer] Waiting on lock to refresh ECS IMDS"); | ||
| if (_credentialsSemaphore.Wait(_credentialsLockTimeout)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could _credentialsSemaphore already be disposed here before it tries to wait on it? Yes the exception handler would eat the exception but if we had a check here for _isDisposed we could may be able to avoid an unnecessary exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| { | ||
| try | ||
| logger.DebugFormat("Waiting on lock to refresh ECS IMDS"); | ||
| if (_credentialsSemaphore.Wait(_credentialsLockTimeout)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this try to wait on a disposed semaphore? Dispose and GetCredentials/GetCredentialsAsync could be called at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be the caller try to call a method on a disposed object which is invalid. Same as a user calling read operations on a disposed Stream. The extra protection done for dispose in the RenewCredentials was done because that is called via the timer and there could be timing issues shutting down the timer and disposing the semaphore.
| // lower refresh rate to speed test | ||
| var provider = DefaultInstanceProfileAWSCredentials.CreateTestDefaultInstanceProfileAWSCredentials(stub, TimeSpan.FromSeconds(5)); | ||
|
|
||
| // fix semaphore (production code's semaphore is created with initialCount:0 in this branch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the comment. The code does initialCount: 1. Also what is "this branch"? Note there are a few places where this comment exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the EnsureSemaphoreReleased and ResetPreviousRefreshFailedFlag as they were unnecessary artifacts when I use the AI agent to do the first pass at creating unit tests.
Draft Mode: I put it in draft mode because I want to do some testing on an actual EC2 instance to confirm the behavior of the unit tests.
Description
The
DefaultInstanceProfileAWSCredentialsused as the underlying "singleton" across service clients to fetch AWS credentials from EC2 IMDS used aReaderWriterLockSlimto control the number of threads accessing IMDS. In V3 this was okay because only the sync version was implemented but in V4 we added the async version andReaderWriterLockSlimis not valid to use across async/await call contexts. I also noticed in the previous logic within the scope of the reader lock it would also initiate a fetch credentials fetch essential the "write" logic that should have only been done when a Write lock was obtained.I reworked the logic to rely on a
SemaphoreSlimto control access and fixed it so the fetch credentials was only done when a thread had acquired the semaphore.The reason more users haven't run into the issue since V4 was GA last spring is generally the
GetCredentialsandGetCredentialsAsyncshould only be returning credentials that were previously fetched during the background call toRenewCredentials. TheRenewCredentialsmethod uses the sync calls. When @msab-john ran into the issue there must have been some other issue between the EC2 instance and IMDS that caused the backgroundRenewCredentialsto fail to acquire credentials and then whenGetCredentialsAsyncwas called it reverted to it's failsafe of directly trying to get credentials that triggered theSynchronizationLockExceptionI also did some refactoring of the
DefaultInstanceProfileAWSCredentialsto make it more friendly for unit testing.Motivation and Context
#4199
Testing
Add new unit tests
Dry Run: Success (DRY_RUN-6ef5e975-cba6-4794-939b-83e6620f3b50)
Direct EC2 Testing: Success
EC2 Testing
To confirm behavior in the real world I ran the following program continually that used 50 tasks with half using async and the other half using sync pattern to constantly fetch credentials. I also used our CloudWatch Logs .NET provider to send the logs to CloudWatch for monitoring. The library also added pressure of making service calls with the IMDS credentials to CloudWatch Logs.
The first run was successful with the background timer calling
RenewCredentialsalways taking care of making sure the credentials were refreshed. In this case the locking logic ofGetCredentialsandGetCredentialsAsyncwere never exercised because the locking logic in those methods is only used as a failsafe incase something goes wrong with the background timer andRenewCredentials.To exercise the locking logic of
GetCredentialsandGetCredentialsAsyncI deployed a modified version ofDefaultInstanceProfileAWSCredentialsthat disabled the background timer so the locking logic in theGetCredentialsandGetCredentialsAsyncwould be used. The logs showed the locking logic ran correctly with all the of tasks blocking on the Semaphore and only one actually doing the fetch to IMDS. All the other tasks would short circuit out of the lock once they got it and saw another task had already fetched the locks. Below is a sample of the logs during this time period.