@normj (Member) commented Dec 9, 2025

Draft Mode: I put it in draft mode because I want to do some testing on an actual EC2 instance to confirm the behavior of the unit tests.

Description

The DefaultInstanceProfileAWSCredentials class, used as the underlying "singleton" across service clients to fetch AWS credentials from EC2 IMDS, used a ReaderWriterLockSlim to control the number of threads accessing IMDS. In V3 this was okay because only the sync version was implemented, but in V4 we added the async version, and ReaderWriterLockSlim is not valid to use across async/await call contexts. I also noticed that the previous logic would initiate a credentials fetch within the scope of the reader lock. That fetch is essentially the "write" logic and should only have been done once a write lock was obtained.

I reworked the logic to rely on a SemaphoreSlim to control access and fixed it so that credentials are only fetched once a thread has acquired the semaphore.
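
For readers unfamiliar with the pattern, here is a minimal sketch of a SemaphoreSlim-guarded refresh with a re-check after the lock is acquired. It is not the SDK code: CachedSecret, Entry, TryGetFresh, Store and the 20 ms simulated fetch are all hypothetical names and values chosen for illustration.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// 50 callers, half async and half sync, all arriving while the cache is empty.
var cache = new CachedSecret(TimeSpan.FromMinutes(5));
var callers = Enumerable.Range(0, 50)
    .Select(n => n % 2 == 0
        ? cache.GetAsync()
        // Dedicated threads so blocked sync callers do not starve the thread pool.
        : Task.Factory.StartNew(() => cache.Get(), TaskCreationOptions.LongRunning))
    .ToArray();
await Task.WhenAll(callers);
Console.WriteLine($"Fetches: {cache.FetchCount}"); // prints "Fetches: 1"

// Hypothetical stand-in for a credentials cache; not the SDK's class.
sealed class CachedSecret
{
    private sealed record Entry(string Value, DateTime ExpiresUtc);

    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly TimeSpan _lifetime;
    private volatile Entry? _entry; // null until the first fetch
    private int _fetchCount;

    public CachedSecret(TimeSpan lifetime) => _lifetime = lifetime;

    public int FetchCount => Volatile.Read(ref _fetchCount);

    public string Get()
    {
        if (TryGetFresh(out var cached)) return cached;

        _gate.Wait();
        try
        {
            // Re-check: another caller may have refreshed while this one waited.
            if (TryGetFresh(out cached)) return cached;
            Interlocked.Increment(ref _fetchCount);
            Thread.Sleep(20); // simulated slow fetch
            return Store("secret");
        }
        finally { _gate.Release(); }
    }

    public async Task<string> GetAsync()
    {
        if (TryGetFresh(out var cached)) return cached;

        // SemaphoreSlim may be released on a different thread than the one that acquired it.
        await _gate.WaitAsync().ConfigureAwait(false);
        try
        {
            if (TryGetFresh(out cached)) return cached;
            Interlocked.Increment(ref _fetchCount);
            await Task.Delay(20).ConfigureAwait(false); // simulated slow fetch
            return Store("secret");
        }
        finally { _gate.Release(); }
    }

    private bool TryGetFresh(out string value)
    {
        var entry = _entry; // single volatile read of an immutable snapshot
        value = entry?.Value ?? "";
        return entry != null && DateTime.UtcNow < entry.ExpiresUtc;
    }

    private string Store(string value)
    {
        _entry = new Entry(value, DateTime.UtcNow + _lifetime);
        return value;
    }
}
```

The fetch happens only inside the semaphore, and every caller that acquires the semaphore afterwards returns the value the first caller stored.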

The reason more users haven't run into the issue since V4 went GA last spring is that GetCredentials and GetCredentialsAsync should generally only return credentials that were previously fetched during the background call to RenewCredentials. The RenewCredentials method uses the sync calls. When @msab-john ran into the issue, there must have been some other problem between the EC2 instance and IMDS that caused the background RenewCredentials to fail to acquire credentials. When GetCredentialsAsync was then called, it fell back to its failsafe of fetching credentials directly, which triggered the SynchronizationLockException.
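
The thread affinity behind that exception can be reproduced in isolation. The sketch below is not taken from the SDK. It forces the thread switch with an explicit Thread so the result is deterministic, whereas in async code the switch happens when the continuation after an await resumes on a different thread-pool thread. The variable names are illustrative.

```csharp
using System;
using System.Threading;

var rwLock = new ReaderWriterLockSlim();
rwLock.EnterReadLock(); // acquired on the main thread

var observed = "no exception";
var other = new Thread(() =>
{
    try
    {
        rwLock.ExitReadLock(); // released on a thread that never entered the lock
    }
    catch (SynchronizationLockException ex)
    {
        observed = ex.GetType().Name;
    }
});
other.Start();
other.Join();
Console.WriteLine(observed); // prints "SynchronizationLockException"

rwLock.ExitReadLock(); // the main thread still holds the read lock; release it properly

// SemaphoreSlim has no thread affinity: any thread may release it.
var semaphore = new SemaphoreSlim(1, 1);
semaphore.Wait();
var releaser = new Thread(() => semaphore.Release());
releaser.Start();
releaser.Join();
Console.WriteLine(semaphore.CurrentCount); // prints "1"
```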

I also did some refactoring of the DefaultInstanceProfileAWSCredentials to make it more friendly for unit testing.

Motivation and Context

#4199

Testing

Added new unit tests
Dry Run: Success (DRY_RUN-6ef5e975-cba6-4794-939b-83e6620f3b50)
Direct EC2 Testing: Success

EC2 Testing

To confirm behavior in the real world, I continually ran the following program, which uses 50 tasks, half using the async pattern and half using the sync pattern, to constantly fetch credentials. I also used our CloudWatch Logs .NET provider to send the logs to CloudWatch for monitoring. The logging library added further pressure by making service calls to CloudWatch Logs with the IMDS credentials.

using Amazon.Runtime.Credentials;
using Microsoft.Extensions.Logging;

var loggerFactory = LoggerFactory.Create(builder =>
{
    builder
        .AddConsole()
        .AddAWSProvider(new AWS.Logger.AWSLoggerConfig
        {
            LogGroup = "RefreshCredentialsTest"

        })
        .SetMinimumLevel(LogLevel.None)
        .AddFilter((category, level) =>
            (category != null && (category.Contains("DefaultInstanceProfileAWSCredentials", StringComparison.OrdinalIgnoreCase)) || string.Equals(category, "RefreshCredentialsTest", StringComparison.Ordinal)) && level >= LogLevel.Debug);
});

loggerFactory.ConfigureAWSSDKLogging();

var logger = loggerFactory.CreateLogger("RefreshCredentialsTest");

var awsCredentials = DefaultAWSCredentialsIdentityResolver.GetCredentials();

logger.LogInformation("Type of AWSCredentials: {Type}", awsCredentials.GetType().FullName);

var tasks = new List<Task>();

ulong counter = 0;
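
for (var i = 0; i < 50; i++)
{
    // Decide sync vs. async once per task. A lambda captures the single shared
    // loop variable of a for loop, so reading `i` inside the task would see its
    // final value instead of this iteration's value.
    var useAsync = i % 2 == 0;

    tasks.Add(Task.Run(async () =>
    {
        while (true)
        {
            if (useAsync)
            {
                await awsCredentials.GetCredentialsAsync();
            }
            else
            {
                awsCredentials.GetCredentials();
            }

            // Random.Shared is thread-safe; a single Random instance shared across tasks is not.
            await Task.Delay(Random.Shared.Next(100, 300));

            var newValue = Interlocked.Increment(ref counter);
            if (newValue % 10000 == 0)
            {
                logger.LogInformation("Fetched credentials {Count} times.", newValue);
            }
        }
    }));
}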

await Task.WhenAll(tasks.ToArray());

The first run was successful, with the background timer's calls to RenewCredentials always making sure the credentials were refreshed. In this case the locking logic of GetCredentials and GetCredentialsAsync was never exercised, because the locking logic in those methods is only used as a failsafe in case something goes wrong with the background timer and RenewCredentials.

To exercise the locking logic of GetCredentials and GetCredentialsAsync, I deployed a modified version of DefaultInstanceProfileAWSCredentials that disabled the background timer, so that the locking logic in GetCredentials and GetCredentialsAsync would be used. The logs showed the locking logic ran correctly, with all of the tasks blocking on the semaphore and only one actually doing the fetch from IMDS. All the other tasks would short-circuit out of the lock once they got it and saw that another task had already fetched the credentials. Below is a sample of the logs during this time period.

2025-12-09T15:28:04.001-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.001-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.001-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.001-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.001-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Fetching credentials from ECS IMDS
2025-12-09T15:28:04.009-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.010-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.010-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.010-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.010-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.010-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.010-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.010-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Waiting on lock to refresh ECS IMDS
2025-12-09T15:28:04.019-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.020-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Obtained lock to refresh ECS IMDS
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Another thread has refreshed credentials and reusing those credentials
2025-12-09T15:28:04.021-08:00 [Debug] Amazon.Runtime.DefaultInstanceProfileAWSCredentials: Releasing lock after refreshing ECS IMDS

if (!_imdsRefreshFailed &&
_lastRetrievedCredentials.IsExpiredWithin(TimeSpan.Zero))
logger.DebugFormat("[Background Timer] Waiting on lock to refresh ECS IMDS");
if (_credentialsSemaphore.Wait(_credentialsLockTimeout))
Contributor

Could _credentialsSemaphore already be disposed here before it tries to wait on it? Yes, the exception handler would eat the exception, but if we had a check here for _isDisposed we might be able to avoid an unnecessary exception.

Member Author

Done

{
try
logger.DebugFormat("Waiting on lock to refresh ECS IMDS");
if (_credentialsSemaphore.Wait(_credentialsLockTimeout))
@boblodgett (Contributor) commented Dec 9, 2025

Couldn't this try to wait on a disposed semaphore? Dispose and GetCredentials/GetCredentialsAsync could be called at the same time.

Member Author

That would be the caller trying to call a method on a disposed object, which is invalid. It is the same as a user calling read operations on a disposed Stream. The extra dispose protection in RenewCredentials was added because that method is called via the timer, and there could be timing issues between shutting down the timer and disposing the semaphore.
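
As an illustration of the kind of guard discussed in this thread, here is a rough sketch of a timer callback that checks a disposed flag before waiting and still tolerates a race with Dispose. Refresher, OnTimerTick and the 5 second timeout are hypothetical and are not the SDK's actual members.

```csharp
using System;
using System.Threading;

var refresher = new Refresher();
Console.WriteLine(refresher.OnTimerTick()); // prints "True"
refresher.Dispose();
// A late timer tick after Dispose returns without touching the disposed semaphore.
Console.WriteLine(refresher.OnTimerTick()); // prints "False"

// Hypothetical class for illustration only.
sealed class Refresher : IDisposable
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private volatile bool _isDisposed;

    // Returns true when the refresh work ran.
    public bool OnTimerTick()
    {
        // Cheap check first, so the common shutdown case raises no exception.
        if (_isDisposed) return false;

        try
        {
            if (!_gate.Wait(TimeSpan.FromSeconds(5))) return false;
            try
            {
                // The refresh work would go here.
                return true;
            }
            finally { _gate.Release(); }
        }
        catch (ObjectDisposedException)
        {
            // Dispose ran between the flag check and the semaphore call.
            return false;
        }
    }

    public void Dispose()
    {
        _isDisposed = true;
        _gate.Dispose();
    }
}
```

The flag check avoids the exception in the common case, and the catch covers the remaining window where Dispose runs after the check.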

// lower refresh rate to speed test
var provider = DefaultInstanceProfileAWSCredentials.CreateTestDefaultInstanceProfileAWSCredentials(stub, TimeSpan.FromSeconds(5));

// fix semaphore (production code's semaphore is created with initialCount:0 in this branch)
Contributor

I don't understand the comment. The code does initialCount: 1. Also what is "this branch"? Note there are a few places where this comment exists.

Member Author

Removed EnsureSemaphoreReleased and ResetPreviousRefreshFailedFlag, as they were unnecessary artifacts from when I used the AI agent to do the first pass at creating unit tests.

@normj marked this pull request as ready for review December 10, 2025 17:25
@normj merged commit 62c81ad into development Dec 10, 2025
4 checks passed
@normj deleted the normj/fix-imds-locking branch December 10, 2025 17:26