
Troubleshooting

ravipranjal edited this page Jul 30, 2024 · 45 revisions

This section contains information about how to monitor the system and how to troubleshoot specific issues that may occur with RES.

General Debugging and Monitoring

This section describes where diagnostic information can be found within RES.

Useful log and event information sources

RES retains several sources of information that can be referenced for troubleshooting and monitoring.

Log files on the environment Amazon EC2 instances

Log files exist on the Amazon EC2 instances used by RES. AWS Systems Manager Session Manager can be used to open a session to an instance to examine these files.

On infrastructure instances such as the cluster-manager and vdc-controller, application and other logs can be found at the following locations.

  • /opt/idea/app/logs/application.log
  • /root/bootstrap/logs/
  • /var/log/
  • /var/log/sssd/
  • /var/log/messages
  • /var/log/user-data.log
  • /var/log/cloud-init.log
  • /var/log/cloud-init-output.log
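To scan these locations quickly for problems, a loop like the following can be run from an SSM session on the host. This is a sketch: the paths come from the list above, and some files require root to read.

```shell
# Scan common RES infrastructure log locations for recent errors.
# Run inside an SSM Session Manager session on the host.
for f in /opt/idea/app/logs/application.log /var/log/messages /var/log/user-data.log; do
  if [ -f "$f" ]; then
    echo "== $f =="
    # Case-insensitive match; show only the last few hits to keep output readable.
    grep -i 'error' "$f" | tail -n 5
  fi
done
```

The same loop can be extended with the other paths listed above, such as /root/bootstrap/logs/.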

On a Linux virtual desktop, the following locations contain useful log files:

  • /var/log/dcv/
  • /root/bootstrap/logs/userdata.log
  • /var/log/messages

On Windows virtual desktop instances, logs can be found at:

  • C:\ProgramData\nice\dcv\log
  • C:\ProgramData\nice\DCVSessionManagerAgent\log

On Windows, some application logging can be found at:

  • C:\Program Files\NICE\DCV\Server\bin

On Windows, the NICE DCV certificate files can be found in:

  • C:\Windows\System32\config\systemprofile\AppData\Local\NICE\dcv\

Amazon CloudWatch Log Groups

The Amazon EC2 and AWS Lambda compute resources will log information to Amazon CloudWatch Log Groups. The log entries within them can provide useful information when troubleshooting potential issues or for general information.

Those groups will be named as follows:

  • /aws/lambda/<envname>-/ - Lambda related
  • /<envname>/
    • cluster-manager/ - main infrastructure host
    • vdc/ - virtual desktop related
      • dcv-broker/ - desktop related
      • dcv-connection-gateway/ - desktop related
      • controller/ - main desktop controller host
      • dcv-session/ - desktop session related

When examining log groups, it can be helpful to filter using upper and lower case strings such as the following. This will output only those messages containing the noted strings.

?"ERROR" ?"error"
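The same filter syntax also works from the AWS CLI. The sketch below assumes a log group named /res-demo/cluster-manager; substitute your environment name, and note the guard so the command is skipped where the AWS CLI is unavailable.

```shell
# Fetch only the error lines from a RES log group (group name is illustrative).
LOG_GROUP="/res-demo/cluster-manager"
FILTER='?"ERROR" ?"error"'          # same filter pattern syntax as the console
if command -v aws >/dev/null 2>&1; then
  aws logs filter-log-events \
    --log-group-name "$LOG_GROUP" \
    --filter-pattern "$FILTER" \
    --max-items 50
fi
```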

Another method of monitoring for issues is to create Amazon CloudWatch Dashboards that contain widgets displaying the data of interest.

An example is to create a widget that counts the occurrence of the strings error and ERROR and graph them as lines. This method makes it easier to detect the occurrence of potential issues or trends indicating a pattern change has occurred.

The following is an example of that for the infrastructure hosts. To use it, replace the <envname> and <region> values with the appropriate values.

{
    "widgets": [
        {
            "type": "log",
            "x": 0,
            "y": 0,
            "width": 24,
            "height": 6,
            "properties": {
                "query": "SOURCE '/<envname>/vdc/controller' | SOURCE '/<envname>/cluster-manager' | SOURCE '/<envname>/vdc/dcv-broker' | SOURCE '/<envname>/vdc/dcv-connection-gateway' | fields @timestamp, @message, @logStream, @log | filter @message like /(?i)(error|ERROR)/ | sort @timestamp desc | stats count() by bin(30s)",
                "region": "<region>",
                "title": "infrastructure hosts",
                "view": "timeSeries",
                "stacked": false
            }
        }
    ]
}

An example of the Dashboard might appear as follows.

res-cw-01

CloudFormation Stacks

The CloudFormation stacks created during environment creation contain resources, event, and output information associated with the configuration of the environment.

For each of the stacks, the Events, Resources, and Outputs tabs can be referred to for information about the stack.

RES stacks

  • <envname>-bootstrap
  • <envname>-cluster
  • <envname>-metrics
  • <envname>-directoryservice
  • <envname>-identity-provider
  • <envname>-shared-storage
  • <envname>-cluster-manager
  • <envname>-vdc
  • <envname>-bastion-host

Demo Environment Stack (If you are deploying a demo environment and do not have these external resources available, you can use AWS High Performance Compute recipes to generate resources for a demo environment.)

  • <envname>
  • <envname>-Networking
  • <envname>-DirectoryService
  • <envname>-Storage
  • <envname>-WindowsManagementHost

System failures reflected by Amazon EC2 Auto Scaling Group Activity

If the RES UIs are indicating server errors, the cause may be an application software or other issue.

Each infrastructure Amazon EC2 Auto Scaling group (ASG) has an Activity tab that is useful for detecting scaling activity for the instances. If UI pages show errors or are not accessible, check the Amazon EC2 console for multiple terminated instances, and check the Activity tab of the related ASG to determine whether Amazon EC2 instances are cycling.

If so, use the related Amazon CloudWatch log group for the instance to determine if errors are being logged that might indicate the cause of the issue. It may also be possible to use the SSM Session console to open a session to a running instance of that type and examine the log files on the instance to determine a cause before the instance is marked as unhealthy and terminated by the ASG.

The ASG console may show activity similar to the following if this issue is occurring.

res-targetgroup-01

Typical Amazon EC2 Console Appearance

This section contains screenshots of the system operating in various states.

Infrastructure hosts

When no desktops are running, the Amazon EC2 console will typically look similar to the following. The instances shown are the RES infrastructure Amazon EC2 hosts. The prefix in the name of the instances is the RES environment name.

res-ec2-01

Infrastructure hosts and virtual desktops

When virtual desktops are running, the Amazon EC2 console will appear similar to the following. In this case, the virtual desktops are noted in red. The suffix of the instance name is the user who created the desktop. The name in the center is the Session Name set at launch time, either the default MyDesktop or the name set by the user.

res-ec2-02

Hosts in a terminated state

When the Amazon EC2 console shows terminated instances, they will generally be desktop hosts that have been terminated. If the console includes infrastructure hosts in a terminated state, particularly if there are multiple of the same type, that may indicate a system issue in progress.

The following shows desktop instances that have been terminated.

res-ec2-03

Useful Active Directory (AD) related commands for reference

The following are examples of LDAP-related commands that can be run on infrastructure hosts to view AD configuration information. The domain and other parameters should reflect those entered at environment creation time.

ldapsearch "(cn=AWS Delegated Add Workstations To Domain Users)" -x -h corp.res.com -b "DC=corp,DC=res,DC=com" -D "CN=Admin,OU=Users,OU=CORP,DC=corp,DC=res,DC=com" -w <password>

ldapsearch "(&(objectClass=group))" -x -h corp.res.com -b "DC=corp,DC=res,DC=com" -D "CN=Admin,OU=Users,OU=CORP,DC=corp,DC=res,DC=com" -w <password>

Windows DCV debugging

On a Windows desktop, list the session associated with it using the following:

PS C:\Windows\System32\config\systemprofile\AppData\Local\NICE\dcv> & 'C:\Program Files\NICE\DCV\Server\bin\dcv.exe' list-sessions
Session: 'a7953489-9dbf-492b-8135-7709dccc4cab' (owner:admin2 type:console name:windows1)

Find NICE DCV Version Information

NICE DCV (https://aws.amazon.com/hpc/dcv/) is used for virtual desktop sessions. The following examples show how to determine the version of the installed DCV software.

Linux

[root@ip-10-3-157-194 ~]# /usr/bin/dcv version

NICE DCV 2023.0 (r14852) Copyright (C) 2010-2023 NICE s.r.l. All rights reserved.

This product is protected by copyright and licenses restricting use, copying, distribution, and decompilation.

Windows

PS C:\Windows\System32\config\systemprofile\AppData\Local\NICE\dcv> & 'C:\Program Files\NICE\DCV\Server\bin\dcv.exe' version

NICE DCV 2023.0 (r15065) Copyright (C) 2010-2023 NICE s.r.l. All rights reserved.

This product is protected by copyright and licenses restricting use, copying, distribution, and decompilation.

Issue RunBooks

The following section contains issues that may occur, how to detect them, and suggestions on how to resolve the issue.

I am not authorized to perform an action in Research and Engineering Studio on AWS

I am not authorized to perform iam:PassRole

If you receive an error that you're not authorized to perform the iam:PassRole action, your policies must be updated to allow you to pass a role to RES.

Some AWS services allow you to pass an existing role to that service instead of creating a new service role or service-linked role. To do this, you must have permissions to pass the role to the service.

The following example error occurs when an IAM user named marymajor tries to use the console to perform an action in RES. However, the action requires the service to have permissions that are granted by a service role. Mary does not have permissions to pass the role to the service.

User: arn:aws:iam::123456789012:user/marymajor is not authorized to perform: iam:PassRole

In this case, Mary's policies must be updated to allow her to perform the iam:PassRole action. If you need help, contact your AWS administrator. Your administrator is the person who provided you with your sign-in credentials.

I want to allow people outside of my AWS account to access my Research and Engineering Studio on AWS resources

You can create a role that users in other accounts or people outside of your organization can use to access your resources. You can specify who is trusted to assume the role. For services that support resource-based policies or access control lists (ACLs), you can use those policies to grant people access to your resources.

To learn more, consult the IAM documentation on creating roles for cross-account access.

Storage

I created a file system through RES but it doesn’t mount on the VDI hosts

The file systems need to be in the “Available” state before they can be mounted by VDI hosts. Follow the steps below to validate that the file system is in the required state.

Amazon EFS

  1. Go to the Amazon EFS console
  2. Check that the File system state is Available
  3. If the File system state is not Available, wait before launching VDI hosts

Amazon FSx ONTAP

  1. Go to the Amazon FSx console
  2. Check that the Status is Available
  3. If the Status is not Available, wait before launching VDI hosts

I onboarded a file system through RES but it doesn’t mount on the VDI hosts

File systems onboarded through RES must have the needed security group rules configured to allow VDI hosts to mount them. Because these file systems are created externally to RES, RES doesn’t manage the associated security group rules.

The security group associated with the onboarded file systems should allow the following inbound traffic:

  • NFS traffic (port 2049) from the Linux VDI hosts
  • SMB traffic (port 445) from the Windows VDI hosts
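Assuming the file system's security group ID and the VDI hosts' security group ID are known (the IDs below are placeholders), the two rules can be added with the AWS CLI:

```shell
# Placeholder security group IDs -- substitute your own.
FS_SG="sg-0123456789abcdef0"    # security group attached to the file system
VDI_SG="sg-0fedcba9876543210"   # security group attached to the VDI hosts
if command -v aws >/dev/null 2>&1; then
  # NFS for the Linux VDI hosts
  aws ec2 authorize-security-group-ingress \
    --group-id "$FS_SG" --protocol tcp --port 2049 --source-group "$VDI_SG"
  # SMB for the Windows VDI hosts
  aws ec2 authorize-security-group-ingress \
    --group-id "$FS_SG" --protocol tcp --port 445 --source-group "$VDI_SG"
fi
```

Referencing the VDI security group as the source keeps the rule valid as desktop instances come and go.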

I am not able to read/write on Amazon FSx for NetApp ONTAP from VDI hosts

ONTAP supports UNIX, NTFS, and MIXED security styles for volumes. The security style determines the type of permissions ONTAP uses to control data access and which client type can modify those permissions.

For example, if a volume uses UNIX security style, SMB clients can still access data (provided that they properly authenticate and authorize) due to the multi-protocol nature of ONTAP. However, ONTAP uses UNIX permissions that only UNIX clients can modify using native tools.

Example permission handling use cases

Using a UNIX-style volume with Linux workloads: permissions can be configured by a sudoer for other users. For example, the following gives all members of <group-ID> full read/write permissions on the /<project-name> directory:

sudo chown root:<group-ID> /<project-name>
sudo chmod 770 /<project-name>

Using an NTFS-style volume with Linux and Windows workloads: permissions can be configured on the file share using the share properties. For example, this can be used to give pentest_user_01 full control permissions on the file share myshare.

If the volume is going to be used by both Linux and Windows clients, set up a name mapping on the SVM that associates each Linux user name with the same user name in the NetBIOS domain format domain\username. This is needed to translate between Linux and Windows users. For reference, see Enabling multiprotocol workloads with Amazon FSx for NetApp ONTAP.

I created Amazon FSx for NetApp ONTAP from RES but it did not join my domain

Currently, if you create Amazon FSx for NetApp ONTAP from the RES console, the file system is provisioned but does not join the domain. To join the created ONTAP file system's SVM to your domain, see Joining SVMs to a Microsoft Active Directory and follow the steps in the Amazon FSx console. Make sure the required permissions are delegated to the Amazon FSx service account in AD. Once the SVM joins the domain successfully, go to SVM Summary > Endpoints > SMB DNS name and copy the DNS name; it will be needed later.

After joining the domain, edit the SMB DNS config key in the cluster settings DynamoDB table:

  1. Go to the Amazon DynamoDB console
  2. Tables > select <envname>-cluster-settings
  3. Explore table items > expand Filters, and apply the following filter:
    1. Attribute name - key
    2. Condition - Equal to
    3. Value - shared-storage..fsx_netapp_ontap.svm.smb_dns
  4. Select the returned item > Actions > Edit item
  5. Update the value to the SMB DNS name copied earlier
  6. Save and close
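The console edit can also be scripted. The sketch below assumes the cluster-settings item keys an attribute named key and stores its setting in a value attribute; the table name, key path, and DNS name are all illustrative placeholders, so substitute the exact values shown in your console.

```shell
# Hypothetical values -- substitute your environment name and the exact key
# shown in the DynamoDB console (the key contains your file system name).
TABLE="res-demo-cluster-settings"
ITEM_KEY="shared-storage.myfs.fsx_netapp_ontap.svm.smb_dns"  # illustrative key
SMB_DNS="myfs-svm.corp.res.com"                              # SMB DNS name copied earlier
if command -v aws >/dev/null 2>&1; then
  aws dynamodb update-item \
    --table-name "$TABLE" \
    --key '{"key": {"S": "'"$ITEM_KEY"'"}}' \
    --update-expression "SET #v = :v" \
    --expression-attribute-names '{"#v": "value"}' \
    --expression-attribute-values '{":v": {"S": "'"$SMB_DNS"'"}}'
fi
```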

Additionally, ensure security group associated with the file system allows traffic as recommended on File System Access Control with Amazon VPC. New VDI hosts using the file system will now be able to mount the domain joined SVM and file system.

Alternatively, you may onboard an existing file system that is already joined to your domain using the RES Onboard File System capability, from Environment Management > File Systems > Onboard File System.

Snapshots

A Snapshot has a status of Failed

On the RES Snapshots page, if a snapshot has a status of Failed, the cause can be determined by going to the Amazon CloudWatch log group for the cluster-manager for the time that the error occurred.

[2023-11-19 03:39:20,208] [INFO] [snapshots-service] creating snapshot in S3 Bucket: asdf at path s31
[2023-11-19 03:39:20,381] [ERROR] [snapshots-service] An error occurred while creating the snapshot: An error occurred (TableNotFoundException) when calling the UpdateContinuousBackups operation: Table not found: res-demo.accounts.sequence-config

A Snapshot fails to apply with logs indicating that the tables could not be imported.

If a snapshot taken from a previous environment fails to apply in a new environment, check the CloudWatch logs for the cluster-manager to identify the issue. If the logs mention that the required tables could not be imported, verify that the snapshot is in a valid state.

  1. Download the metadata.json file and verify that the ExportStatus for each table is COMPLETED. Ensure each table has the ExportManifest field set. If these fields are not set, the snapshot is in an invalid state and cannot be used with the apply snapshot functionality.
  2. After initiating a snapshot creation, ensure that the Snapshot status turns to COMPLETED in RES. The snapshot creation process takes up to 5 to 10 minutes. Reload or revisit the Snapshot Management page to ensure the snapshot was created successfully. This will ensure that the created snapshot is in a valid state.
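Step 1 can be partially automated. This sketch greps a downloaded metadata.json for the two fields named above; the field spellings are taken from the step, not verified against a real snapshot file.

```shell
# Check a downloaded snapshot metadata.json for completed table exports.
METADATA="metadata.json"
if [ -f "$METADATA" ]; then
  echo "tables with ExportStatus COMPLETED: $(grep -c '"ExportStatus": "COMPLETED"' "$METADATA")"
  echo "tables with an ExportManifest set:  $(grep -c '"ExportManifest"' "$METADATA")"
fi
```

Both counts should match the number of tables in the snapshot; a shortfall points at the invalid-state case described above.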

Infrastructure

Load balancer target groups without healthy instances

If issues such as server error messages are appearing in the UI or desktop sessions cannot connect, that may indicate an issue in the infrastructure Amazon EC2 instances.

The methods to determine the source of the issue are to first check the Amazon EC2 console for any Amazon EC2 instances that appear to be repeatedly terminating and being replaced by new instances. If that is the case, checking the Amazon CloudWatch logs may determine the cause.

Another method is to check the load balancers in the system. An indication of a possible system issue is a load balancer, found on the Amazon EC2 console, that does not show any healthy registered instances.

An example of a normal appearance is in the following:

res-targetgroup-01

If the Healthy entry is 0, that indicates that no Amazon EC2 instance is available to process requests.

If the Unhealthy entry is non-zero, that indicates that an Amazon EC2 instance may be cycling. This can be due to the installed application software not passing health checks.

If both Healthy and Unhealthy entries are 0, that indicates a potential network misconfiguration. An example might be the public and private subnets do not have corresponding AZs. If this condition occurs, there may be additional text on the console indicating that network state exists.

Launching Virtual Desktops

A virtual desktop that was previously working is no longer able to connect successfully

If a desktop connection closes or can no longer be connected to, the issue may be that the underlying Amazon EC2 instance has failed, or that the instance was stopped or terminated outside of the RES environment. The Admin UI may continue to show a Ready state, but attempts to connect fail.

Use the Amazon EC2 console to determine whether the instance has been terminated or stopped. If stopped, try starting it again. If the state is Terminated, another desktop will have to be created. Any data stored in the user's home directory should still be available when the new instance starts.

If the previously failed instance still appears in the Admin UI, it may need to be terminated using the Admin UI.

I am only able to launch 5 virtual desktops

The default limit for the number of virtual desktops that a user can launch is 5. An admin can change this using the Admin UI as follows:

  • go to Desktop Settings
  • select the Server tab
  • in the DCV Session panel, click the edit icon on the right
  • change the value in Allowed Sessions Per User to the desired new value
  • click Submit
  • refresh the page to confirm that the new setting is in place

Desktop Windows connect attempts fail with “The connection has been closed. Transport error”

If a Windows desktop connection fails with the UI error “The connection has been closed. Transport error”, the cause can be due to an issue in the DCV server software related to certificate creation on the Windows instance.

The Amazon CloudWatch log group /<envname>/vdc/dcv-connection-gateway may log the connection attempt error with messages similar to the following:

Nov 24 20:24:27.631 DEBUG HTTP:Splicer Connection{id=9}: Websocket{session_id="1291e75f-7816-48d9-bbb2-7371b3b911cd"}: Resolver lookup{client_ip=Some(52.94.36.19) session_id="1291e75f-7816-48d9-bbb2-7371b3b911cd" protocol_type=WebSocket extension_data=None}:NoStrictCertVerification: Additional stack certificate (0): [s/n: 0E9E9C4DE7194B37687DC94D2C0F5E94AF0DD57E]

Nov 24 20:25:15.384 INFO HTTP:Splicer Connection{id=21}:Websocket{ session_id="d1d35954-f29d-4b3f-8c23-6a53303ebc3f"}: Connection initiated error: unreachable, server io error Custom { kind: InvalidData, error: General("Invalid certificate: certificate has expired (code: 10)") }

Nov 24 20:25:15.384 WARN HTTP:Splicer Connection{id=21}: Websocket{session_id="d1d35954-f29d-4b3f-8c23-6a53303ebc3f"}: Error in websocket connection: Server unreachable: Server error: IO error: unexpected error: Invalid certificate: certificate has expired (code: 10)

If this occurs, a resolution may be to use SSM Session Manager to open a connection to the Windows instance and remove the following two certificate-related files.

PS C:\Windows\system32\config\systemprofile\AppData\Local\NICE\dcv> dir

    Directory: C:\Windows\system32\config\systemprofile\AppData\Local\NICE\dcv

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----         8/4/2022  12:59 PM           1704 dcv.key
-a----         8/4/2022  12:59 PM           1265 dcv.pem

The files should be automatically recreated and a subsequent connection attempt may be successful.

If this method resolves the issue and if new launches of Windows desktops produce the same error, use the Create Software Stack function to create a new Windows software stack of the fixed instance with the regenerated certificate files. That may produce a Windows software stack that can be used for successful launches and connections.

VDIs stuck in Provisioning state

If a desktop launch remains in the provisioning state in the Admin UI, this may be due to several reasons.

Methods to determine the cause include examining log files on the desktop instance for errors. This document contains a list of relevant log files and Amazon CloudWatch log groups in the section labelled Useful log and event information sources.

The following are potential causes of this issue.

  • The AMI ID used was registered as a software stack but is not supported by RES
    • The bootstrap provisioning script failed to complete because the AMI does not have the expected configuration or tooling. The log files on the instance, such as those under /root/bootstrap/logs/ on a Linux instance, may contain useful information. AMI IDs taken from the AWS Marketplace may not work for RES desktop instances; they require testing to confirm whether they are supported.
  • User data scripts are not executed when the Windows virtual desktop instance is launched from a custom AMI
    • By default, user data scripts run one time when an EC2 instance is launched. If you create an AMI from an existing virtual desktop instance, register a software stack with the AMI, and try to launch another virtual desktop with this software stack, user data scripts will not run on the new virtual desktop instance. To fix the issue, open a PowerShell command window as Administrator on the original virtual desktop instance used to create the AMI, and run: C:\ProgramData\Amazon\EC2-Windows\Launch\Scripts\InitializeInstance.ps1 -Schedule. Then create a new AMI from the instance. You can use the new AMI to register software stacks and launch new virtual desktops. Note that you may also run the same command on the instance that remains in the provisioning state and reboot it to fix the virtual desktop session, but you will hit the same issue again when launching another virtual desktop from the misconfigured AMI.

VDIs get into Error state after launching

Possible issue 1: The home filesystem has a directory for the user with different POSIX permissions.

If the following scenarios are true, this could be the issue you are facing:

  1. The RES version deployed is 2024.01 or higher
  2. The EnableLdapIDMapping attribute was set to True during deployment of the RES stack
  3. The home filesystem specified during RES stack deployment was used in a version prior to RES 2024.01, or was used in a previous environment with EnableLdapIDMapping set to False

Resolution steps: delete the user directories in the filesystem.

  1. Use SSM to connect to the cluster-manager host
  2. cd /home
  3. ls should list directories whose names match usernames, such as admin1, admin2, etc.
  4. Delete the directories with sudo rm -r 'dir_name'. Do not delete the ssm-user and ec2-user directories
  5. If the users are already synced to the new environment, delete the users from the users DDB table (except clusteradmin)
  6. Initiate an AD sync by executing sudo /opt/idea/python/3.9.16/bin/resctl ldap sync-from-ad on the cluster-manager EC2 instance
  7. Reboot the VDI instance in the Error state from the RES web page. Validate that the VDI gets into a Ready state in around 20 minutes

Virtual Desktop Component

Amazon EC2 instance is repeatedly showing terminated in the console

If an infrastructure instance is repeatedly showing as terminated in the Amazon EC2 console, the cause is potentially configuration related. The potential cause may be dependent on the infrastructure instance type. The following are methods to determine the cause.

If the vdc-controller instance is showing repeated terminated states in the Amazon EC2 console, this can be due to an incorrect Secret tag. Secrets that are maintained by RES have tags that are used as a part of the IAM access control policies attached to the infrastructure Amazon EC2 instances. If the vdc-controller is cycling and the following error appears in the CloudWatch log group, the cause may be that a secret has not been tagged correctly. Note that the secret needs to be tagged with the following:

{
    "res:EnvironmentName": "<envname>",
    "res:ModuleName": "virtual-desktop-controller"
}

where <envname> is the environment name, for example res-demo.

The Amazon CloudWatch log message for this error will appear similar to the following.

An error occurred (AccessDeniedException) when calling the GetSecretValue operation: User: arn:aws:sts::160215750999:assumed-role/<envname>-vdc-gateway-role-us-east-1/i-043f76a2677f373d0 is not authorized to perform: secretsmanager:GetSecretValue on resource: arn:aws:secretsmanager:us-east-1:160215750999:secret:Certificate-res-bi-Certs-5W9SPUXF08IB-F1sNRv because no identity-based policy allows the secretsmanager:GetSecretValue action

Check the tags on the secret and confirm that they match the list above.
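If the tags are missing, they can be added with the AWS CLI; the secret name and environment name below are placeholders.

```shell
# Placeholder secret name and environment name -- substitute your own.
SECRET_ID="Certificate-res-bi-Certs"   # illustrative secret name
ENV_NAME="res-demo"
if command -v aws >/dev/null 2>&1; then
  aws secretsmanager tag-resource \
    --secret-id "$SECRET_ID" \
    --tags Key=res:EnvironmentName,Value="$ENV_NAME" \
           Key=res:ModuleName,Value=virtual-desktop-controller
fi
```

After tagging, terminate the cycling instance so the ASG replaces it and the new instance retries the secret read.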

vdc-controller instance is cycling due to failing to join AD / eVDI module shows Failed API Health Check

If the eVDI module is failing its health check, it will show the following in the Environment Status section.

res-models

In this case, the general path for debugging is to look at the cluster-manager CloudWatch logs (log group named /<envname>/cluster-manager).

Possible issues:

  • If the logs contain the text Insufficient permissions, make sure the ServiceAccount username given at the time of RES stack creation is spelled correctly.

    Example log line: Insufficient permissions to modify computer account: CN=IDEA-586BD25043,OU=Computers,OU=RES,OU=CORP,DC=corp,DC=res,DC=com: 000020E7: AtrErr: DSID-03153943, #1: 0: 000020E7: DSID-03153943, problem 1005 (CONSTRAINT_ATT_TYPE), data 0, Att 90008 (userAccountControl):len 4 >> 432 ms - request will be retried in 30 seconds

    • You can access the ServiceAccount Username provided during RES deployment from SecretsManager. Find the corresponding secret in Secrets manager and click Retrieve Plain text. If the Username is incorrect, update the secret value by clicking the edit button. Terminate the current cluster-manager and vdc-controller instances. The new instances will come up in a stable state.
    • Note: The username must be ServiceAccount if you are utilizing the resources created by the provided external resources stack. If the DisableADJoin parameter was set to False during your deployment of RES, ensure the ServiceAccount user has permissions to create Computer objects in the AD.
  • If the username used was correct, but the logs contain the text Invalid credentials, then the password you entered might be wrong or have expired.

    Example log line: {'msgtype': 97, 'msgid': 1, 'result': 49, 'desc': 'Invalid credentials', 'ctrls': [], 'info': '80090308: LdapErr: DSID-0C090569, comment: AcceptSecurityContext error, data 532, v4563'}

    • You can read the password you entered during environment creation by accessing the secret that stores it in Secrets Manager. Select the secret (e.g. <env_name>directoryserviceServiceAccountPassword) and click the Retrieve plain text button.
    • If the password in the secret is incorrect, update the value in the secret by clicking the Edit button. Terminate the current cluster-manager and vdc-controller instances. The new instances will use the updated password and come up in a stable state.
    • If the password is correct, it may have expired in the connected Active Directory. First reset the password in the Active Directory, then update the secret. You can reset the user's password in the Active Directory from Directory Service → select the appropriate Directory ID → click Actions → Reset user password → fill out the form with the username (e.g. ServiceAccount) and the new password. If the newly set password differs from the previous password, update the password in the corresponding Secrets Manager secret (e.g. <env_name>directoryserviceServiceAccountPassword). Terminate the current cluster-manager and vdc-controller instances. The new instances will come up in a stable state.
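The Retrieve-and-Edit console steps can also be done from the CLI; a sketch, with the secret name taken from the example above:

```shell
# Secret name follows the example above -- substitute your environment's secret.
SECRET_ID="<env_name>directoryserviceServiceAccountPassword"
if command -v aws >/dev/null 2>&1; then
  # Read the current password (equivalent of "Retrieve plain text").
  aws secretsmanager get-secret-value --secret-id "$SECRET_ID" \
    --query SecretString --output text
  # Update it after resetting the password in Active Directory.
  aws secretsmanager put-secret-value --secret-id "$SECRET_ID" \
    --secret-string 'NEW-PASSWORD-HERE'
fi
```

As with the console flow, terminate the cluster-manager and vdc-controller instances afterwards so their replacements pick up the new value.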

Project does not appear in the pull down when editing the Software Stack to add it

This issue may be related to the following issue associated with syncing the user account with AD. If this issue appears, check the cluster-manager Amazon CloudWatch log group for the error “[user-home-init] account not available yet. waiting for user to be synced” to determine if the cause is the same or related.

cluster-manager Amazon CloudWatch log shows “[user-home-init] account not available yet. waiting for user to be synced” (where account is a user name)

The SQS subscriber is busy and stuck in an infinite loop because it cannot get to the user account. This code is triggered when trying to create a home filesystem for a user during user sync.

The cause of not being able to get to the user account may be that RES was not configured correctly for the AD in use. An example might be that the ServiceAccountUsername parameter, when using the BI/RES environment creation, is not the correct value, such as using ServiceAccount instead of Admin.

Windows desktop on login attempt says “Your account has been disabled. Please see your administrator”

res-windows-disabled-01

If the user is unable to log back in to a locked screen, this may indicate that the user has been disabled in the AD configured for RES after having successfully signed on via SSO.

The SSO login should fail if the user account has been disabled in AD.

DHCP Options issues with external/customer AD configuration

In a configuration where a customer-managed AD is used with RES, the Windows session frontend may show the error "The connection has been closed. Transport error".

Check the dcv-connection-gateway Amazon CloudWatch log for something similar to the following:

Oct 28 00:12:30.626 INFO HTTP:Splicer Connection{id=263}: Websocket{session_id="96cffa6e-cf2e-410f-9eea-6ae8478dc08a"}: Connection initiated error: unreachable, server io error Custom { kind: Uncategorized, error: "failed to lookup address information: Name or service not known" }

Oct 28 00:12:30.626 WARN HTTP:Splicer Connection{id=263}: Websocket{session_id="96cffa6e-cf2e-410f-9eea-6ae8478dc08a"}: Error in websocket connection: Server unreachable: Server error: IO error: failed to lookup address information: Name or service not known

Oct 28 00:12:30.627 DEBUG HTTP:Splicer Connection{id=263}: ConnectionGuard dropped

If you are using an AD domain controller in the DHCP options set for your own VPC, you need to:

  1. Add AmazonProvidedDNS to the two domain controller IPs
  2. Set the domain name to ec2.internal

An example is listed below. Without this configuration, the Windows desktop will give you a Transport error, because RES/DCV looks for the ip-10-0-x-xx.ec2.internal hostname.

[Screenshot: example DHCP options set (res-DHCP)]
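The same configuration can also be created from the AWS CLI. This is a sketch with placeholder domain controller IPs and resource IDs; substitute the values for your own VPC and domain controllers:

```shell
# Create a DHCP options set that lists AmazonProvidedDNS alongside the two
# domain controller IPs (10.0.0.10 and 10.0.1.10 are placeholders), and sets
# the domain name to ec2.internal.
aws ec2 create-dhcp-options \
  --dhcp-configurations \
    "Key=domain-name-servers,Values=AmazonProvidedDNS,10.0.0.10,10.0.1.10" \
    "Key=domain-name,Values=ec2.internal"

# Associate the new options set (the dopt-... ID from the output above)
# with your VPC (vpc-... ID is a placeholder).
aws ec2 associate-dhcp-options \
  --dhcp-options-id dopt-0123456789abcdef0 \
  --vpc-id vpc-0123456789abcdef0
```

Note that instances only pick up new DHCP options on DHCP lease renewal, so existing desktops may need a reboot before the change takes effect.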

Firefox error MOZILLA_PKIX_ERROR_REQUIRED_TLS_FEATURE_MISSING

When using the Firefox web browser, you might encounter the error MOZILLA_PKIX_ERROR_REQUIRED_TLS_FEATURE_MISSING when attempting to connect to a virtual desktop.

The cause is that the RES web server is set up with TLS and OCSP stapling enabled, but does not respond with the stapling validation (https://support.mozilla.org/en-US/questions/1372483).

The customer can fix this by using the following instructions: https://really-simple-ssl.com/mozilla_pkix_error_required_tls_feature_missing

Env deletion

res-xxx-cluster stack in ‘DELETE_FAILED’ state and cannot be deleted manually due to ‘Role is invalid or cannot be assumed’ error

If you notice that the ‘res-xxx-cluster’ stack is in ‘DELETE_FAILED’ state and cannot be deleted manually, you can perform the following steps to delete it.

If you see the stack in a ‘DELETE_FAILED’ state, first try to delete it manually. A dialog may pop up asking you to confirm Delete Stack; click 'Delete'.


Sometimes, even if you delete all the required stack resources, you may still see the message to select resources to retain. In that case, select all the resources as the "resources to retain" and click ‘Delete’.

You may see an error that looks like Role: arn:aws:iam::... is Invalid or cannot be assumed


This means that the role required to delete the stack was itself deleted before the stack. To get around this, copy the name of the role from the error message. Then go to the IAM console and create a role with that name, choosing AWS service as the trusted entity type and CloudFormation as the service.


Click Next. Make sure you give the role the ‘AWSCloudFormationFullAccess’ and ‘AdministratorAccess’ permissions policies.


Then go back to the CloudFormation console and delete the stack. The deletion should now succeed because the role exists again. Finally, go to the IAM console and delete the role you created.
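The role creation and cleanup steps above can also be done from the AWS CLI. This is a sketch in which ROLE_NAME is a placeholder for the exact role name copied from the error message:

```shell
# Recreate the deleted role with CloudFormation as the trusted service.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "cloudformation.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name ROLE_NAME \
  --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationFullAccess
aws iam attach-role-policy --role-name ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess

# After the stack deletes successfully, remove the temporary role again:
aws iam detach-role-policy --role-name ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationFullAccess
aws iam detach-role-policy --role-name ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam delete-role --role-name ROLE_NAME
```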

Collecting Logs

Logging into EC2 instance from EC2 console

  • Follow these instructions to log in to your Linux EC2 instance
  • Follow these instructions to log in to your Windows EC2 instance. Then open Windows PowerShell for running any commands.

Collecting Infrastructure host logs

  1. Cluster-manager: Get logs for the cluster manager from the following places and attach them to the ticket.
    1. All the logs from the CloudWatch log group <env-name>/cluster-manager
    2. All the logs under the /root/bootstrap/logs directory on the <env-name>-cluster-manager EC2 instance. Follow Logging into EC2 instance from EC2 console to log in to your instance.
  2. Vdc-controller: Get logs for the vdc-controller from the following places and attach them to the ticket.
    1. All the logs from the CloudWatch log group <env-name>/vdc-controller
    2. All the logs under the /root/bootstrap/logs directory on the <env-name>-vdc-controller EC2 instance. Follow Logging into EC2 instance from EC2 console to log in to your instance.

One easy way to get the logs is to follow the instructions in the Downloading logs from Linux EC2 instances section. The module name would be the instance name.

Collecting VDI logs

Identify corresponding EC2 instance

If a user launched a VDI with session name VDI1, the corresponding name of the instance on the EC2 console would be <env-name>-VDI1-<user name>.

Collect Linux VDI logs

Log in to the corresponding EC2 instance from the EC2 console following the instructions in the Logging into EC2 instance from EC2 console section. Get all the logs under the /root/bootstrap/logs and /var/log/dcv/ directories on the VDI EC2 instance.

One of the ways to get the logs would be to upload them to S3 and then download them from there. To do that, follow these steps to gather all the logs into one directory and then upload them:

  1. Follow these steps to copy the DCV logs under the /root/bootstrap/logs directory:
sudo su -
cd /root/bootstrap
mkdir -p logs/dcv_logs
cp -r /var/log/dcv/* logs/dcv_logs/
  2. Now, follow the steps listed in Downloading VDI Logs to download the logs.

Collect Windows VDI logs

Log in to the corresponding EC2 instance from the EC2 console following the instructions in the Logging into EC2 instance from EC2 console section. Get all the logs under the $env:SystemDrive\Users\Administrator\RES\Bootstrap\Log\ directory on the VDI EC2 instance.

One of the ways to get the logs would be to upload them to S3 and then download them from there. Follow the steps listed in Downloading VDI Logs to download the logs.

Downloading VDI Logs

  1. Update the VDI EC2 instance IAM role to allow S3 access
  2. Go to EC2 console → Select your VDI instance → Click on the IAM role it is using → In the Permission Policies section from the Add permissions dropdown menu select Attach Policies → Select AmazonS3FullAccess policy and then click on Add permissions to attach that policy.
  3. After that, you can follow the steps listed below based on your VDI type to download the logs. The module name would be the instance name.
    1. Downloading logs from Linux EC2 instances for Linux
    2. Downloading logs from Windows EC2 instances for Windows
  4. Lastly, edit the role to remove the AmazonS3FullAccess policy

Note: All VDIs use the same IAM role which is <env-name>-vdc-host-role-<region>

Downloading logs from Linux EC2 instances

Log in to the EC2 instance from which you want to download logs and run the following commands to upload all the logs to the S3 bucket:

sudo su -
ENV_NAME=<environment_name>
REGION=<region>
ACCOUNT=<aws_account_number>
MODULE=<module_name>

cd /root/bootstrap
tar -czvf ${MODULE}_logs.tar.gz logs/ --overwrite
aws s3 cp ${MODULE}_logs.tar.gz s3://${ENV_NAME}-cluster-${REGION}-${ACCOUNT}/${MODULE}_logs.tar.gz

After doing this, go to S3, select the bucket named <environment_name>-cluster-<region>-<aws_account_number>, and download the previously uploaded <module_name>_logs.tar.gz file.
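As an alternative to the S3 console, the archive can be fetched with the AWS CLI, reusing the same placeholder variables as the upload commands above:

```shell
# Download the previously uploaded archive to the current directory
# (ENV_NAME, REGION, ACCOUNT, and MODULE are the same placeholders as above).
aws s3 cp "s3://${ENV_NAME}-cluster-${REGION}-${ACCOUNT}/${MODULE}_logs.tar.gz" .
```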

Downloading logs from Windows EC2 instances

Log in to the EC2 instance from which you want to download logs and run the following commands to upload all the logs to the S3 bucket:

$ENV_NAME="<environment_name>"
$REGION="<region>"
$ACCOUNT="<aws_account_number>"
$MODULE="<module_name>"

$logDirPath = Join-Path -Path $env:SystemDrive -ChildPath "Users\Administrator\RES\Bootstrap\Log"
$zipFilePath = Join-Path -Path $env:TEMP -ChildPath "logs.zip"
# Remove any archive left over from a previous run (ignore it if absent)
Remove-Item $zipFilePath -ErrorAction SilentlyContinue
Compress-Archive -Path $logDirPath -DestinationPath $zipFilePath
$bucketName = "${ENV_NAME}-cluster-${REGION}-${ACCOUNT}"
$keyName = "${MODULE}_logs.zip"
# Write-S3Object is provided by the AWS Tools for PowerShell module
Write-S3Object -BucketName $bucketName -Key $keyName -File $zipFilePath

After doing this, go to S3, select the bucket named <environment_name>-cluster-<region>-<aws_account_number>, and download the previously uploaded <module_name>_logs.zip file.

Collecting ECS logs for WaitCondition error

  1. Go to the deployed stack → Resources Tab → expand Deploy → ResearchAndEngineeringStudio → Installer → Tasks → *CreateTaskDef → CreateContainer → LogGroup, and click on the log group to open Cloudwatch logs.
  2. Grab the latest log from this log group
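As an alternative to the console, the most recent events can be pulled with the AWS CLI v2; the log group name below is a placeholder for the group found in step 1:

```shell
# Stream the last hour of events from the container's log group
# (replace the group name with the one found via the stack's Resources tab).
aws logs tail "/placeholder/ecs-container-log-group" --since 1h
```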
