Channel: Ask the Directory Services Team

Troubleshooting failed password changes after installing MS16-101


Hi!

Linda Taylor here, Senior Escalation Engineer in the Directory Services space.

I have spent the last month working with customers worldwide who experienced password change failures after installing the updates released under security bulletin MS16-101 (KBs listed below), and working with the product group to get those issues addressed and documented in the "Known issues" sections of the public KB articles. It has been busy!

In this post I aim to give you a quick "cheat sheet" of known issues and the actions needed for each, along with ideas and troubleshooting techniques to help you get there.

Let’s start by understanding the changes.

The following six articles describe the changes in MS16-101 and list the known issues. If you have not yet applied MS16-101, I strongly recommend reading them and understanding how they may affect you.

        3176492 Cumulative update for Windows 10: August 9, 2016
        3176493 Cumulative update for Windows 10 Version 1511: August 9, 2016
        3176495 Cumulative update for Windows 10 Version 1607: August 9, 2016
        3178465 MS16-101: Security update for Windows authentication methods: August 9, 2016
        3167679 MS16-101: Description of the security update for Windows authentication methods: August 9, 2016
        3177108 MS16-101: Description of the security update for Windows authentication methods: August 9, 2016

The good news is that this month’s updates address some of the known issues with MS16-101.

The bad news is that not all of the issues are caused by a code defect in MS16-101. In some cases the right solution is to make your environment more secure by ensuring that the password change can happen over Kerberos and does not need to fall back to NTLM. That may include opening the TCP ports used by Kerberos, fixing other Kerberos problems such as missing SPNs, or changing your application code to pass in a valid domain name.

Let’s start with the basics…

Symptoms:

After applying the MS16-101 updates listed above, password changes may fail with one of the following errors:

"The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you."

or

"The system cannot contact a domain controller to service the authentication request. Please try again later."

This text maps to the error codes below:

Hexadecimal: 0xC0000388
Decimal: -1073740920
Symbolic: STATUS_DOWNGRADE_DETECTED
Friendly: The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.

Hexadecimal: 0x800704F1
Decimal: 1265
Symbolic: ERROR_DOWNGRADE_DETECTED
Friendly: The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.

Question: What does MS16-101 do and why would password changes fail after installing it?

Answer: As documented in the listed KB articles, the security updates provided in MS16-101 disable the ability of the Microsoft Negotiate SSP to fall back to NTLM for password change operations when Kerberos fails with the STATUS_NO_LOGON_SERVERS (0xC000005E) error code.
In this situation, the password change now fails (post MS16-101) with the above-mentioned error codes (ERROR_DOWNGRADE_DETECTED / STATUS_DOWNGRADE_DETECTED).
Important: Password RESET is not affected by MS16-101 in any scenario. Only password change using the Negotiate package is affected.

Now that you understand the change, let's look at the known issues and learn how best to identify and resolve them.

Summary and Cheat Sheet

To make it easier to follow I have matched the ordering of known issues in this post with the public KB articles above.

First, when troubleshooting a failed password change post MS16-101, you need to understand HOW and WHERE the password change is happening, and whether it is for a domain account or a local account. Here is a cheat sheet.

Summary of scenarios and a quick reference of actions needed.

Scenario 1: Domain password change fails via CTRL+ALT+DEL and shows an error like:
"The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you."
Action needed: Troubleshoot using this guide and fix Kerberos.

Scenario 2: Domain password change fails via application code with an INCORRECT/UNEXPECTED error code when a password that does not meet password complexity is entered. For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION; after installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash.
Note: In these cases the password change works when a correct new password is entered that complies with the password policy.
Action needed: Install the October fixes in the table below.

Scenario 3: Local user account password change fails via CTRL+ALT+DEL or application code.
Action needed: Install the October fixes in the table below.

Scenario 4: Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.
Action needed: None. By design.

Scenario 5: Domain password change fails via application code when a good password is entered. This is the case where you pass a server name to NetUserChangePassword: the password change fails post MS16-101 because it previously worked by relying on NTLM. NTLM is insecure and Kerberos is always preferred, so passing a valid domain name here is the way forward.
One thing to note: most of the ADSI and C#/.NET ChangePassword APIs end up calling NetUserChangePassword under the hood, so passing invalid domain names to those APIs will also fail. I have provided a detailed walkthrough example in this post with log snippets.
Action needed: Troubleshoot using this guide and fix the code to use Kerberos.

Scenario 6: After you install the MS16-101 update, you may encounter 0xC0000022 NTLM authentication errors.
Action needed: See KB3195799, "NTLM authentication fails with 0xC0000022 error for Windows Server 2012, Windows 8.1, and Windows Server 2012 R2 after update is applied."

Scenario 7: After you install the security updates described in MS16-101, remote programmatic changes of a local user account password, and password changes across untrusted forests, fail with the STATUS_DOWNGRADE_DETECTED error as documented in this post. This happens because the operation relies on NTLM fallback, since there is no Kerberos without a trust, and NTLM fallback is forbidden by MS16-101.
Action needed: Install the October fixes in the table below and set the NegoAllowNtlmPwdChangeFallback registry value documented in the KBs below, which allows the NTLM fallback to happen again and unblocks this scenario:

http://support.microsoft.com/kb/3178465
http://support.microsoft.com/kb/3167679
http://support.microsoft.com/kb/3177108
http://support.microsoft.com/kb/3176492
http://support.microsoft.com/kb/3176495
http://support.microsoft.com/kb/3176493

Note: you may also consider using this registry value in an emergency for known issue 5 when it takes time to update the application code. However, please read the above articles carefully and treat this only as a short-term workaround for scenario 5.
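For reference, the fallback can be re-enabled from an elevated CMD prompt by setting the registry value described in the KBs above. Verify the exact semantics against those KB articles before deploying; this sketch assumes the documented behavior where a value of 1 allows NTLM fallback for password changes:

reg add HKLM\SYSTEM\CurrentControlSet\Control\Lsa /v NegoAllowNtlmPwdChangeFallback /t REG_DWORD /d 1 /f

Remember that this re-opens the NTLM fallback path that MS16-101 closed, so it should be a temporary measure for scenario 5 and used only where scenario 7 genuinely requires it.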


Table of fixes for the known issues above, released 2016.10.11, taken from the MS16-101 Security Bulletin:

Vista / W2K8: Re-install 3167679, re-released 2016.10.11.

Win7 / W2K8 R2: Install 3192391 (security only) or 3185330 (monthly rollup that includes security fixes).

WS12: Install 3192393 (security only) or 3185332 (monthly rollup that includes security fixes).

Win8.1 / WS12 R2: Install 3192392 (security only) or 3185331 (monthly rollup that includes security fixes).

Windows 10: For 1511, install 3192441 (Cumulative update for Windows 10 Version 1511: October 11, 2016). For 1607, install 3194798 (Cumulative update for Windows 10 Version 1607 and Windows Server 2016: October 11, 2016).

Troubleshooting

As I mentioned, this post is intended to supplement the documentation of the known issues in the MS16-101 KB articles and provide help and guidance for troubleshooting. It should help you identify which known issue you are experiencing and suggest a resolution for each case.

I have also included troubleshooting walkthroughs of some of the more complex example cases. We will start with the problem definition, and then look at the available logs and tools to identify a suitable resolution. The idea is to teach "how to fish", because there can be many different scenarios, and hopefully you can apply these techniques and use the log files documented here to help resolve the issues when needed.

Once you know the scenario you are dealing with, the next step is usually to collect some data on the server or client where the password change is occurring. For example, if you have a web server running a password change application that changes passwords on behalf of users, you will need to collect the logs there. If in doubt, collect the logs from all involved machines and then use the snippets in the examples to find the one doing the password change. Here are the helpful logs.

DATA COLLECTION

The same logs will help in all of the scenarios.

LOGS

1. SPNEGO debug log (LSASS.log)

To enable this log run the following commands from an elevated admin CMD prompt to set the below registry keys:

reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v SPMInfoLevel /t REG_DWORD /d 0xC03E3F /f
reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v LogToFile /t REG_DWORD /d 1 /f
reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v NegEventMask /t REG_DWORD /d 0xF /f


  • This will log Negotiate debug output to the %windir%\system32\lsass.log.
  • There is no need for reboot. The log is effective immediately.
  • Lsass.log is a text file that is easy to read with a text editor such as Wordpad.

2. Netlogon.log:

This log has been around for many years and is useful for troubleshooting DC LOCATOR traffic. It can be used together with a network trace to understand why the STATUS_NO_LOGON_SERVERS is being returned for the Kerberos password change attempt.

  • To enable Netlogon debug logging, run the following command from an elevated CMD prompt:

            nltest /dbflag:0x26FFFFFF

  • The resulting log is found in %windir%\debug\netlogon.log & netlogon.bak
  • There is no need for a reboot. The log is effective immediately. See also 109626 Enabling debug logging for the Net Logon service
  • The Netlogon.log (and Netlogon.bak) is a text file. Open the log with any text editor (I like good old Notepad.exe).

3. Collect a Network trace during the password change issue using the tool of your choice.
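If you do not have a capture tool handy, recent Windows versions include netsh trace, which can capture without installing anything. A sketch (the trace file path is just an example):

netsh trace start capture=yes tracefile=C:\temp\pwdchange.etl
    <reproduce the failed password change>
netsh trace stop

The resulting .etl file can then be opened in Microsoft Message Analyzer or converted for your preferred parser.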

Scenarios, Explanations and Walkthroughs:

When reading this you should keep in mind that you may be seeing more than one scenario. The best thing to do is to start with one, fix that and see if there are any other problems left.

1. Domain password change fails via CTRL+ALT+DEL

This is most likely a Kerberos DC locator failure of some kind where the password changes were relying on NTLM before installing MS16-101 and are now failing. This is the simplest and easiest case to resolve using basic Kerberos troubleshooting methods.

Solution: Fix Kerberos.

Some tips from cases which we saw:

1. Use the network trace to identify whether the necessary communication ports are open. This was quite a common issue, so start by checking it.

         In order for Kerberos password changes to work, communication on TCP port 464 needs to be open between the client doing the password change and the domain controller.

Note on RODCs: Read-only domain controllers (RODCs) can service password changes if the user is allowed by the RODC's password replication policy. Users who are not allowed by that policy require network connectivity to a read/write domain controller (RWDC) in the user account's domain to be able to change the password.

           To check whether TCP port 464 is open, follow these steps (also documented in KB3167679):

             a. Create an equivalent display filter for your network monitor parser. For example:

                            ipv4.address== <ip address of client> && tcp.port==464

             b. In the results, look for the “TCP:[SynReTransmit” frame.

If you find these, then investigate firewall and open ports. It is often useful to take a simultaneous trace from the client and the domain controller and check if the packets are arriving at the other end.
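A quick way to test this from the client (on Windows 8.1/WS12 R2 and later) is the Test-NetConnection PowerShell cmdlet; dc1.contoso.com below is a placeholder for your domain controller:

Test-NetConnection -ComputerName dc1.contoso.com -Port 464

If TcpTestSucceeded comes back False, a firewall or routing issue is blocking the Kerberos password change port.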

2. Make sure that the target Kerberos names are valid.

  • IP addresses are not valid Kerberos names.
  • Kerberos supports short names and fully qualified domain names, like CONTOSO or contoso.com.

3. Make sure that service principal names (SPNs) are registered correctly.

For more information on troubleshooting Kerberos see https://blogs.technet.microsoft.com/askds/2008/05/14/troubleshooting-kerberos-authentication-problems-name-resolution-issues/ or https://technet.microsoft.com/en-us/library/cc728430(v=ws.10).aspx
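The built-in setspn.exe tool is useful for the SPN check. For example, to list the SPNs registered on a DC's computer account, or to search the forest for a specific SPN (DC1 and the SPN below are placeholders):

setspn -L DC1
setspn -Q ldap/DC1.contoso.com

setspn -Q reports whether the SPN exists and which account holds it; missing or duplicate SPNs found this way are a common cause of Kerberos failures.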

2. Domain password change fails via application code with an INCORRECT/UNEXPECTED error code when a password that does not meet password complexity is entered.

For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION. After installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash.

Note: In this scenario, the password change succeeds when a correct new password is entered that complies with the password policy.

Cause:

This issue is caused by a code defect in ADSI whereby the status returned from Kerberos was not correctly returned to the caller by ADSI.
Here is a more detailed explanation for the geek in you:

Before MS16-101 behavior:

1. An application calls the ChangePassword method using the ADSI LDAP provider. (Setting and changing passwords with the ADSI LDAP provider is documented here.) Under the hood this calls Negotiate/Kerberos to change the password using a valid realm name. Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

2. A second ChangePassword call is made via the NetUserChangePassword API with the <dcname> as the realm name. This uses Negotiate and retries Kerberos, which fails with STATUS_NO_LOGON_SERVERS because a DC name is not a valid realm name.

3. Negotiate then retries over NTLM, which succeeds or returns the same failure status as before.

The password change fails if a bad password was entered, and the NTLM error code is returned to the application. If a valid password was entered, everything works, because the first ChangePassword call passes in a good name; if Kerberos works, the password change succeeds and you never reach step 3.

Post-MS16-101 behavior / why it fails with MS16-101 installed:

1. An application calls the ChangePassword method using the ADSI LDAP provider. This calls Negotiate for the password change with a valid realm name. Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

2. A second ChangePassword call is made via NetUserChangePassword with the <dcname> as the realm name, which fails over Kerberos with STATUS_NO_LOGON_SERVERS and triggers NTLM fallback.

3. Because NTLM fallback is blocked by MS16-101, the error STATUS_DOWNGRADE_DETECTED is returned to the calling application.

Solution: Easy. Install the October update, which fixes this issue. The fix lies in adsmsext.dll, included in the October updates.

Again, here are the updates you need to install, taken from the MS16-101 Security Bulletin:

Vista / W2K8: Re-install 3167679, re-released 2016.10.11.

Win7 / W2K8 R2: Install 3192391 (security only) or 3185330 (monthly rollup that includes security fixes).

WS12: Install 3192393 (security only) or 3185332 (monthly rollup that includes security fixes).

Win8.1 / WS12 R2: Install 3192392 (security only) or 3185331 (monthly rollup that includes security fixes).

Windows 10: For 1511, install 3192441 (Cumulative update for Windows 10 Version 1511: October 11, 2016). For 1607, install 3194798 (Cumulative update for Windows 10 Version 1607 and Windows Server 2016: October 11, 2016).

3. Local user account password change fails via CTRL+ALT+DEL or application code.

Installing October updates above should also resolve this.

MS16-101 had a defect where Negotiate did not correctly determine that the password change was local and would try to find a DC using the local machine as the domain name.

This failed and NTLM fallback was no longer allowed post MS16-101. Therefore, the password changes failed with STATUS_DOWNGRADE_DETECTED.

Example:

One scenario I saw in which password changes of local user accounts via CTRL+ALT+DEL failed with the message "The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you." was when the following group policy is set and you try to change the password of a local account:

Policy: Computer Configuration \ Administrative Templates \ System \ Logon \ "Assign a default domain for logon"
Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System
Setting: DefaultLogonDomain
Data type: REG_SZ
Value: "." (without quotes). The period, or "dot", designates the local machine name.

Cause: In this case, post MS16-101, Negotiate incorrectly determined that the account was not local and tried to discover a DC using \\<machinename> as the domain name, which failed. This caused the password change to fail with the STATUS_DOWNGRADE_DETECTED error.

Solution: Install October fixes listed in the table at the top of this post.

4. Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.

MS16-101 intentionally disabled changing the password of locked-out or disabled user accounts via Negotiate.

Important: Password RESET is not affected by MS16-101 in any scenario; only password change is. Therefore, any application that performs a password reset is unaffected by MS16-101.

Another important thing to note is that MS16-101 only affects applications using Negotiate. It is therefore still possible to change locked-out and disabled account passwords using other methods, such as LDAPS.

For example, the PowerShell cmdlet Set-ADAccountPassword will continue to work for locked-out and disabled account password changes because it does not use Negotiate.
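For illustration, a password change (not a reset) using that cmdlet looks like this. TestUser and the passwords are placeholders, and the ActiveDirectory PowerShell module must be installed:

Import-Module ActiveDirectory
$old = ConvertTo-SecureString "oldPassword!123" -AsPlainText -Force
$new = ConvertTo-SecureString "newPassword!123" -AsPlainText -Force
Set-ADAccountPassword -Identity TestUser -OldPassword $old -NewPassword $new

Specifying -OldPassword makes this a change; using the -Reset switch instead would perform a password reset, which MS16-101 does not affect either.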

5. Troubleshooting Domain password change failure via application code when a good password is entered.

This is one of the most difficult scenarios to identify and troubleshoot, so I have provided a more detailed example here, including sample code, the cause and the solution.

In summary, the solution in these cases is almost always to correct the application code, which may be passing in an invalid domain name such that Kerberos fails with STATUS_NO_LOGON_SERVERS.

Scenario:

An application uses the System.DirectoryServices.AccountManagement namespace to change a user's password.
https://msdn.microsoft.com/en-us/library/system.directoryservices.accountmanagement(v=vs.110).aspx

After installing MS16-101, password changes fail with STATUS_DOWNGRADE_DETECTED. Here is an example of a failing .NET code snippet (invoked from PowerShell) that worked before MS16-101:

<snip>

Add-Type -AssemblyName System.DirectoryServices.AccountManagement
$ct = [System.DirectoryServices.AccountManagement.ContextType]::Domain
$ctoptions = [System.DirectoryServices.AccountManagement.ContextOptions]::SimpleBind -bor [System.DirectoryServices.AccountManagement.ContextOptions]::ServerBind
$pc = New-Object System.DirectoryServices.AccountManagement.PrincipalContext($ct, "contoso.com", "OU=Accounts,DC=Contoso,DC=Com", $ctoptions)
$idType = [System.DirectoryServices.AccountManagement.IdentityType]::SamAccountName
$up = [System.DirectoryServices.AccountManagement.UserPrincipal]::FindByIdentity($pc, $idType, "TestUser")
$up.ChangePassword("oldPassword!123", "newPassword!123")

<snip>

Data Analysis

There are two possibilities here:
(a) The application code is passing an incorrect domain name parameter, causing the Kerberos password change to fail to locate a DC.
(b) The application code is good, and the Kerberos password change fails for another reason, such as a blocked port, a DNS issue or a missing SPN.

Let's start with (a): the application code is passing an incorrect domain name/parameter, causing the Kerberos password change to fail to locate a DC.

(a) Data Analysis Walkthrough Example based on a real case:

1. Start with Lsass.log (SPNEGO trace)

If you are troubleshooting a password change failure after MS16-101, look for the following text in Lsass.log; it indicates that Kerberos failed and NTLM fallback was forbidden by MS16-101:

Failing Example:

[ 9/13 10:23:36] 492.2448> SPM-WAPI: [11b0.1014] Dispatching API (Message 0)
[ 9/13 10:23:36] 492.2448> SPM-Trace: [11b0] LpcDispatch: dispatching ChangeAccountPassword (1a)
[ 9/13 10:23:36] 492.2448> SPM-Trace: [11b0] LpcChangeAccountPassword()
[ 9/13 10:23:36] 492.2448> SPM-Helpers: [11b0] LsapCopyFromClient(0000005EAB78C9D8, 000000DA664CE5E0, 16) = 0
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword:
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: NegoExtender
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: Kerberos
[ 9/13 10:23:36] 492.2448> SPM-Warning: Failed to change password for account Test: 0xc000005e
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: NTLM
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, NTLM failed: not allowed to change domain passwords
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, returning: 0xc0000388

  • 0xC000005E is STATUS_NO_LOGON_SERVERS
  • 0xC0000388 is STATUS_DOWNGRADE_DETECTED

If you see this, it means Kerberos failed to locate a domain controller in the domain, and fallback to NTLM is not allowed by MS16-101. Next, look at the Netlogon.log and the network trace to understand why.

2. Network trace

Look at the network trace and filter the traffic based on the client IP, DNS and any authentication-related traffic. You may see the client requesting a Kerberos ticket using an invalid SPN, like:

Client -> DC1: KerberosV5:TGS Request Realm: CONTOSO.COM Sname: ldap/contoso.com {TCP:45, IPv4:7}
DC1 -> Client: KerberosV5:KRB_ERROR – KDC_ERR_S_PRINCIPAL_UNKNOWN (7) {TCP:45, IPv4:7}

So here the client tried to get a ticket for the ldap/contoso.com SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN because this SPN is not registered anywhere.

  • This is expected. A valid LDAP SPN looks like ldap/DC1.contoso.com.

Next let’s check the Netlogon.log

3. Netlogon.log:

Open the log with any text editor (I like good old Notepad.exe) and check the following:

  • Is a valid domain name being passed to DC Locator?

Invalid names such as \\servername.contoso.com or an IP address like \\x.y.z.w will cause DC Locator to fail, and thus the Kerberos password change to return STATUS_NO_LOGON_SERVERS. Once that happens, NTLM fallback is not allowed and you get a failed password change.

If you find this issue, examine the application code and make the necessary changes to ensure a correctly formatted domain name is passed to the ChangePassword API being used.

Example of failure in Netlogon.log:

[MISC] [PID] DsGetDcName function called: client PID=1234, Dom:\\contoso.com Acct:(null) Flags: IP KDC
[MISC] [PID] DsGetDcName function returns 1212 (client PID=1234): Dom:\\contoso.com Acct:(null) Flags: IP KDC

\\contoso.com is not a valid domain name (contoso.com is a valid domain name).

This error translates to:

0x4BC / 1212 / ERROR_INVALID_DOMAINNAME (winerror.h): "The format of the specified domain name is invalid."

So what happened here?

The application code passed an invalid TargetName to Kerberos. It used the domain name as a server name, so we see the SPN ldap/contoso.com.

The client tried to get a ticket for this SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN because the SPN is not registered anywhere. As noted, this is expected: a valid LDAP SPN looks like ldap/DC1.contoso.com.

The application code then tried the password change again and passed in \\contoso.com as the domain name. Anything beginning with \\ is not a valid domain name, and neither is an IP address, so DC Locator fails to find a DC when given such a name. We can see this in the Netlogon.log and the network trace.
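You can reproduce what DC Locator does with a given name by running nltest on the client; contoso.com is a placeholder here:

nltest /dsgetdc:contoso.com
nltest /dsgetdc:\\contoso.com

The first command should return the name and address of a DC. The second, with the invalid \\-prefixed name, should fail in the same way as the DsGetDcName call logged above (ERROR_INVALID_DOMAINNAME).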

Conclusion and Solution

If the domain name is invalid here, examine the code snippet doing the password change to understand why the wrong name is passed in.

The fix in these cases is to change the code so that a valid domain name is passed to Kerberos, allowing the password change to happen over Kerberos rather than NTLM. NTLM is not secure; if Kerberos is possible, it should be the protocol used.

SOLUTION

The solution here was to remove "ContextOptions.ServerBind | ContextOptions.SimpleBind" and allow the code to use the default (Negotiate). Note that using a Domain context together with ServerBind is what caused the issue; Negotiate with a Domain context is the combination that works and is able to use Kerberos successfully.

Working code:

<snip>
Add-Type -AssemblyName System.DirectoryServices.AccountManagement
$ct = [System.DirectoryServices.AccountManagement.ContextType]::Domain
$pc = New-Object System.DirectoryServices.AccountManagement.PrincipalContext($ct, "contoso.com", "OU=Accounts,DC=Contoso,DC=Com")
$idType = [System.DirectoryServices.AccountManagement.IdentityType]::SamAccountName
$up = [System.DirectoryServices.AccountManagement.UserPrincipal]::FindByIdentity($pc, $idType, "TestUser")
$up.ChangePassword("oldPassword!123", "newPassword!123")

<snip>

Why does this code work before MS16-101 and fail after?

ContextOptions are documented here: https://msdn.microsoft.com/en-us/library/system.directoryservices.accountmanagement.contextoptions(v=vs.110).aspx

Specifically: “This parameter specifies the options that are used for binding to the server. The application can set multiple options that are linked with a bitwise OR operation. “

Passing in a domain name such as contoso.com with the ContextOptions ServerBind or SimpleBind causes the client to attempt to use an SPN like ldap/contoso.com, because it expects the name that is passed in to be a server name.

This is not a valid SPN and does not exist, so Kerberos fails with STATUS_NO_LOGON_SERVERS.
Before MS16-101, in this scenario, the Negotiate package would fall back to NTLM, attempt the password change over NTLM, and succeed.
Post MS16-101 this fallback is not allowed and Kerberos is enforced.

(b) The application code is good, but Kerberos fails to locate a DC for another reason

If you see a correct domain name and SPNs in the above logs, then Kerberos is failing for some other reason, such as blocked TCP ports. In this case, go back to scenario 1 to troubleshoot why Kerberos failed to locate a domain controller.

There is a chance you may have both (a) and (b). Traces and logs are the best tools to identify this.

Scenario 6: After you install the MS16-101 update, you may encounter 0xC0000022 NTLM authentication errors.

I will not go into detail on this scenario, as it is well described in KB3195799, "NTLM authentication fails with 0xC0000022 error for Windows Server 2012, Windows 8.1, and Windows Server 2012 R2 after update is applied."

That’s all for today! I hope you find this useful. I will update this post if any new information arises.

Linda Taylor | Senior Escalation Engineer | Windows Directory Services
(A well established member of the content police.)


Using Debugging Tools to Find Token and Session Leaks


Hello AskDS readers and Identity aficionados. Long time no blog.

Ryan Ries here, and today I have a relatively “hardcore” blog post that will not be for the faint of heart. However, it’s about an important topic.

The behavior surrounding security tokens and logon sessions has recently changed on all supported versions of Windows. IT professionals – developers and administrators alike – should understand what this new behavior is, how it can affect them, and how to troubleshoot it.

But first, a little background…

Figure 1 – Tokens

Windows uses security tokens (or access tokens) extensively to control access to system resources. Every thread running on the system uses a security token, and may own several at a time. Threads inherit the security tokens of their parent processes by default, but they may also use special security tokens that represent other identities in an activity known as impersonation. Since security tokens are used to grant access to resources, they should be treated as highly sensitive, because if a malicious user can gain access to someone else’s security token, they will be able to access resources that they would not normally be authorized to access.

Note: Here are some additional references you should read first if you want to know more about access tokens:

If you are an application developer, your application or service may want to create or duplicate tokens for the legitimate purpose of impersonating another user. A typical example would be a server application that wants to impersonate a client to verify that the client has permissions to access a file or database. The application or service must be diligent in how it handles these access tokens by releasing/destroying them as soon as they are no longer needed. If the code fails to call the CloseHandle function on a token handle, that token can then be “leaked” and remain in memory long after it is no longer needed.

And that brings us to Microsoft Security Bulletin MS16-111.

Here is an excerpt from that Security Bulletin:

Multiple Windows session object elevation of privilege vulnerabilities exist in the way that Windows handles session objects.

A locally authenticated attacker who successfully exploited the vulnerabilities could hijack the session of another user.
To exploit the vulnerabilities, the attacker could run a specially crafted application.
The update corrects how Windows handles session objects to prevent user session hijacking.

Those vulnerabilities were fixed with that update, and I won’t further expound on the “hacking/exploiting” aspect of this topic. We’re here to explore this from a debugging perspective.

This update is significant because it changes how the relationship between tokens and logon sessions is treated across all supported versions of Windows going forward. Applications and services that erroneously leak tokens have always been with us, but the penalty paid for leaking tokens is now greater than before. After MS16-111, when security tokens are leaked, the logon sessions associated with those security tokens also remain on the system until all associated tokens are closed… even after the user has logged off the system. If the tokens associated with a given logon session are never released, then the system now also has a permanent logon session leak as well. If this leak happens often enough, such as on a busy Remote Desktop/Terminal Server where users are logging on and off frequently, it can lead to resource exhaustion on the server, performance issues and denial of service, ultimately causing the system to require a reboot to be returned to service.

Therefore, it’s more important than ever to be able to identify the symptoms of token and session leaks, track down token leaks on your systems, and get your application vendors to fix them.

How Do I Know If My Server Has Leaks?

As mentioned earlier, this problem affects heavily-utilized Remote Desktop Session Host servers the most, because users are constantly logging on and logging off the server. The issue is not limited to Remote Desktop servers, but symptoms will be most obvious there.

Figuring out that you have logon session leaks is the easy part. Just run qwinsta at a command prompt:

Figure 2 – qwinsta

Pay close attention to the session ID numbers, and notice the large gap between session 2 and session 152. This is the clue that the server has a logon session leak problem. The next user who logs on will get session 153, the next session 154, then 155, and so on, but the session IDs will never be reused. There are about 150 “leaked” sessions in the screenshot above: no one is logged on to them, no one will ever be able to log on to them again (until a reboot,) yet they remain on the system indefinitely. This means each user who logs onto the system is inadvertently leaving tokens lying around in memory, probably because some application or service on the system duplicated the user’s token and didn’t release it. These leaked sessions will forever be unusable and soak up system resources, and the problem will only get worse as users continue to log on. In a healthy situation with no leaks, sessions 3-151 would have been destroyed after the users logged out, and the resources those sessions consumed would be reusable by subsequent logons.
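The gap check is easy to script. Below is a rough Python sketch that parses qwinsta-style output and flags session ID gaps; the column layout here is an assumption for illustration, and real qwinsta output varies by locale and session state, so adjust the parsing accordingly:

```python
# Spot leaked sessions by looking for large gaps in the session ID sequence.
# The sample text below imitates qwinsta output; it is not real output.

sample = """\
 SESSIONNAME       USERNAME        ID  STATE
 services                           0  Disc
 console           administrator    1  Active
 rdp-tcp#55        alice            2  Active
 rdp-tcp#56        bob            152  Active
"""

# Second-to-last whitespace-separated field is the session ID in this layout.
ids = sorted(int(line.split()[-2]) for line in sample.splitlines()[1:])
gaps = [(a, b) for a, b in zip(ids, ids[1:]) if b - a > 1]
print(gaps)  # [(2, 152)] -> 149 session IDs unaccounted for
```

A gap like `(2, 152)` is the same signal as in the screenshot: everything between those IDs leaked and will never be reused until a reboot.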

How Do I Find Out Who’s Responsible?

Now that you know you have a problem, next you need to track down the application or service that is responsible for leaking access tokens. When an access token is created, the token is associated to the logon session of the user who is represented by the token, and an internal reference count is incremented. The reference count is decremented whenever the token is destroyed. If the reference count never reaches zero, then the logon session is never destroyed or reused. Therefore, to resolve the logon session leak problem, you must resolve the underlying token leak problem(s). It’s an all-or-nothing deal. If you fix 10 token leaks in your code but miss 1, the logon session leak will still be present as if you had fixed none.

Before we proceed: I would recommend debugging this issue on a lab machine, rather than on a production machine. If you have a logon session leak problem on your production machine, but don’t know where it’s coming from, then install all the same software on a lab machine as you have on the production machine, and use that for your diagnostic efforts. You’ll see in just a second why you probably don’t want to do this in production.

The first step to tracking down the token leaks is to enable token leak tracking on the system.

Modify this registry setting:

HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Kernel
    SeTokenLeakDiag = 1 (DWORD)

The registry setting won’t exist by default unless you’ve done this before, so create it. It also did not exist prior to MS16-111, so don’t expect it to do anything if the system does not have MS16-111 installed. This registry setting enables extra accounting on token issuance that you will be able to detect in a debugger, and there may be a noticeable performance impact on busy servers. Therefore, it is not recommended to leave this setting in place unless you are actively debugging a problem. (i.e. don’t do it in production exhibit A.)

Prior to the existence of this registry setting, token leak tracing of this kind used to require a checked build of Windows. And Microsoft does not appear to be releasing a checked build of Server 2016, so… good timing.

Next, you need to configure the server to take a full or kernel memory dump when it crashes. (A live kernel debug may also be an option, but that is outside the scope of this article.) I recommend using DumpConfigurator to configure the computer for complete crash dumps. A kernel dump should be enough to see most of what we need, but get a Complete dump if you can.

Figure 3 – DumpConfigurator

Then reboot the server for the settings to take effect.

Next, you need users to log on and off the server, so that the logon session IDs continue to climb. Since you’re doing this in a lab environment, you might want to use a script to automatically logon and logoff a set of test users. (I provided a sample script for you here.) Make sure you’ve waited 10 minutes after the users have logged off to verify that their logon sessions are permanently leaked before proceeding.

Finally, crash the box. Yep, just crash it. (i.e. don’t do it in production exhibit B.) On a physical machine, this can be done by holding the right Ctrl key and pressing Scroll Lock twice, if you configured the appropriate setting with DumpConfigurator earlier. If this is a Hyper-V machine, you can use the following PowerShell cmdlet on the Hyper-V host:

Debug-VM -VM (Get-VM RDS1) -InjectNonMaskableInterrupt

You may have at your disposal other means of getting a non-maskable interrupt to the machine, such as an out-of-band management card (iLO/DRAC, etc.,) but the point is to deliver an NMI to the machine, and it will bugcheck and generate a memory dump.

Now transfer the memory dump file (C:\Windows\Memory.dmp usually) to whatever workstation you will use to perform your analysis.

Note: Memory dumps may contain sensitive information, such as passwords, so be mindful when sharing them with strangers.

Next, install the Windows Debugging Tools on your workstation if they’re not already installed. I downloaded mine for this demo from the Windows Insider Preview SDK here. But they also come with the SDK, the WDK, WPT, Visual Studio, etc. The more recent the version, the better.

Next, download the MEX Debugging Extension for WinDbg. Engineers within Microsoft have been using the MEX debugger extension for years, but only recently has a public version of the extension been made available. The public version is stripped-down compared to the internal version, but it’s still quite useful. Unpack the file and place mex.dll into your C:\Debuggers\winext directory, or wherever you installed WinDbg.

Now, ensure that your symbol path is configured correctly to use the Microsoft public symbol server within WinDbg:

Figure 4 – Example Symbol Path in WinDbg

The example symbol path above tells WinDbg to download symbols from the specified URL, and store them in your local C:\Symbols directory.

Finally, you are ready to open your crash dump in WinDbg:

Figure 5 – Open Crash Dump from WinDbg

After opening the crash dump, the first thing you’ll want to do is load the MEX debugging extension that you downloaded earlier, by typing the command:

Figure 6 – .load mex

The next thing you probably want to do is start a log file. It will record everything that goes on during this debugging session, so that you can refer to it later in case you forgot what you did or where you left off.

Figure 7 – !logopen

Another useful command that is among the first things I always run is !DumpInfo, abbreviated !di, which simply gives some useful basic information about the memory dump itself, so that you can verify at a glance that you’ve got the correct dump file, which machine it came from and what type of memory dump it is.

Figure 8 – !DumpInfo

You’re ready to start debugging.

At this point, I have good news and I have bad news.

The good news is that there already exists a super-handy debugger extension that lists all the logon session kernel objects, their associated token reference counts, what process was responsible for creating the token, and even the token creation stack, all with a single command! It’s !kdexts.logonsession, and it is awesome.

The bad news is that it doesn’t work… not with public symbols. It only works with private symbols. Here is what it looks like with public symbols:

Figure 9 – !kdexts.logonsession – public symbols lead to lackluster output

As you can see, most of the useful stuff is zeroed out.

Since public symbols are all you have unless you work at Microsoft, (and we wish you did,) I’m going to teach you how to do what !kdexts.logonsession does, manually. The hard way. Plus some extra stuff. Buckle up.

First, you should verify whether token leak tracking was turned on when this dump was taken. (That was the registry setting mentioned earlier.)

Figure 10 – x nt!SeTokenLeakTracking = <no type information>

OK… That was not very useful. We’re getting <no type information> because we’re using public symbols. But this symbol corresponds to the SeTokenLeakDiag registry setting that we configured earlier, and we know that’s just 0 or 1, so we can just guess what type it is:

Figure 11 – db nt!SeTokenLeakTracking L1

The db command means “dump bytes.” (dd, or “dump DWORDs,” would have worked just as well.) You should have a symbol for nt!SeTokenLeakTracking if you configured your symbol path properly, and the L1 tells the debugger to just dump the first byte it finds. It should be either 0 or 1. If it’s 0, then the registry setting that we talked about earlier was not set properly, and you can basically just discard this dump file and get a new one. If it’s 1, you’re in business and may proceed.

Next, you need to locate the logon session lists.

Figure 12 – dp nt!SepLogonSessions L1

Like the previous step, dp means “display pointer,” then the name of the symbol, and L1 to just display a single pointer. The 64-bit value on the right is the pointer, and the 64-bit value on the left is the memory address of that pointer.

Now we know where our lists of logon sessions begin. (Lists, plural.)

The SepLogonSessions pointer points to not just a list, but an array of lists. These lists are made up of _SEP_LOGON_SESSION_REFERENCES structures.

Using the dps command (display pointer-sized values, resolving them to symbols where possible) and specifying the beginning of the array that we got from the last step, we can now see where each of the lists in the array begins:

Figure 13 – dps 0xffffb808`3ea02650 – displaying pointers that point to the beginning of each list in the array

If there were not very many logon sessions on the system when the memory dump was taken, you might notice that not all the lists are populated:

Figure 14 – Some of the logon session lists are empty because not very many users had logged on in this example

The array doesn’t fill up contiguously, which is a bummer. You’ll have to skip over the empty lists.
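Conceptually, the whole walk looks like the following Python sketch (a model with made-up names, not real debugger output): visit every slot in the array, skip the empty ones, and follow each list's Next pointers to the end, counting sessions as you go:

```python
# Model of the SepLogonSessions layout: an array of list heads, where empty
# slots are NULL (None here), and each list is singly linked via "Next".

class SessionNode:
    def __init__(self, account, next_node=None):
        self.account = account
        self.next = next_node

def count_sessions(array_of_lists):
    total = 0
    for head in array_of_lists:
        node = head                  # None slots fall through the loop below
        while node is not None:
            total += 1
            node = node.next
    return total

lists = [
    SessionNode("SYSTEM", SessionNode("alice")),
    None,                            # the array does not fill contiguously
    SessionNode("bob"),
]
print(count_sessions(lists))  # 3
```

The nested loop is exactly what the dps-plus-dt command combination is doing for you against kernel memory.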

If we wanted to walk just the first list in the array (we’ll talk more about dt and linked lists in just a minute,) it would look something like this:

Figure 15 – Walking the first list in the array and using !grep to filter the output

Notice that I used the !grep command to filter the output for the sake of brevity and readability. It’s part of the Mex debugger extension. I told you it was handy. If you omit the !grep AccountName part, you would get the full, unfiltered output. I chose “AccountName” arbitrarily as a keyword because I knew that was a word that was unique to each element in the list. !grep will only display lines that contain the keyword(s) that you specify.

Next, if we wanted to walk through the entire array of lists all at once, it might look something like this:

Figure 16 – Walking through the entire array of lists!

OK, I realize that I just went bananas there, but I’ll explain what just happened step-by-step.

When you are using the Mex debugger extension, you have access to many new text parsing and filtering commands that can truly enhance your debugging experience. When you look at a long command like the one I just showed, read it from right to left. The commands on the right are fed into the command to their left.

So from right to left, let’s start with !cut -f 2 dps ffffb808`3ea02650

We already showed what the dps <address> command did earlier. The !cut -f 2 command filters that command’s output so that it only displays the second part of each line separated by whitespace. So essentially, it will display only the pointers themselves, and not their memory addresses.

Like this:

Figure 17 – Using !cut to select just the second token in each line of output

Then that is “piped” line-by-line into the next command to the left, which was:

!fel -x "dt nt!_SEP_LOGON_SESSION_REFERENCES @#Line -l Next"

!fel is an abbreviation for !foreachline.

This command instructs the debugger to execute the given command for each line of output supplied by the previous command, where the @#Line pseudo-variable represents the individual line of output. For each line of output that came from the dps command, we are going to use the dt command with the -l parameter to walk that list. (More on walking lists in just a second.)

Next, we use the !grep command to filter all of that output so that only a single unique line is shown from each list element, as I showed earlier.

Finally, we use the !count -q command to suppress all of the output generated up to that point, and instead only tell us how many lines of output it would have generated. This should be the total number of logon sessions on the system.
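If the Mex plumbing still feels opaque, the same pipeline rendered in plain Python looks like the sketch below. The dps output and the per-pointer "dt" results are fabricated here purely to show the data flow: cut field 2, run a command for each line, grep for one unique line per element, then count:

```python
# A plain-Python rendering of the Mex pipeline: !cut -> !fel -> !grep -> !count.
# All addresses and structure output below are made up for illustration.

dps_output = [
    "ffffb808`3ea02650  ffffb808`500bdba0",
    "ffffb808`3ea02658  ffffb808`500be010",
]

def fake_dt_walk(pointer):
    # Stand-in for: dt nt!_SEP_LOGON_SESSION_REFERENCES <ptr> -l Next
    return [f"+0x040 AccountName : session at {pointer}",
            "+0x050 AuthorityName : CONTOSO"]

pointers = [line.split()[1] for line in dps_output]            # !cut -f 2
all_lines = [l for p in pointers for l in fake_dt_walk(p)]     # !fel -x "dt ..."
account_lines = [l for l in all_lines if "AccountName" in l]   # !grep AccountName
print(len(account_lines))                                      # !count -q -> 2
```

Because "AccountName" appears exactly once per list element, the final count equals the number of logon sessions, which is why the debugger version of this pipeline reported the session total.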

And 380 was in fact the exact number of logon sessions on the computer when I collected this memory dump. (Refer to Figure 16.)

Alright… now let’s take a deep breath and a step back. We just walked an entire array of lists of structures with a single line of commands. But now we need to zoom in and take a closer look at the data structures contained within those lists.

Remember, ffffb808`3ea02650 was the very beginning of the entire array.

Let’s examine just the very first _SEP_LOGON_SESSION_REFERENCES entry of the first list, to see what such a structure looks like:

Figure 18 – dt _SEP_LOGON_SESSION_REFERENCES* ffffb808`3ea02650

That’s a logon session!

Let’s go over a few of the basic fields in this structure. (Skipping some of the more advanced ones.)

  • Next: This is a pointer to the next element in the list. You might notice that there’s a “Next,” but there’s no “Previous.” So, you can only walk the list in one direction. This is a singly-linked list.
  • LogonId: Every logon gets a unique one. For example, “0x3e7” is always the “System” logon.
  • ReferenceCount: This is how many outstanding token references this logon session has. This is the number that must reach zero before the logon session can be destroyed. In our example, it’s 4.
  • AccountName: The user who does or used to occupy this session.
  • AuthorityName: Will be the user’s Active Directory domain, typically. Or the computer name if it’s a local account.
  • TokenList: This is a doubly or circularly-linked list of the tokens that are associated with this logon session. The number of tokens in this list should match the ReferenceCount.

The following is an illustration of a doubly-linked list:

Figure 19 – Doubly or circularly-linked list

“Flink” stands for Forward Link, and “Blink” stands for Back Link.

So now that we understand that the TokenList member of the _SEP_LOGON_SESSION_REFERENCES structure is a linked list, here is how you walk that list:

Figure 20 – dt nt!_LIST_ENTRY* 0xffffb808`500bdba0+0x0b0 -l Flink

The dt command stands for “display type,” followed by the symbol name of the type that you want to cast the following address to. The reason why we specified the address 0xffffb808`500bdba0 is because that is the address of the _SEP_LOGON_SESSION_REFERENCES object that we found earlier. The reason why we added +0x0b0 after the memory address is because that is the offset from the beginning of the structure at which the TokenList field begins. The -l parameter specifies that we’re trying to walk a list, and finally you must specify a field name (Flink in this case) that tells the debugger which field to use to navigate to the next node in the list.

We walked a list of tokens and what did we get? A list head and 4 data nodes, 5 entries total, which lines up with the ReferenceCount of 4 tokens that we saw earlier. One of the nodes won’t have any data – that’s the list head.
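Here is the same walk modeled in Python: a LIST_ENTRY-style circular doubly-linked list where the head carries no payload, so four token references means five entries total. This is a sketch with invented names, not kernel code:

```python
# Model of a circular doubly-linked LIST_ENTRY list: the head is part of the
# ring but holds no payload, so counting stops when we arrive back at it.

class ListEntry:
    def __init__(self):
        self.flink = self           # an empty list points at itself
        self.blink = self

def insert_tail(head, node):
    # Classic LIST_ENTRY insertion: splice the node in just before the head.
    node.blink = head.blink
    node.flink = head
    head.blink.flink = node
    head.blink = node

def count_nodes(head):
    count, entry = 0, head.flink
    while entry is not head:        # the head itself carries no token
        count += 1
        entry = entry.flink
    return count

head = ListEntry()
for _ in range(4):                  # four outstanding token references
    insert_tail(head, ListEntry())

print(count_nodes(head))  # 4 data nodes (5 entries total including the head)
```

This mirrors what the dt -l Flink command just showed: five entries in the ring, four of which are real token nodes matching the ReferenceCount.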

Now, for each entry in the linked list, we can examine its data. We know the payloads that these list nodes carry are tokens, so we can use dt to cast them as such:

Figure 21 – dt _TOKEN*0xffffb808`4f565f40+8+8 – Examining the first token in the list

The reason for the +8+8 on the end is because that’s the offset of the payload. It’s just after the Flink and Blink as shown in Figure 19. You want to skip over them.
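The arithmetic is simple enough to sanity-check in Python: on x64, two 8-byte pointers (Flink and Blink) precede the payload, so the token starts 0x10 bytes past the list entry. The address below is just the example value from above:

```python
# Why "+8+8": on x64 a _LIST_ENTRY is two 8-byte pointers (Flink, Blink),
# so the payload that follows it starts 0x10 bytes past the entry's address.

POINTER_SIZE = 8
flink_offset = 0
blink_offset = flink_offset + POINTER_SIZE
payload_offset = blink_offset + POINTER_SIZE      # 0x10

list_entry_address = 0xFFFFB8084F565F40           # example address from the dump
token_address = list_entry_address + payload_offset
print(hex(token_address))  # 0xffffb8084f565f50
```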

We can see that this token is associated to SessionId 0x136/0n310. (Remember I had 380 leaked sessions in this dump.) If you examine the UserAndGroups member by clicking on its DML (click the link,) you can then use !sid to see the SID of the user this token represents:

Figure 22 – Using !sid to see the security identifier in the token

The token also has a DiagnosticInfo structure, which is super-interesting, and is the coolest thing that we unlocked when we set the SeTokenLeakDiag registry setting on the machine earlier. Let’s look at it:

Figure 23 – Examining the DiagnosticInfo structure of the first token

We now have the process ID and the thread ID that was responsible for creating this token! We could examine the ImageFileName, or we could use the ProcessCid to see who it is:

Figure 24 – Using !mex.tasklist to find a process by its PID

Oh… Whoops. Looks like this particular token leak is lsass’s fault. You’re just going to have to let the *ahem* application vendor take care of that one.

Let’s move on to a different token leak. We’re moving on to a different memory dump file as well, so the memory addresses are going to be different from here on out.

I created a special token-leaking application specifically for this article. It looks like this:

Figure 25 – RyansTokenGrabber.exe

It monitors the system for users logging on, and as soon as they do, it duplicates their token via the DuplicateToken API call. I purposely never release those tokens, so if I collect a memory dump of the machine while this is running, then evidence of the leak should be visible in the dump, using the same steps as before.

Using the same debugging techniques I just demonstrated, I verified that I have leaked logon sessions in this memory dump as well, and each leaked session has an access token reference that looks like this:

Figure 26 – A _TOKEN structure shown with its attached DiagnosticInfo

And then by looking at the token’s DiagnosticInfo, we find that the guilty party responsible for leaking this token is indeed RyansTokenGrabber.exe:

Figure 27 – The process responsible for leaking this token

By this point you know who to blame, and now you can go find the author of RyansTokenGrabber.exe, and show them the stone-cold evidence that you’ve collected about how their application is leaking access tokens, leading to logon session leaks, causing you to have to reboot your server every few days, which is a ridiculous and inconvenient thing to have to do, and you shouldn’t stand for it!

We’re almost done, but I have one last trick to show you.

If you examine the StackTrace member of the token’s DiagnosticInfo, you’ll see something like this:

Figure 28 – DiagnosticInfo.CreateTrace

This is a stack trace. It’s a snapshot of all the function calls that led up to this token’s creation. These stack traces grow upward, so the function at the top of the stack was called last. But the function addresses are not resolving, so we must do a little more work to figure out the names of the functions.

First, clean up the output of the stack trace:

Figure 29 – Using !grep and !cut to clean up the output

Now, using all the snazzy new Mex magic you’ve learned, see if you can unassemble (that’s the u command) each address to see if it resolves to a function name:

Figure 30 – Unassemble instructions at each address in the stack trace

The output continues beyond what I’ve shown above, but you get the idea.

The function on top of the trace will almost always be SepDuplicateToken, but could also be SepCreateToken or SepFilterToken, and whether one creation method was used versus another could be a big hint as to where in the program’s code to start searching for the token leak. You will find that the usefulness of these stacks will vary wildly from one scenario to the next, as things like inlined functions, lack of symbols, unloaded modules, and managed code all influence the integrity of the stack. However, you (or the developer of the application you’re using) can use this information to figure out where the token is being created in this program, and fix the leak.

Alright, that’s it. If you’re still reading this, then… thank you for hanging in there. I know this wasn’t exactly a light read.

And lastly, allow me to reiterate that this is not just a contrived, unrealistic scenario; there’s a lot of software out there on the market that does this kind of thing. And if you happen to write such software, then I really hope you read this blog post. It may help you improve the quality of your software in the future. Windows needs application developers to be “good citizens” and avoid writing software with the ability to destabilize the operating system. Hopefully this blog post helps someone out there do just that.

Until next time,
Ryan “Too Many Tokens” Ries

Active Directory Experts: apply within

Hi all! Justin Turner here from the Directory Services team with a brief announcement: We are hiring!

Would you like to join the U.S. Directory Services team and work on the most technically challenging and interesting Active Directory problems? Do you want to be the next Ned Pyle or Linda Taylor?

Then read more…

We are an escalation team based out of Irving, Texas; Charlotte, North Carolina; and Fargo, North Dakota. We work with enterprise customers helping them resolve the most critical Active Directory infrastructure problems as well as enabling them to get the best of Microsoft Windows and Identity-related technologies. The work we do is no ordinary support – we work with a huge variety of customer environments and there are rarely two problems which are the same.

You will need strong AD knowledge and troubleshooting skills, along with great collaboration, teamwork, and customer service skills.

If this sounds like you, please apply here:

Irving, Texas (Las Colinas):
https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=2&jid=290399&jlang=en&pp=ss

Charlotte, North Carolina:
https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=1&jid=290398&jlang=en&pp=ss

U.S. citizenship is required for these positions.


Justin

Introducing Lingering Object Liquidator v2

Greetings again AskDS!

Ryan Ries here. Got something exciting to talk about.

You might be familiar with the original Lingering Object Liquidator tool that was released a few years ago.

Today, we’re proud to announce version 2 of Lingering Object Liquidator!

Because Justin’s blog post from 2014 covers the fundamentals of what lingering objects are so well, I don’t think I need to go over it again here. If you need to know what lingering objects in Active Directory are, and why you want to get rid of them, then please go read that post first.

The new version of the Lingering Object Liquidator tool began its life as an attempt to address some of the long-standing limitations of the old version. For example, the old version would just stop the entire scan when it encountered a single domain controller that was unreachable. The new version will just skip the unreachable DC and continue scanning the other DCs that are reachable. There are multiple other improvements in the tool as well, such as multithreading and more exhaustive logging.

Before we take a look at the new tool, there are some things you should know:

1) Lingering Object Liquidator – neither the old version nor the new one – is supported by CSS Support Engineers. A small group of us (including yours truly) have provided this tool as a convenience to you, but it comes with no guarantees. If you find a problem with the tool, or have a feature request, drop a line to the public AskDS email address, or submit feedback to the Windows Server UserVoice forum, but please don’t bother Support Engineers with it on a support case.

2) Don’t immediately go into your production Active Directory forest and start wildly deleting things just because they show up as lingering objects in the tool. Please carefully review and consider any AD objects that are reported to be lingering objects before deleting.

3) The tool may report some false positives for deleted objects that are very close to the garbage collection age. To mitigate this issue, you can manually initiate garbage collection on your domain controllers before using this tool. (We may add this so the tool does it automatically in the future.)

4) The tool will continue to evolve and improve based on your feedback! Contact the AskDS alias or the Uservoice forum linked to in #1 above with any questions, concerns, bug reports or feature requests.

Graphical User Interface Elements

Let’s begin by looking at the graphical user interface. Below is a legend that explains each UI element:

Lingering Object Liquidator v2

A) “Help/About” label. Click this and a page should open up in your default web browser with extra information and detail regarding Lingering Object Liquidator.

B) “Check for Updates” label. Click this and the tool will check for a newer version than the one you’re currently running.

C) “Detect AD Topology” button. This is the first button that should be clicked in most scenarios. The AD Topology must be generated first, before proceeding on to the later phases of lingering object detection and removal.

D) “Naming Context” drop-down menu. (Naming Contexts are sometimes referred to as partitions.) Note that this drop-down menu is only available after AD Topology has been successfully discovered. It contains each Active Directory naming context in the forest. If you know precisely which Active Directory Naming context that you want to scan for lingering objects, you can select it from this menu. (Note: The Schema partition is omitted because it does not support deletion, so in theory it cannot contain lingering objects.) If you do not know which naming contexts may contain lingering objects, you can select the “[Scan All NCs]” option and the tool will scan each Naming Context that it was able to discover during the AD Topology phase.

E) “Reference DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The reference DC is the “known-good” DC against which you will compare other domain controllers for lingering objects. If a domain controller contains AD objects that do not exist on the Reference DC, they will be considered lingering objects. If you select the “[Scan Entire Forest]” option, then the tool will (arbitrarily) select one global catalog from each domain in the forest that is known to be reachable. It is recommended that you wisely choose a known-good DC yourself, because the tool doesn’t necessarily know “the best” reference DC to pick. It will pick one at random.

F) “Target DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The Target DC is the domain controller that is suspected of containing lingering objects. The Target DC will be compared against the Reference DC, and each object that exists on the Target DC but not on the Reference DC is considered a lingering object. If you aren’t sure which DC(s) contain lingering objects, or just want to scan all domain controllers, select the “[Target All DCs]” option from the drop-down menu.

G) “Detect Lingering Objects” button. Note that this button is only available after AD Topology has been successfully discovered. After you have made the appropriate selections in the three aforementioned drop-down menus, click the Detect Lingering Objects button to run the scan. Clicking this button only runs a scan; it does not delete anything. The tool will automatically detect and avoid certain nonsensical situations, such as the user specifying the same Reference and Target DCs, or selecting a Read-Only Domain Controller (RODC) as a Reference DC.

H) “Select All” button. Note that this button does not become available until after lingering objects have been detected. Clicking it merely selects all rows from the table below.

I) “Remove Selected Lingering Objects” button. This button will attempt to delete all lingering objects that have been detected by the detection process. You can select a range of items from the list using the shift key and the arrow keys. You can select and unselect specific items by holding down the control key and clicking on them. If you want to just select all items, click the “Select All” button.

J) “Removal Method” radio buttons. These are mutually exclusive. You can choose which of the two supported methods you want to use to remove the lingering objects that have been detected. The “removeLingeringObject” method refers to the rootDSE modify operation, which can be used to “spot-remove” individual lingering objects. In contrast, the DsReplicaVerifyObjects method will remove all lingering objects all at once. This intention is reflected in the GUI by all lingering objects automatically being selected when the DsReplicaVerifyObjects method is chosen.

K) “Import” button. This imports a previously-exported list of lingering objects.

L) “Export” button. This exports the selected lingering objects to a file.

M) The “Target DC Column.” This column tells you which Target DC was seen to have contained the lingering object.

N) The “Reference DC Column.” This column tells you which Reference DC was used to determine that the object in question was lingering.

O) The “Object DN Column.” This column contains the distinguished name of the lingering object.

P) The “Naming Context Column.” This column contains the Naming Context that the lingering object resides in.

Q) The “Lingering Object ListView”. This “ListView” works similarly to a spreadsheet. It will display all lingering objects that were detected. You can think of each row as a lingering object. You can click on the column headers to sort the rows in ascending or descending order, and you can resize the columns to fit your needs. NOTE: If you right-click on the lingering object listview, the selected lingering objects (if any) will be copied to your clipboard.

R) The “Status” box. The status box contains diagnostics and operational messages from the Lingering Object Liquidator tool. Everything that is logged to the status box in the GUI is also mirrored to a text log file.

User-configurable Settings

 

The user-configurable settings in Lingering Object Liquidator are alluded to in the Status box when the application first starts.

Key: HKLM\SOFTWARE\LingeringObjectLiquidator
Value: DetectionTimeoutPerDCSeconds
Type: DWORD
Default: 300 seconds

This setting affects the “Detect Lingering Objects” scan. Lingering Object Liquidator establishes event log “subscriptions” to each target DC that it needs to scan. The tool then waits for the DC to log an event (Event ID 1942 in the Directory Service event log) signaling that lingering object detection has completed for a specific naming context. Only once a certain number of those events (depending on your choices in the “Naming Contexts” drop-down menu) have been received from the remote domain controller does the tool know that the domain controller has been fully scanned. However, there is an overall timeout, and if the tool does not receive the requisite number of Event ID 1942s in the allotted time, it “gives up” and proceeds to the next domain controller.

Key: HKLM\SOFTWARE\LingeringObjectLiquidator
Value: ThreadCount
Type: DWORD
Default: 8 threads

This setting controls the maximum number of threads used during the “Detect Lingering Objects” scan. Using more threads may decrease the overall time it takes to complete a scan, especially in very large environments.
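If you need to change either value, both settings can be applied from a .reg file. A minimal sketch (the 600-second timeout and 16-thread count shown here are illustrative values, not recommendations):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\LingeringObjectLiquidator]
; 600 seconds (0x258) instead of the 300-second default
"DetectionTimeoutPerDCSeconds"=dword:00000258
; 16 threads (0x10) instead of the 8-thread default
"ThreadCount"=dword:00000010
```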

Tips

The domain controllers must allow the network connectivity required for remote event log management for Lingering Object Liquidator to work. You can enable the required Windows Firewall rules using the following line of PowerShell:

Get-NetFirewallRule -DisplayName "Remote Event Log*" | Enable-NetFirewallRule

Check the download site often for new versions! (There’s also a handy “check for updates” option in the tool.)

 

Final Words

We provide this tool because we at AskDS want your Active Directory lingering object removal experience to go as smoothly as possible. If you find any bugs or have any feature requests, please drop a note to our public contact alias.

Download: http://aka.ms/msftlol

 

Release Notes

v2.0.19:
– Initial release to the public.

v2.0.21:
– Added new radio buttons that allow the user more control over which lingering object removal method they want to use – the DsReplicaVerifyObjects method or removeLingeringObject method.
– Fixed issue with Export button not displaying the full path of the export file.
– Fixed crash when unexpected or corrupted data is returned from event log subscription.

ESE Deep Dive: Part 1: The Anatomy of an ESE database


Hi!

Get your crash helmets on and strap into your seatbelts for a JET engine / ESE database special…

This is Linda Taylor, Senior AD Escalation Engineer from the UK, here again. And WAIT…… I also somehow managed to persuade Brett Shirley to join me in this post. Brett is a Principal Software Engineer on the ESE development team, so you can be sure the information in this post is going to be deep and confusing, but really interesting and useful, and the kind you cannot find anywhere else :- )
BTW, Brett used to write blogs before he grew up and got very busy. And just for fun, you might find this old “Brett” classic entertaining. I have never forgotten it. :- )
Back to today’s post… this will be a rather more grown-up post, although we will talk about DITs, but in a very scientific fashion.

In this post, we will start from the ground up and dive deep into the overall file format of an ESE database file including practical skills with esentutl such as how to look at raw database pages. And as the title suggests this is Part1 so there will be more!

What is an ESE database?

Let’s start with the basics. The Extensible Storage Engine (ESE), also known as JET Blue, is a database engine from Microsoft that does not speak SQL. And Brett adds: for those with a historical bent, or from academia, who remember ‘before SQL’ rather than ‘NoSQL’, ESE is modelled after the ISAMs (indexed sequential access methods) that were in vogue in the mid-70s. ;-p
If you work with Active Directory (which you must do if you are reading this post 🙂 then you will (I hope!) know that it uses an ESE database. The respective binary is esent.dll (or, because Brett loves Exchange, ese.dll for the Exchange Server install). Applications like Active Directory are all ESE clients and use the JET APIs to access the ESE database.

[Diagram: applications using the JET APIs to talk to the ESE database engine]

This post will dive deep into the Blue parts above: the ESE side of things. AD is one huge client of ESE, but there are many other Windows components (and non-Microsoft software too) which use an ESE database, so your knowledge in this area is applicable to those other areas as well. Some examples are below:

[Table: examples of Windows components that use ESE databases]

Tools

There are several built-in command line tools for looking into an ESE database and related files. 

  1. esentutl. This is a tool that ships in Windows Server by default for use with Active Directory, Certificate Authority and any other built-in ESE databases. This is what we will be using in this post, and it can be used to look at any ESE database.

  2. eseutil. This is the Exchange version of the same and is typically installed in the Microsoft\Exchange\V15\Bin sub-directory of the Program Files directory.

  3. ntdsutil. This is a tool specifically for managing AD or ADLDS databases; it cannot be used with generic ESE databases (such as the one produced by the Certificate Authority service). It is installed by default when you add the AD DS or ADLDS role.

For read operations such as dumping file or log headers, it doesn’t matter which tool you use. But for operations which write to the database you MUST use the matching tool for the application and version (for instance, it is not safe to run esentutl /r from Windows Server 2016 on a Windows Server 2008 DB). Further, throughout this article, if you are looking at an Exchange database you should use eseutil.exe instead of esentutl.exe. For AD and ADLDS, always use ntdsutil or esentutl; they have different capabilities, so I use a mixture of both. And Brett says that if you think you cannot keep the read operations straight from the write operations, play it safe and match the versions and application.

During this post, we will use an AD database as our example victim. We may use other ones, like ADLDS, for variety in later posts.

Database logical format – Tables

Let’s start with the logical format. From a logical perspective, an ESE database is a set of tables which have rows and columns and indices.

Below is a visual of the list of tables from an AD database in Windows Server 2016. Different ESE databases will have different table names and use those tables in their own ways.

[Screenshot: list of tables in a Windows Server 2016 AD database]

In this post, we won’t go into detail about the DNTs, PDNTs and how to analyze an AD database dump taken with LDP, because that is AD-specific and here we are going to look at the ESE level. Also, there are other blogs and sources where this has already been explained, for example here on AskPFEPlat. However, if such a post is wanted, tell me and I will endeavor to write one!

It is also worth noting that all ESE databases have a table called MSysObjects, as well as MSysObjectsShadow, which is a backup of MSysObjects. These are also known as “the catalog” of the database, and they store metadata about the client’s schema of the database, i.e.:

  1. All the tables and their table names and where their associated B+ trees start in the database and other miscellaneous metadata.

  2. All the columns for each table and their names (of course), the type of data stored in them, and various schema constraints.

  3. All the indexes on the tables and their names, and where their associated B+ trees start in the database.

This is the boot-strap information for ESE to be able to service client requests for opening tables to eventually retrieve rows of data.

Database physical format

From a physical perspective, an ESE database is just a file on disk. It is a collection of fixed-size pages arranged into B+ tree structures. Every database has its page size stamped in the header (and it can vary between clients; AD uses 8 KB). At a high level it looks like this:

[Diagram: database file layout – header (H), shadow header (SH), then numbered pages]

The first “page” is the Header (H).

The second “page” is a Shadow Header (SH) which is a copy of the header.

However, in ESE “page number” (also frequently abbreviated “pgno”) has a very specific meaning (and often shows up in ESE events): the first NUMBERED page of the actual database is page number / pgno 1, but it is actually the third “page” (if you are counting from the beginning :-).

From here on out, though, we will not count the header and shadow header as proper pages, and page number 1 will be the third page, at byte offset = <page size> * 2 = 8192 * 2 (for AD databases).
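To make the offset arithmetic concrete, here is a tiny sketch (assuming the 8 KB page size AD uses) of where a given pgno lives in the file:

```python
def page_byte_offset(pgno: int, page_size: int = 8192) -> int:
    """Return the byte offset of a numbered ESE database page.

    The header and shadow header occupy the first two page-sized slots,
    so pgno 1 starts at byte offset page_size * 2.
    """
    return page_size * (pgno + 1)

# pgno 1 is the third "page" in the file: 8192 * 2 = 16384
assert page_byte_offset(1) == 16384
```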

If you don’t know the page size, you can dump the database header with esentutl /mh.

Here is a dump of the header for an NTDS.DIT file – the AD database:

[Screenshot: esentutl /mh header dump of an NTDS.DIT file]

The page size is the cbDbPage value. AD and ADLDS use a page size of 8k. Other databases use different page sizes.

A caveat is that to be able to do this, the database must not be in use. So you’d have to stop the NTDS service on the DC, or run esentutl against an offline copy of the database.

But the good news is that in WS2016 and above we can now dump a LIVE DB header with the /vss switch! The command you need would be “esentutl /mh ntds.dit /vss” (note: must be run as administrator).

All these numbered database pages are logically “owned” by various B+ trees where the actual data for the client is contained. All these B+ trees have a “type of tree”, and all of a tree’s pages have a “placement in the tree” flag (Root, Leaf, or implicitly Internal if neither root nor leaf).

Ok, Brett, that was “proper” tree and page talk –  I think we need some pictures to show them…

Logically the ownership / containing relationship looks like this:

[Diagram: B+ trees owning the database pages]

More about B+ Trees

The pages are in turn arranged into B+ trees, where the top page is known as the ‘Root’ page and the bottom pages are ‘Leaf’ pages, where all the data is kept. Something like this (note this particular example does not show ‘Internal’ B+ tree pages):

[Diagram: a B+ tree with a root page and leaf pages]

  • The upper / parent page has partial keys indicating that all entries with 4245 + A* can be found in pgno 13, and all entries with 4245 + E* can be found in pgno 14, etc.

  • Note this is a highly simplified representation of what ESE does … it’s a bit more complicated.

  • This is not specific to ESE; many database engines have either B trees or B+ trees as a fundamental arrangement of data in their database files.

The Different trees

You should know that there are different types of B+ trees inside the ESE database that are needed for different purposes. These are:

  1. Data / Primary Trees – hold the table’s primary records, which are used to store data for regular (and small) column data.

  2. Long Value (LV) Trees – used to store long values; in other words, large chunks of data which don’t fit into the primary record.

  3. Index Trees – B+ trees used to store indexes.

  4. Space Trees – used to track which pages are owned and which are free / available as new pages for a given B+ tree. Each of the previous three types of B+ tree (Data, LV, and Index) may (if the tree is large) have a set of two space trees associated with it.

Storing large records

Each row of a table is limited to 8k (or whatever the page size is) in Active Directory and AD LDS; that is, each record has to fit into a single 8k database page. But you are probably aware that you can fit a LOT more than 8k into an AD object or an Exchange e-mail! So how do we store large records?

Well, we have different types of columns as illustrated below:

[Diagram: the different column types]

Tagged columns can be split out into what we call the Long Value (LV) tree. So in the tagged column we store a simple 4-byte number called a LID (Long Value ID), which points to an entry in the LV tree. We then take the large piece of data, break it up into small chunks, and prefix those with the key for the LID and the offset.

So, if every part of the record was a LID pointer to an LV, then essentially we can fit 1300 LV pointers onto the 8k page. BTW, this is what creates the 1300-attribute limit in AD; it’s all down to the ESE page size.

Now you can also start to see that when you are looking at a whole AD object you may read pages from various trees to get all the information about your object. For example, for a user with many attributes and group memberships you may have to get data from a page in the ”datatable” \ Primary tree + “datatable” \ LV tree + sd_table \ Primary tree + link_table \ Primary tree.

Index Trees

An index is used for a couple of purposes: firstly, to keep a list of the records in an intelligent order, such as by surname in alphabetical order; and secondly, to cut down the number of records to examine, which greatly helps speed up searches (especially when the ‘selectivity is high’, meaning few entries match).

Below is a visual illustration (with the B+ trees turned on their side to make the diagram easier) of a primary index, which is the DNT index in the AD database (the Data Tree), and a secondary index on dNSHostName. You can see that the secondary index only contains the records which have a dNSHostName populated; it is smaller.

[Diagram: the primary (DNT) index data tree and the dNSHostName secondary index]

You can also see that in the secondary index, the key is the data portion (the name), and the data is the actual key that links us back to the REAL record itself.

Inside a Database page

Each database page has a fixed header, which contains a checksum as well as other information, like how much free space is on the page and which B+ tree it belongs to.

Then we have these things called TAGS (or nodes), which store the data.

A node can be many things, such as a record in a database table or an entry in an index.

The TAGS are actually out of order on the page, but order is established by the tag array at the end of the page.

  • TAG 0 = Page External Header

This contains variable-sized special information on the page, depending upon the type of B+ tree and the type of page in the B+ tree (space vs. regular tree, and root vs. leaf).

  • TAGs 1, 2, 3, etc. are all “nodes” or lines, and their order is tracked.

The key & data are specific to the B+ tree type.

And TAG 1 is actually node 0! So here is a visual picture of what an ESE database page looks like:

[Diagram: an ESE database page – page header, TAG 0 external header, nodes, and the tag array at the end]

It is possible to calculate this key if you have an object’s primary key, which in AD is a DNT.

The formula for that (if you are ever crazy enough to need it) is:

  • Start with 0x7F, and if it is a signed INT append a 0x80000000 and then OR in the number

  • For example 4248 –> in hex 1098 –> as key 7F80001098 (note 5 bytes).

  • Note: Key buffer uses big endian, not little endian (like x86/amd64 arch).

  • If it was a 64-bit int, just insert zeros in the middle (9 byte key).

  • If it is an unsigned INT, start with 0x7F and just append the number.

  • Note: Long Value (LID) trees and ESE’s Space Trees (pgno) are special, no 0x7F (4 byte keys).

  • And finally other non-integers column types, such as String and Binary types, have a different more complicated formatting for keys.

Why is this useful? Because, for example, you can take the DNT of an object, calculate its key, and then seek to its page using esentutl.exe’s dump page (/m) functionality with the /k option.
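As a sketch, the integer-key rules above can be written in a few lines of Python (this covers only the 32-bit integer cases described here; string, binary, LID and space-tree keys follow the different rules noted above):

```python
def ese_key_int32(value: int, signed: bool = True) -> bytes:
    """Build the 5-byte ESE key for a 32-bit integer column value (e.g. an AD DNT).

    Signed INT: start with 0x7F, OR the value into 0x80000000, emit big-endian.
    Unsigned INT: start with 0x7F and just append the number big-endian.
    """
    body = (0x80000000 | value) if signed else value
    return bytes([0x7F]) + (body & 0xFFFFFFFF).to_bytes(4, "big")

# The example from the text: DNT 4248 -> hex 0x1098 -> key 7F80001098
assert ese_key_int32(4248).hex().upper() == "7F80001098"
```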

The Nodes also look different (containing different data) depending on the ESE B+tree type. Below is an illustration of the different nodes in a Space tree, a Data Tree, a LV tree and an Index tree.

[Diagram: node layout in a space tree, a data tree, an LV tree and an index tree – keys in green, data in dark blue]

The green are the keys. The dark blue is data.

What does a REAL page look like?

You can use esentutl to dump pages of the database if you are investigating some corruption for example.

Before we can dump a page, we want to find a page of interest (picking a random page could give you just a blank page), so first we need some info about the table schema. To start, you can dump all the tables and their associated root page numbers like this:

[Screenshot: esentutl /mm output filtered with findstr, showing tables with their objidFDP and pgnoFDP]

Note, we have filtered the output with findstr again to get a nice view of just the tables and their pgnoFDP and objidFDP. Findstr.exe is case sensitive, so use the exact format or use the /i switch.

objidFDP identifies this table in the catalog metadata. When looking at a database page we can use its objidFDP to tell which table this page belongs to.

pgnoFDP is the page number of the Father Data Page, the very top page of that B+ tree, also known as the root page. If you run esentutl /mm <dbname> on its own, you will see a huge list of every table and B+ tree (except internal “space” trees), including all the indexes.

So, in this example page 31 is the root page of the datatable here.

Dumping a page

You can dump a page with esentutl using /m and /p. Below is an example of dumping page 31 from the database – the root page of the “datatable” table as above.

[Screenshots: esentutl /m /p dump of page 31, the root page of the datatable]

The objidFDP is the number indicating which B+ tree the page belongs to, and cbFree tells us how much of this page is free (cb = count of bytes). Each database page has a double header checksum: one ECC (Error Correcting Code) checksum for single-bit data correction, and a higher-fidelity XOR checksum to catch all other errors, including errors of 3 or more bits that the ECC may not catch. In addition, we compute a logged data checksum from the page data, but this is not stored in the header and is only utilized by the Exchange 2016 Database Divergence Detection feature.

You can see this is a root page with 3 nodes (4 TAGS – remember, TAG 1 is node 0, also known as line 0! 🙂 and it is nearly empty (cbFree = 8092 bytes, so only about 100 bytes are used for these 3 nodes plus the page header and external header).


And notice the PageFlushType, which is related to the JET Flush Map file we could talk about in another post later.

The nodes here point to pages lower down in the tree. And we could dump a next level page (pgno: 1438)….and we can see them getting deeper and more spread out with more nodes.

[Screenshots: esentutl /m /p dump of the next-level page, pgno 1438]

So you can see this page has 294 nodes, which again all point to other pages. It is also a ParentOfLeaf page, meaning these pgno / page numbers actually point to leaf pages (with the final data on them).

Are you bored yet? 😄

Or are you enjoying this like a geek? Either way, we are nearly done with the page internals and the tree climbing here.

If you navigate further down, you will eventually get a page with some data on it. For example, let’s dump page 69, which TAG 6 is pointing to:

[Screenshots: esentutl /m /p dump of page 69, a leaf page]

So this one has some data on it (as indicated by the “Leaf page” flag under fFlags).

Finally, you can also dump the data (the contents of a node, i.e. a TAG) with the /n switch, like this:

[Screenshot: esentutl /m /n node dump output]

Remember: the /n specifier takes a pgno:line (node) specifier. This means that the :3 here dumped TAG 4 from the previous screen; note that trying to dump “/n69:4” would actually fail.

This /n switch will dump all the raw data on the page, along with information about the columns, their contents and their types. The output also needs some translation, because it gives us the column ID (711 in the above example) and not the attribute name in AD (or whatever your database may be). The application developer would then be able to translate those column IDs into something meaningful. For AD and ADLDS, we can translate those to attribute names using the source code.

Finally, there really should be no need to do this in real life, other than when you are debugging a database problem. However, we hope this provided a good and ‘realistic’ demo to help you understand and visualize the structure of an ESE database and how the data is stored inside it!

Stay tuned for more parts …. which Brett says will be significantly more useful to everyday administrators! 😉

The End!

Linda & Brett

TLS Handshake errors and connection timeouts? Maybe it’s the CTL engine….


Hi There! 

Marius and Tolu from the Directory Services Escalation Team. 

Today, we’re going to talk about a little twist on some scenarios you may have come across at some point, where TLS connections fail or time out for a variety of reasons.

You’re probably already familiar with some of the usual suspects like cipher suite mismatches, certificate validation errors and TLS version incompatibility, to name a few.   

Here are just some examples for illustration (but there is a wealth of information out there)  

Recently we’ve seen a number of cases with a variety of symptoms affecting different customers which all turned out to have a common root cause.  

We’ve managed to narrow it down to an unlikely source; a built-in OS feature working in its default configuration.  

We’re talking about the automatic root update and automatic disallowed roots update mechanisms based on CTLs.

Starting with Windows Vista, root certificates are updated on Windows automatically.

When a user on a Windows client visits a secure Web site (by using HTTPS/TLS), reads a secure email (S/MIME), or downloads an ActiveX control that is signed (code signing) and encounters a certificate which chains to a root certificate not present in the root store, Windows will automatically check the appropriate Microsoft Update location for the root certificate.  

If it finds it, it downloads it to the system. To the user, the experience is seamless; they don’t see any security dialog boxes or warnings and the download occurs automatically, behind the scenes.  

Additional information in:   

How Root Certificate Distribution Works 

During TLS handshakes, any certificate chains involved in the connection will need to be validated, and, from Windows Vista/2008 onwards, the automatic disallowed root update mechanism is also invoked to verify if there are any changes to the untrusted CTL (Certificate Trust List).

A certificate trust list (CTL) is a predefined list of items that are authenticated and signed by a trusted entity.

The mechanism is described in more detail in the following article:

An automatic updater of untrusted certificates is available for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2

It expands on the automatic root update mechanism (for trusted root certificates) mentioned earlier to let certificates that are compromised or untrusted in some way be specifically flagged as untrusted.

Customers therefore benefit from periodic automatic updates to both trusted and untrusted CTLs.

So, after the preamble, what scenarios are we talking about today?

Here are some examples of issues we’ve come across recently.

 

1)

  • Your users may experience browser errors after several seconds when trying to browse to secure (https) websites behind a load balancer.
  • They might receive an error like “The page cannot be displayed. Turn on TLS 1.0, TLS 1.1, and TLS 1.2 in the Advanced settings and try connecting to https://contoso.com again. If this error persists, contact your site administrator.”
  • If they try to connect to the website via the IP address of the server hosting the site, the https connection works after showing a certificate name mismatch error.
  • All TLS versions ARE enabled when checking in the browser settings:

Internet Options


 

2)

  • You have a 3rd party appliance making TLS connections to a Domain Controller via LDAPS (secure LDAP over SSL/TLS) which may experience delays of up to 15 seconds during the TLS handshake.
  • The issue occurs randomly when connecting to any eligible DC in the environment targeted for authentication.
  • There are no intervening devices that filter or modify traffic between the appliance and the DCs.

2a)

  • A very similar scenario* to the above is in fact described in the following article by our esteemed colleague, Herbert:

Understanding ATQ performance counters, yet another twist in the world of TLAs 

Where he details:

Scenario 2 

DC supports LDAP over SSL/TLS 

A user sends a certificate on a session. The server needs to check for certificate revocation, which may take some time.*

This becomes problematic if network communication is restricted and the DC cannot reach the Certificate Distribution Point (CDP) for a certificate.

To determine if your clients are using secure LDAP (LDAPs), check the counter “LDAP New SSL Connections/sec”. 

If there are a significant number of sessions, you might want to look at CAPI-Logging.

 

3)

  • A 3rd party meeting server performing LDAPS queries against a Domain Controller may fail the TLS handshake on the first attempt after exceeding a pre-configured timeout (e.g. 5 seconds) on the application side
  • Subsequent connection attempts are successful

 

So, what’s the story? Are these issues related in any way?

Well, as it turns out, they do have something in common.

As we mentioned earlier, certificate chain validation occurs during TLS handshakes.

Again, there is plenty of documentation on this subject.

During certificate validation operations, the CTL engine is periodically invoked to verify whether there are any changes to the untrusted CTLs.

In the example scenarios we described earlier, if the default public URLs for the CTLs are unreachable, and there is no alternative internal CTL distribution point configured (more on this in a minute), the TLS handshake will be delayed until the WinHttp call to access the default CTL URL times out.

By default, this timeout is usually around 15 seconds, which can cause problems when load balancers or 3rd party applications are involved and have their own (more aggressive) timeouts configured.

If we enable CAPI2 Diagnostic logging, we should be able to see evidence of when and why the timeouts are occurring.

We will see events like the following: 

Event ID 20 – Retrieve Third-Party Root Certificate from Network:

  

  • Trusted CTL attempt

Trusted CTL Attempt


 

  • Disallowed CTL attempt

 

Disallowed CTL Attempt


 

 

Event ID 53 error message details showing that we have failed to access the disallowed CTL:

 

Event ID 53


 

 The following article gives a more detailed overview of the CAPI2 diagnostics feature available on Windows systems, which is very useful when looking at any certificate validation operations occurring on the system:  

Troubleshooting PKI Problems on Windows Vista 

To help us confirm that the CTL updater engine is indeed affecting the TLS delays and timeouts we’ve described, we can temporarily disable it for both the trusted and untrusted CTLs and then attempt our TLS connections again.

 

To disable it:

  • Create a backup of this registry key (export and save a copy)

HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot

  • Then create the following DWORD registry values under the key

“EnableDisallowedCertAutoUpdate”=dword:00000000

“DisableRootAutoUpdate”=dword:00000001
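For convenience, the same two values can be collected into an importable .reg file (again, this is a temporary diagnostic step; revert it afterwards):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot]
"EnableDisallowedCertAutoUpdate"=dword:00000000
"DisableRootAutoUpdate"=dword:00000001
```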

 

After applying these steps, you should find that your previously failing TLS connections no longer time out. Your symptoms may vary slightly, but you should see speedier connection times, because we have eliminated the delay of trying and failing to reach the CTL URLs.

So, what now? 

We should now REVERT the above registry changes by restoring the backup we created, and evaluate the following, more permanent solutions.

We previously stated that disabling the updater engine should only be a temporary measure to confirm the root cause of the timeouts in the above scenarios.  

 

  • For the untrusted CTL:             

 

  • For the trusted CTL:                 
  • For server systems, you might consider deploying the trusted 3rd party CA certificates via GPO on an as-needed basis

Manage Trusted Root Certificates   

(particularly to avoid hitting the TLS protocol limitation described here:

SSL/TLS communication problems after you install KB 931125 )

 

  • For client systems, you should consider

Allowing access to the public allowed Microsoft CTL URL http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab

OR

Defining and maintaining an internal trusted CTL distribution point as outlined in Configure Trusted Roots and Disallowed Certificates 

OR

If you require a more granular control of which CAs are trusted by client machines, you can deploy the 3rd Party CA certificates as needed via GPO

Manage Trusted Root Certificates 

So there you have it. We hope you found this interesting, and now have an additional factor to take into account when troubleshooting TLS/SSL communication failures.

Deep Dive: Active Directory ESE Version Store Changes in Server 2019

$
0
0

Hey everybody. Ryan Ries here to help you fellow AD ninjas celebrate the launch of Server 2019.

Warning: As is my wont, this is a deep dive post. Make sure you’ve had your coffee before proceeding.

Last month at Microsoft Ignite, many exciting new features rolling out in Server 2019 were discussed. (Watch MS Ignite sessions here and here.)

But now I want to talk about an enhancement to on-premises Active Directory in Server 2019 that you won’t read or hear anywhere else. This specific topic is near and dear to my heart personally.

The intent of the first section of this article is to discuss how Active Directory’s sizing of the ESE version store has changed in Server 2019 going forward. The second section of this article will discuss some basic debugging techniques related to the ESE version store.

Active Directory, also known as NT Directory Services (NTDS), uses Extensible Storage Engine (ESE) technology as its underlying database.

One component of all ESE database instances is known as the version store. The version store is an in-memory temporary storage location where ESE stores snapshots of the database during open transactions. This allows the database to roll back transactions and return to a previous state in case the transactions cannot be committed. When the version store is full, no more database transactions can be committed, which effectively brings NTDS to a halt.

In 2016, the CSS Directory Services support team blog (also known as AskDS) published some previously undocumented (and some lightly-documented) internals regarding the ESE version store. Those new to the concept of the ESE version store should read that blog post first.

In the blog post linked to previously, it was demonstrated how Active Directory had calculated the size of the ESE version store since AD’s introduction in Windows 2000. When the NTDS service first started, a complex algorithm was used to calculate version store size. This algorithm included the machine’s native pointer size, the number of CPUs, the version store page size (based on an assumption which was incorrect on 64-bit operating systems), the maximum number of simultaneous RPC calls allowed, the maximum number of ESE sessions allowed per thread, and more.

Since the version store is a memory resource, it follows that the most important factor in determining the optimal ESE version store size is the amount of physical memory in the machine, and that – ironically – seems to have been the only variable not considered in the equation!

The way that Active Directory calculated the version store size did not age well. The original algorithm was written during a time when all machines running Windows were 32-bit, and even high-end server machines had maybe one or two gigabytes of RAM.

As a result, many customers have contacted Microsoft Support over the years for issues arising on their domain controllers that could be attributed to, or at least exacerbated by, an undersized ESE version store. Furthermore, even though the default ESE version store size can be augmented by the “EDB max ver pages (increment over the minimum)” registry setting, customers are often hesitant to use it because it is a complex topic that warrants more thorough documentation than has traditionally been available.

The algorithm is now greatly simplified in Server 2019:

When NTDS first starts, the ESE version store size is now calculated as 10% of physical RAM, with a minimum of 400MB and a maximum of 4GB.

The same calculation applies to physical machines and virtual machines. For virtual machines with dynamic memory, the calculation is based on the amount of “starting RAM” assigned to the VM.

The “EDB max ver pages (increment over the minimum)” registry setting can still be used, as before, to add additional buckets over the default calculation (even beyond 4GB if desired). The registry setting is expressed in buckets, not bytes. Version store buckets are 32KB each on 64-bit systems. (They are 16KB on 32-bit systems, but Microsoft no longer supports any 32-bit server OSes.) Therefore, setting the registry entry to 5000 (decimal) adds 5000 buckets, or about 156MB, to the default version store size.

A minimum of 400MB was chosen for backward compatibility: under the old algorithm, the default version store size for a DC with a single 64-bit CPU was ~410MB, regardless of how much memory it had. (As in previous Windows versions, there is no way to configure less than the 400MB minimum.) The advantage of the new algorithm is that the version store size now scales linearly with the amount of memory in the domain controller, where previously it did not.
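The bucket arithmetic above can be sketched as a quick aid. This is an illustrative calculation only; the helper names are hypothetical, and it assumes the 32KB bucket size that applies to 64-bit systems:

```python
BUCKET_KB = 32  # version store bucket size on 64-bit systems

def extra_mb_for_buckets(buckets: int) -> float:
    """MB added to the version store by a given registry value (in buckets)."""
    return buckets * BUCKET_KB / 1024

def buckets_for_extra_mb(extra_mb: float) -> int:
    """Approximate registry value (decimal) needed to add extra_mb of version store."""
    return int(extra_mb * 1024 // BUCKET_KB)
```

For example, `extra_mb_for_buckets(5000)` returns 156.25, matching the ~156MB figure mentioned above.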

Defaults:

Physical Memory in the Domain Controller    Default ESE Version Store Size
1GB                                         400MB
2GB                                         400MB
3GB                                         400MB
4GB                                         400MB
5GB                                         500MB
6GB                                         600MB
8GB                                         800MB
12GB                                        1.2GB
24GB                                        2.4GB
48GB                                        4GB
128GB                                       4GB
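The defaults above can be expressed as a small sketch. The helper is hypothetical, and the exact internal rounding is not documented; this simply mirrors the 10% / 400MB / 4GB rule as the table presents it, working in whole megabytes:

```python
def default_version_store_mb(ram_gb: int) -> int:
    """Server 2019 default ESE version store size: 10% of physical RAM,
    clamped to a 400MB minimum and a 4GB (4096MB) maximum."""
    ten_percent_mb = ram_gb * 100  # 10% of RAM, expressed in MB as the table rounds it
    return max(400, min(4096, ten_percent_mb))
```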

 

This new calculation will result in larger default ESE version store sizes for domain controllers with greater than 4GB of physical memory when compared to the old algorithm. This means more version store space to process database transactions, and fewer cases of version store exhaustion. (Which means fewer customers needing to call us!)

Note: This enhancement currently only exists in Server 2019 and there are not yet any plans to backport it to older Windows versions.

Note: This enhancement applies only to Active Directory and not to any other application that uses an ESE database such as Exchange, etc.

 

ESE Version Store Advanced Debugging and Troubleshooting

 

This section will cover some basic ESE version store triage, debugging and troubleshooting techniques.

As covered in the AskDS blog post linked previously, the performance counter used to see how many ESE version store buckets are currently in use is:

\\.\Database ==> Instances(lsass/NTDSA)\Version buckets allocated

 

Once that counter has reached its limit, (~12,000 buckets or ~400MB by default,) events will be logged to the Directory Services event log, indicating the exhaustion:

Figure 1: NTDS version store exhaustion.

The event can also be viewed graphically in Performance Monitor:

Figure 2: The plateau at 12,314 means that the performance counter “Version Buckets Allocated” cannot go any higher. The flat line represents a dead patient.

As long as the domain controller still has available RAM, try increasing the version store size using the previously mentioned registry setting. Increase it in gradual increments until the domain controller is no longer exhausting the ESE version store, or the server has no more free RAM, whichever comes first. Keep in mind that the more memory that is used for version store, the less memory will be available for other resources such as the database cache, so a sensible balance must be struck to maintain optimal performance for your workload. (i.e. no one size fits all.)

If the “Version Buckets Allocated” performance counter is still pegged at the maximum amount, then there is some further investigation that can be done using the debugger.

The eventual goal will be to determine the nature of the activity within NTDS that is primarily responsible for exhausting the domain controller of all its version store, but first, some setup is required.

First, generate a process memory dump of lsass on the domain controller while the machine is “in state” – that is, while the domain controller is at or near version store exhaustion. To do this, use the “Create dump file” option in Task Manager by right-clicking the lsass process on the Details tab. Alternatively, a tool such as Sysinternals’ procdump.exe can be used (with the -ma switch).

In case the issue is transient and only occurs when no one is watching, data collection can be configured on a trigger, using procdump with the -p switch.

Note: Do not share lsass memory dump files with unauthorized persons, as these memory dumps can contain passwords and other sensitive data.

 

It is a good idea to generate the dump after the “Version Buckets Allocated” performance counter has risen to an abnormally elevated level but before the version store has plateaued completely. This is because the database transaction responsible may be terminated once exhaustion occurs, and the thread would then no longer be present in the memory dump. If the guilty thread is no longer alive when the memory dump is taken, troubleshooting will be much more difficult.

Next, gather a copy of %windir%\System32\esent.dll from the same Server 2019 domain controller. The esent.dll file contains a debugger extension, but it is highly dependent on the Windows version and can output incorrect results if mismatched; it should match the version of Windows from which the memory dump was taken.

Next, download WinDbg from the Microsoft Store, or from this link.

Once WinDbg is installed, configure the symbol path for Microsoft’s public symbol server:

Figure 3: srv*c:\symbols*http://msdl.microsoft.com/download/symbol

Now load the lsass.dmp memory dump file, and load the esent.dll module that you had previously collected from the same domain controller:

Figure 4: .load esent.dll

Now the ESE database instances present in this memory dump can be viewed with the command !ese dumpinsts:

Figure 5: !ese dumpinsts – The only ESE instance present in an lsass dump on a DC should be NTDSA.

Notice that the current version bucket usage is 11,189 out of 12,802 buckets total. The version store in this memory dump is very nearly exhausted. The database is not in a particularly healthy state at this moment.

The command !ese param <instance> can also be used, specifying the same database instance obtained from the previous command, to see the global configuration parameters for that ESE database instance. Notice that JET_paramMaxVerPages is set to 12800 buckets, which is 400MB worth of 32KB buckets:

Figure 6: !ese param <instance>

To see much more detail regarding the ESE version store, use the !ese verstore <instance> command, specifying the same database instance:

 

Figure 7: !ese verstore <instance>

The output of the command above shows us that there is an open, long-running database transaction, how long it’s been running, and which thread started it. This also matches the same information displayed in the Directory Services event log event pictured previously.

Neither the event log event nor the esent debugger extension was always quite so helpful; both have been enhanced in recent versions of Windows.

In older versions of the esent debugger extension, the thread ID could be found in the dwTrxContext field of the PIB, (command: !ese dump PIB 0x000001AD71621320) and the start time of the transaction could be found in m_trxidstack as a 64-bit file time. But now the debugger extension extracts that data automatically for convenience.

Switch to the thread that was identified earlier and look at its call stack:

Figure 8: The guilty-looking thread responsible for the long-running database transaction.

The four functions that are highlighted by a red rectangle in the picture above are interesting, and here’s why:

When an object is deleted on a domain controller, and that object has links to other objects, those links must also be deleted/cleaned by the domain controller. For example, when an Active Directory user becomes a member of a security group, a database link between the user and the group is created that represents that relationship. The same principle applies to all linked attributes in Active Directory. If the Active Directory Recycle Bin is enabled, then the link-cleaning process will be deferred until the deleted object surpasses its Deleted Object Lifetime – typically 60 or 180 days after being deleted. This is why, when the AD Recycle Bin is enabled, a deleted user can be easily restored with all of its group memberships still intact – because the user account object’s links are not cleaned until after its time in the Recycle Bin has expired.

The trouble begins when an object with many backlinks is deleted. Some security groups, distribution lists, RODC password replication policies, etc., may contain hundreds of thousands or even millions of members. Deleting such an object will give the domain controller a lot of work to do. As you can see in the thread call stack shown above, the domain controller had been busily processing links on a deleted object for 47 seconds and still wasn’t done. All the while, more and more ESE version store space was being consumed.

When the AD Recycle Bin is enabled, this can cause even more confusion, because no one remembers that they deleted that gigantic security group 6 months ago. A time bomb has been sitting in the AD Recycle Bin for months. But suddenly, AD replication grinds to a standstill throughout the domain and the admins are scrambling to figure out why.

The performance counter “\\.\DirectoryServices ==> Instances(NTDS)\Link Values Cleaned/sec” would also show increased activity during this time.

There are two main ways to fight this: increase the version store size with the “EDB max ver pages (increment over the minimum)” registry setting, decrease the batch size with the “Links process batch size” registry setting, or use a combination of both. Domain controllers process the deletion of these links in batches. The smaller the batch size, the shorter the individual database transactions will be, which relieves pressure on the ESE version store.
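The effect of the batch size can be illustrated with a toy model (entirely hypothetical numbers; the real per-link version store cost varies): the version store must hold undo records for the whole open transaction, so peak usage tracks the batch size rather than the total number of links being cleaned.

```python
def peak_version_store_kb(total_links: int, batch_size: int,
                          kb_per_link: float = 0.5) -> float:
    """Toy model of link-value cleanup: each batch is one database transaction,
    so the largest open transaction (and thus peak version store usage) is
    bounded by the batch size, not by the total link count."""
    largest_open_txn = min(batch_size, total_links)
    return largest_open_txn * kb_per_link
```

In this model, deleting a group with a million members creates the same peak pressure as deleting one with 100,000 members once the batch size is the limiting factor, and shrinking the batch shrinks each individual transaction.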

Though the default values are properly sized for almost all Active Directory deployments and most administrators should never have to worry about them, the two previously mentioned registry settings are supported, and well-informed enterprise administrators are encouraged to tweak the values, within reason, to avoid ESE version store depletion. Contact Microsoft customer support before making any modifications if there is any uncertainty.

At this point, one could continue diving deeper, using various approaches (e.g. consider not only debugging process memory, but also consulting DS Object Access audit logs, object metadata from repadmin.exe, etc.) to find out which exact object with many thousands of links was just deleted, but in the end that’s a moot point. There’s nothing else that can be done with that information. The domain controller simply must complete the work of link processing.

In other situations, however, the same techniques will make it apparent that an incoming LDAP query from some network client is performing inefficient searches, leading to version store exhaustion. Other times it will be DirSync clients, or something else entirely. In those instances, there may be more that can be done besides tweaking the version store variables, such as tracking down and silencing the offending network client(s), optimizing LDAP queries, creating database indices, etc.

 

Thanks for reading,
– Ryan “Where’s My Ship-It Award” Ries

