Friends,
This post is about my recent experience troubleshooting the frequent AOS crashes we encountered in one of the client's environments.
Pattern of AX Crash:
We have 4 AOS servers: 2 of them are dedicated to clients, and the other 2 share the load for batches and SSRS reports.
We had frequent crashes on the 2 AOS instances which served the clients.
How did we know it was a crash?
Because when the crash happened, both AOS services suddenly went into a stopped state, meaning they were no longer running, and the users complained that they couldn't connect to AX. It's very frustrating for the users, as their application goes down while they are in the middle of something.
Some of the errors seen in the event logs were:
LCS Crash Analysis comes to the rescue:
I learned that there's a cool utility available in LCS known as Crash Dump Analysis. After some searching, I came across this interesting blog from MSDN which shows the steps to get started with Crash Analysis.
Basically, Crash Dump Analysis is a tool which helps you determine the reason for an AOS crash.
The input needed for this tool is a mini dump file, which is generated when the AOS crashes. We can think of this as similar to the Windows mini dump file which gets generated when the Windows OS crashes.
There are many tools available to generate the dump file and, out of those, I tried WER.
WER (Windows Error Reporting)
WER is built into Windows Server 2008/2008 R2 and can be configured to automatically create and store memory dumps from an AOS crash.
The configuration is done through a few registry modifications, specifically by creating the following registry key:
HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps\Ax32Serv.exe
In this new Registry Key several Registry Values have to be added as well (DumpFolder, DumpCount, DumpType, CustomDumpFlags).
Please note:
Do NOT set DumpType to 1 or 2. Set DumpType to 0 (custom dump) and set CustomDumpFlags to 7015 decimal (0x1B67 hexadecimal). Invalid settings will generate dumps without the required information.
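For the curious: the value 7015 is a combination of bit flags. As far as I know, these come from the MINIDUMP_TYPE enumeration in the Windows debugging headers. Below is a small Python sketch (not part of the original setup) which decodes the value. The class name MinidumpType, the helper decode_flags and the shortened flag names (the SDK names carry a MiniDump prefix) are my own choices for illustration, and the flag values are listed from my reading of that enumeration, so please double-check them against the official documentation.

```python
from enum import IntFlag


class MinidumpType(IntFlag):
    """Subset of MINIDUMP_TYPE bit flags (assumed values, 'MiniDump' prefix dropped)."""
    WithDataSegs = 0x0001
    WithFullMemory = 0x0002
    WithHandleData = 0x0004
    FilterMemory = 0x0008
    ScanMemory = 0x0010
    WithUnloadedModules = 0x0020
    WithIndirectlyReferencedMemory = 0x0040
    FilterModulePaths = 0x0080
    WithProcessThreadData = 0x0100
    WithPrivateReadWriteMemory = 0x0200
    WithoutOptionalData = 0x0400
    WithFullMemoryInfo = 0x0800
    WithThreadInfo = 0x1000
    WithCodeSegs = 0x2000


def decode_flags(value):
    """Return the names of all flags whose bit is set in value."""
    return [flag.name for flag in MinidumpType if value & flag]


if __name__ == "__main__":
    # 7015 decimal is the CustomDumpFlags value recommended above
    for name in decode_flags(7015):
        print(name)
```

Running it lists the individual options that the single number switches on, which makes it easier to see why a plain DumpType of 1 or 2 is not equivalent.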
Below is a snapshot of how the registry looks after I made the change.
There's another good tool, Debug Diagnostic Tool (v2), which can also be used to capture dump files. For more detailed insights, please refer to the instructions in this blog.
I chose WER as there is no installation needed. It's just a matter of enabling some registry keys, and it generates the crash dump file. Once a dmp file is generated and zipped, it serves as the input to the LCS Crash Dump Analysis tool.
A detailed step-by-step process to upload the dump file is shown here.
Some tips when uploading the dmp file:
a. The dmp file needs to be zipped before it gets uploaded for analysis.
b. I have observed that when uploading large files, the browser connection can drop out, so one needs to keep an eye on it. I just keep the browser active by clicking on the browser address bar.
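The zipping step from tip (a) is easy to script. Here is a minimal Python sketch, assuming the dumps land in the DumpFolder you configured and have a .dmp extension; the function name zip_newest_dump is just something I made up for illustration. It picks the most recently modified dump and writes a zip next to it.

```python
import zipfile
from pathlib import Path


def zip_newest_dump(dump_folder):
    """Zip the most recently modified .dmp file in dump_folder.

    Returns the path of the created archive, or None if no dump exists.
    """
    dumps = sorted(Path(dump_folder).glob("*.dmp"),
                   key=lambda p: p.stat().st_mtime)
    if not dumps:
        return None
    newest = dumps[-1]
    archive = newest.with_suffix(".zip")
    # ZIP_DEFLATED compresses the dump, which helps with large uploads
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(newest, arcname=newest.name)
    return archive
```

You would call it with the same folder you set as DumpFolder in the registry and then upload the resulting zip to LCS.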
Once the file is uploaded and the analysis is done, it produces an HTML report.
The report came with the following recommendations:
a. Recommended Kernel upgrade to CU9 - High risk (340 days old)
b. Recommended hotfixes to be applied related to AOS Crash
c. Exception - Memory Access violation
In our case, the main reason for the crash was a Memory Access Violation, which was triggered by a custom SSRS report run by a user. We fixed that report and the crash no longer occurs.
So, to conclude, the next time you come across an AOS crash, it's highly recommended to try out the LCS Crash Dump Analysis tool to get to the root of the problem and fix it. Hope this post helps you get used to the tool.
Thanks for the post.
FYI: We have a solution in place to help reduce issues caused by the crashes; i.e. to get the AOS back up immediately after a crash so users aren't severely impacted.
We have a custom event log which uses the following XML filter:
<QueryList>
<Query Id="0" Path="Application">
<Select Path="Application">*[System[Provider[@Name='Windows Error Reporting'] and (Level=4 or Level=0) and (EventID=1001)]] and *[EventData[(Data='APPCRASH' ) and (Data='Ax32Serv.exe')]] </Select>
</Query>
</QueryList>
We then have a batch file which records the status of all AX services on all machines, starts the AOS on the server affected by the crash, records the status of all AX services on all machines again after the restart, and then mails the support team and our service desk with the logs attached. That way we're aware that the crash occurred (so we can look into why, and also be ready for any related users reporting issues at that time). The logs also let us see whether any other services were affected, give us proof that the service came back up as expected, and show whether the crash caused the service to stop in the first place or the event was raised without stopping the service.
This is definitely a hacky solution, but it's also very effective, and saves us a lot of pain should something crash out of hours, as it ensures the system's available again immediately, without waiting for support (i.e. as we don't have 24/7 coverage).
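To give an idea of the restart step, here is a rough Python sketch of the same idea (our real version is a batch file, and this is not it). It assumes the Windows sc.exe tool is available and that the AOS service is named AOS60$01, which is only a guess for a first AX 2012 instance, so substitute your own service name. The helper names are hypothetical, and the mailing part is left out.

```python
import re
import subprocess

SERVICE_NAME = "AOS60$01"  # assumption: replace with your AOS service name


def parse_state(sc_output):
    """Extract the state word (e.g. RUNNING, STOPPED) from 'sc query' output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else "UNKNOWN"


def run_sc(*args):
    """Run sc.exe with the given arguments and return its text output."""
    result = subprocess.run(["sc.exe", *args], capture_output=True, text=True)
    return result.stdout


def restart_if_stopped(service, run=run_sc):
    """Record the state, start the service if it is stopped, record the state again."""
    log = []
    before = parse_state(run("query", service))
    log.append(f"before: {service} is {before}")
    if before == "STOPPED":
        run("start", service)
        log.append(f"issued start for {service}")
    after = parse_state(run("query", service))
    log.append(f"after: {service} is {after}")
    return log


if __name__ == "__main__":
    for line in restart_if_stopped(SERVICE_NAME):
        print(line)
```

The returned log lines are what you would attach to the notification mail. Note that right after a start request the service may still report START_PENDING, so a real script would probably wait and query again.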
Should it help others, here's some code to create those registry entries for you:
Clear-Host
# Change this to your desired dump directory
$dumpPath = 'C:\Program Files\Microsoft Dynamics AX\60\Server\MyInstanceName\Log'
$registryPath = 'HKLM:\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps\Ax32Serv.exe'
# Make sure the dump folder exists
New-Item -ItemType Directory -Path $dumpPath -Force | Out-Null
# Create (or recreate) the WER key for Ax32Serv.exe
$key = New-Item -Path $registryPath -ItemType Directory -Force
# DumpType 0 means a custom dump, so CustomDumpFlags (7015 = 0x1B67) is used
$key | New-ItemProperty -Name 'CustomDumpFlags' -PropertyType DWord -Value 7015 -Force | Out-Null
$key | New-ItemProperty -Name 'DumpCount' -PropertyType DWord -Value 3 -Force | Out-Null
$key | New-ItemProperty -Name 'DumpFolder' -PropertyType ExpandString -Value $dumpPath -Force | Out-Null
$key | New-ItemProperty -Name 'DumpType' -PropertyType DWord -Value 0 -Force | Out-Null
The Dump Analysis tool is a good choice when you need to identify the cause of an AOS crash in AX, but you need some advanced technical skills in debugging and in relating the Axapta issues to what the tool reports.