This post describes how I troubleshot the frequent AOS crashes we encountered in one of our client's environments.
Pattern of AX Crash:
We have 4 AOS servers: 2 are dedicated to clients, and the other 2 share the load for batches and SSRS reports.
The frequent crashes were on the 2 AOS instances that served the clients.
How did we know it was a crash?
When a crash happened, both AOS services suddenly went into a stopped state, meaning they were no longer running, and users complained that they could not connect to AX. This is very frustrating for users, because the application goes down while they are in the middle of something.
Some of the errors seen in the event logs were:
LCS Crash Analysis comes to the rescue:
I learned that LCS offers a useful utility called Crash Dump Analysis. After some searching, I came across this interesting blog from MSDN, which shows the steps to get started with Crash Analysis.
Crash Dump Analysis is a tool that helps you evaluate the reason for an AOS crash.
The input for this tool is a mini dump file, which is generated when the AOS crashes. It is similar to the Windows mini dump file that gets generated when the Windows OS crashes.
There are many tools available to generate the dump file. Of those, I tried WER.
WER (Windows Error Reporting)
WER is built into Windows Server 2008/2008 R2 and can be configured to automatically create and store memory dumps from an AOS crash.
It is configured through a few registry modifications, specifically by creating the following registry key:
HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps\Ax32Serv.exe
Several registry values have to be added under this new key as well (DumpFolder, DumpCount, DumpType, CustomDumpFlags).
Do NOT set the DumpType to 1 or 2, but set the DumpType to 0 and set CustomDumpFlags to 7015 decimal (0x1B67 hexadecimal). Invalid settings will generate dumps without the required information.
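To see what the CustomDumpFlags value actually asks for, it can be decoded into individual flags. The sketch below assumes the value is a bitmask of the Windows MINIDUMP_TYPE flags; the `MinidumpType` class and `decode_flags` helper are names made up for this illustration, and only the flags up to 0x1000 are listed.

```python
from enum import IntFlag


class MinidumpType(IntFlag):
    """Subset of the Windows MINIDUMP_TYPE flags (illustrative names)."""
    WITH_DATA_SEGS = 0x0001
    WITH_FULL_MEMORY = 0x0002
    WITH_HANDLE_DATA = 0x0004
    FILTER_MEMORY = 0x0008
    SCAN_MEMORY = 0x0010
    WITH_UNLOADED_MODULES = 0x0020
    WITH_INDIRECTLY_REFERENCED_MEMORY = 0x0040
    FILTER_MODULE_PATHS = 0x0080
    WITH_PROCESS_THREAD_DATA = 0x0100
    WITH_PRIVATE_READ_WRITE_MEMORY = 0x0200
    WITHOUT_OPTIONAL_DATA = 0x0400
    WITH_FULL_MEMORY_INFO = 0x0800
    WITH_THREAD_INFO = 0x1000


def decode_flags(value):
    """Return the names of all flags that are set in the given bitmask."""
    return [flag.name for flag in MinidumpType if value & flag.value]


if __name__ == "__main__":
    # 7015 decimal and 0x1B67 hexadecimal are the same number.
    for name in decode_flags(7015):
        print(name)
```

Read this way, the value includes the full-memory flag along with thread, handle and module information, which is presumably why the simpler DumpType presets are not enough for the analysis.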
Below is a snapshot of how the registry looks after I made the change.
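The same settings can also be written down as a .reg file, which is handy when several AOS servers need the identical configuration. This is a minimal sketch, not the exact file I used: the dump folder `C:\AOSDumps` and the DumpCount of 5 are hypothetical values, and DumpFolder is assumed to be stored as an expandable string (REG_EXPAND_SZ). Only DumpType and CustomDumpFlags come from the settings described above.

```python
# Registry key from this post; WER reads per-application settings from here.
KEY = (r"HKEY_LOCAL_MACHINE\Software\Microsoft\Windows"
       r"\Windows Error Reporting\LocalDumps\Ax32Serv.exe")


def expand_sz(text):
    """Encode text as a .reg hex(2) value: UTF-16LE bytes, null-terminated."""
    raw = (text + "\0").encode("utf-16-le")
    return "hex(2):" + ",".join(f"{b:02x}" for b in raw)


def build_reg_file(dump_folder, dump_count):
    """Build the text of a .reg file for the WER LocalDumps settings."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        f'"DumpFolder"={expand_sz(dump_folder)}',
        f'"DumpCount"=dword:{dump_count:08x}',
        '"DumpType"=dword:00000000',         # 0 = custom dump
        '"CustomDumpFlags"=dword:00001b67',  # 7015 decimal
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical folder and count; adjust to your environment.
    print(build_reg_file(r"C:\AOSDumps", 5))
```

Review the generated text before importing it, and make sure the dump folder exists and has enough free space, since these dumps can be large.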
Another good tool, the Debug Diagnostic Tool (v2), can also be used to capture dump files. For more detailed insights, please refer to the instructions in this blog.
I chose WER because no installation is needed. You just add a few registry values and it generates the crash dump file. Once the dmp file is generated and zipped, it serves as the input to the LCS Crash Dump Analysis tool.
A detailed step-by-step process to upload the dump file is shown here.
Some tips when uploading the dmp file:
a. The dmp file needs to be zipped before it is uploaded for analysis.
b. When uploading large files, keep an eye on the browser connection, as I have seen it drop out. I keep the browser active by clicking on the address bar from time to time.
Once the file is uploaded and the analysis is done, it produces an HTML report.
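If you prefer to script the zipping step from tip (a), here is a small sketch using Python's standard zipfile module. The `zip_dump` helper and the dump file name in the usage example are made up for illustration; any zip tool works just as well.

```python
import zipfile
from pathlib import Path


def zip_dump(dump_path):
    """Compress a single .dmp file into a .zip next to it and return its path."""
    dump = Path(dump_path)
    archive = dump.with_suffix(".zip")
    # ZIP_DEFLATED compresses the data; memory dumps usually shrink a lot.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname keeps only the file name, not the full folder path.
        zf.write(dump, arcname=dump.name)
    return archive


if __name__ == "__main__":
    # Hypothetical dump location; replace with the file WER produced.
    print(zip_dump(r"C:\AOSDumps\Ax32Serv.exe.1234.dmp"))
```

A smaller archive also shortens the upload, which helps with the connection dropouts mentioned in tip (b).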
The report came with the following recommendations:
a. Recommended Kernel upgrade to CU9 - High risk (340 days old)
b. Recommended hotfixes to be applied related to AOS Crash
c. Exception - Memory Access violation
In our case, the crash was caused by a memory access violation, which was triggered by a custom SSRS report run by a user. We fixed that report and the crash no longer occurs.
So, to conclude: the next time you come across an AOS crash, I highly recommend trying the LCS Crash Dump Analysis tool to get to the root of the problem and fix it. I hope this post helps you get started with the tool.