IT Support: System Hang Analysis
Fyi.
Santosh Bhise
Technical Support - IBM - FinancePlus India
E-mail :- bhisesan@wm.ibm.com | Mobile :- 8108168778
5th Floor, Empire Plaza II, Situated On CTS No 9, Village Hariyali, LBS Marg, Near Gandhi Nagar, Vikhroli (West) Mumbai - 400083
“For all IT Issues please log a call in Worldwide IT Self-Service – “https://wppit.service-now.com/selfservice/raise_incident.do”
Dear Santosh,
With reference to the action plan from MS support below, please raise a support ticket with TSS for the latest firmware and driver updates.
Regards,
Ashutosh
Hi Ashutosh,
I understand that the issue is intermittent and has occurred only once so far, and that the server is currently up and running fine.
I have reviewed the details and diagnostic logs provided and, considering the nature and frequency of the issue, drawn up the next plan of action for an effective root cause analysis. Kindly read through the email completely and let me know if you have any queries on any of the points mentioned.
I would also like to inform you that a server hang is one of those scenarios where data must be captured while the issue is occurring in order to find the root cause. In such scenarios, the most effective approach is to crash the machine and capture a complete memory dump of the hung system (along with a remote Performance Monitor trace). A complete memory dump file records all the contents of system memory (kernel + user mode content) and contains data from the processes that were running when the dump was collected. By looking at the hang dump, we will be able to identify the cause of the issue.
So it is important that we capture the right set of data at the time of the issue. Hence my recommendation is to capture a memory dump along with a remote Perfmon log at the next occurrence, so that we snapshot the whole server and identify the culprit.
It is also difficult to draft an RCA just with MS Diagnostics logs as they will only offer the environmental overview of the server and cannot be used for System hang analysis.
When dealing with server hangs, it is important to distinguish between what we call a hard hang and a soft hang. This will oftentimes help us at least diagnose the basic
problem based on what we can and cannot do on the machine. For example, if we are unable to ping the server, toggle the NumLock or Caps Lock functions via the keyboard
or get any sort of mouse cursor responsiveness, then we are most likely dealing with a hard hang. These issues are generally hardware related (possibly driver related), but
seldom due to a Windows OS configuration issue or Memory Leak. In the event of a hard hang, the first step is to contact your hardware vendor to run a diagnostic on the
system.
When a machine is in a soft hang state it is mostly unresponsive, but the kernel is still functional at a very low level – for example, the ping test, or toggling Numlock will work
fine. In a soft hang state, you may not be able to log on to the machine either locally or via Terminal Services or you may experience a blank desktop – however network and
printer shares may still be accessible. This is more typical of the type of symptoms we see during memory depletion or a process deadlock.
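The hard/soft distinction above can be sketched as a small decision function. This is only an illustration of the rule of thumb, not a diagnostic tool: the names HangSymptoms and classify_hang are hypothetical, and treating "only the mouse responds" as inconclusive is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class HangSymptoms:
    """Observations taken while the server is hung (hypothetical structure)."""
    responds_to_ping: bool
    numlock_toggles: bool
    mouse_cursor_moves: bool


def classify_hang(s: HangSymptoms) -> str:
    """Apply the rule of thumb: nothing responds -> hard, kernel still alive -> soft."""
    if not (s.responds_to_ping or s.numlock_toggles or s.mouse_cursor_moves):
        # No ping, no keyboard LED toggle, no cursor: most likely a hard hang.
        return "hard"
    if s.responds_to_ping or s.numlock_toggles:
        # The kernel is still servicing the network or keyboard at a low level.
        return "soft"
    # Assumption: a moving cursor alone is not enough to decide either way.
    return "inconclusive"


print(classify_hang(HangSymptoms(True, True, True)))     # soft
print(classify_hang(HangSymptoms(False, False, False)))  # hard
```

A server that still answers ping while logons and RDP fail would fall into the soft category under this sketch.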
I also looked at the diagnostic logs uploaded and below are my findings, recommendation, and preventive action plans:
MSDT Logs review
Error 8/10/2021 11:44:59 AM Service Control Manager 7022 None The HP ProLiant Agentless Management Service service hung on starting.
Error 8/10/2021 11:45:02 AM Service Control Manager 7022 None The Internet Connection Sharing (ICS) service hung on starting.
Error 8/10/2021 11:45:32 AM Service Control Manager 7022 None The VisualCron service hung on starting.
Error 8/10/2021 11:49:39 AM Service Control Manager 7022 None The Background Intelligent Transfer Service service hung on starting.
---------------------------------------------------------------------
Registry Current size: 171 MB Percent In Use: % Limit: MB RegistrySizeLimit: <value not present>
WMI Repository Current size: 131.15 MB (c:\windows\system32\wbem\repository\objects.data)
Page Files C:\pagefile.sys Current: 2432 MB Initial: <system managed> Maximum: <system managed>
---------------------------------------------------------------------
===========================================================================================================================================
========
LAST 5 RESTARTS (6005 after 6006 = CLEAN, 6005 after 6008 = DIRTY, 6005 after 41 = DIRTY)
===========================================================================================================================================
========
Type Date/Time ID Source Description
-------------------------------------------------------------------------------------------------------------------------------------------
--------
Tue Aug 10 2021 11:40 AM 41 Microsoft-Windows-Kernel-Power The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
DIRTY Tue Aug 10 2021 11:41 AM 6005 EventLog The Event log service was started.
Tue Aug 10 2021 11:41 AM 6008 EventLog The previous system shutdown at 11:36:39 AM on 8/10/2021 was unexpected.
CLEAN Mon Aug 09 2021 12:51 PM 6005 EventLog The Event log service was started.
Mon Aug 09 2021 12:46 PM 6006 EventLog The Event log service was stopped.
CLEAN Tue Jul 20 2021 2:49 AM 6005 EventLog The Event log service was started.
Tue Jul 20 2021 2:45 AM 6006 EventLog The Event log service was stopped.
CLEAN Mon Jul 19 2021 11:09 PM 6005 EventLog The Event log service was started.
Mon Jul 19 2021 11:03 PM 6006 EventLog The Event log service was stopped.
CLEAN Thu Jul 15 2021 2:50 AM 6005 EventLog The Event log service was started.
Thu Jul 15 2021 2:42 AM 6006 EventLog The Event log service was stopped.
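The CLEAN/DIRTY rule in the table header (6005 after 6006 = CLEAN, 6005 after 6008 or 41 = DIRTY) can be sketched as follows. This is a minimal illustration, assuming the event IDs are supplied oldest first and that the marker event is listed before the 6005 it belongs to. The function name classify_restarts is hypothetical.

```python
CLEAN_MARKERS = {6006}      # Event log service was stopped (orderly shutdown)
DIRTY_MARKERS = {6008, 41}  # Unexpected shutdown / Kernel-Power reboot
STARTUP_EVENT = 6005        # Event log service was started


def classify_restarts(event_ids):
    """Label each 6005 startup by the most recent marker event seen before it."""
    results = []
    last_marker = None
    for event_id in event_ids:
        if event_id in CLEAN_MARKERS or event_id in DIRTY_MARKERS:
            last_marker = event_id
        elif event_id == STARTUP_EVENT:
            if last_marker in DIRTY_MARKERS:
                results.append("DIRTY")
            elif last_marker in CLEAN_MARKERS:
                results.append("CLEAN")
            else:
                results.append("UNKNOWN")
            last_marker = None  # each startup consumes its marker
    return results


# Two orderly restarts followed by one preceded by event 41.
print(classify_restarts([6006, 6005, 6006, 6005, 41, 6005]))
# ['CLEAN', 'CLEAN', 'DIRTY']
```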
>> Please check the below drivers as they may be outdated. Consider upgrading them
>> While we provide the storage drivers, these are just the default, basic versions. We strongly recommend using the hardware vendor's drivers instead.
==============================================================================================================================
STORAGE/NETWORK/AUDIO/VIDEO DRIVERS (currently using hardware resources)
==============================================================================================================================
Type Name Manufacturer Version Date Size Description
-----------------------------------------------------------------------------------------------------------------------------
STORAGE (SCSI) HPCISSS3.SYS PMC-Sierra, Inc. 63.10.0.64 03/02/2015 171 KB (174,776 bytes) Smart HBA H241 Controller
STORAGE (SCSI) HPCISSS3.SYS PMC-Sierra, Inc. 63.10.0.64 03/02/2015 171 KB (174,776 bytes) Smart Array P440ar Controller
STORAGE (IDE) STORAHCI.SYS Microsoft Corporation 6.3.9600.16384 08/22/2013 105 KB (107,872 bytes) Standard SATA AHCI Controller
NETWORK B57ND60A.SYS Broadcom Corporation 17.0.0.3 05/06/2015 456 KB (467,224 bytes) HP Ethernet 1Gb 4-port 331i Adapter
VIDEO MXG2HDO64.SYS Matrox Graphics Inc. 9.15.1.102 08/07/2013 620 KB (634,632 bytes) Matrox G200eh (HP) WDDM 1.2
Memory Utilization:
Utilization: 48%
Memory: 16128 MB
===========================================================================================================================================
========
PERFORMANCE INFORMATION
===========================================================================================================================================
========
MEMORY object (Win32_PerfFormattedData_PerfOS_Memory)
-------------------------------------------------------------------------------------------------------------------------------------------
--------
Total Physical Memory* 15.75 GB (16,127.61 MB) (16911028224 bytes)
Memory\Available Bytes 8.35 GB (8,551.19 MB) (8966569984 bytes)
Memory\Pool Non Paged Bytes 1.43 GB (1,465.56 MB) (1536753664 bytes)
Memory\Pool Paged Bytes 1.01 GB (1,032.29 MB) (1082429440 bytes)
Memory\Pool Paged Resident Bytes 0.93 GB (954.92 MB) (1001304064 bytes)
Memory\System Cache Resident Bytes 1.28 GB (1,307.15 MB) (1370648576 bytes)
Memory\Commit Limit 18.12 GB (18,559.61 MB) (19461165056 bytes)
Memory\Committed Bytes 7.36 GB (7,535.75 MB) (7901806592 bytes)
Memory\% Committed Bytes In Use 40 %
Memory\Free System Page Table Entries 16,452,290
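As a quick cross-check of the figures above, the % Committed Bytes In Use counter is the ratio of Committed Bytes to Commit Limit. The sketch below recomputes it from the two byte values in the report. The function name is hypothetical and the snippet is only a worked calculation.

```python
def percent_committed_in_use(committed_bytes: int, commit_limit_bytes: int) -> float:
    """Return Committed Bytes as a percentage of the Commit Limit."""
    if commit_limit_bytes <= 0:
        raise ValueError("commit limit must be positive")
    return 100.0 * committed_bytes / commit_limit_bytes


# Values copied from the MEMORY object section of the report.
pct = percent_committed_in_use(7_901_806_592, 19_461_165_056)
print(f"{pct:.1f} %")
```

The result is roughly 40.6 %, consistent with the 40 % shown in the report.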
CPU Usage
>> CPU usage from WMI and SVCHost on an average appears a bit high
Please check whether more recent versions are available for the drivers below, several of which are very old, and consider upgrading them. Outdated third-party drivers can often cause serious OS performance issues too.
===========================================================================================================================================
========
3RD-PARTY RUNNING DRIVERS
===========================================================================================================================================
========
File Name Manufacturer Version Date Size
-------------------------------------------------------------------------------------------------------------------------------------------
--------
Cbcomms.sys Carbon Black, Inc. 6.2.1.15466 02/07/2019 364 KB (372,936 bytes)
Cbfsprocess2017.sys Callback Technologies, Inc. 2017.0.3.0 01/21/2019 56 KB (57,448 bytes)
E1r64x64.sys Intel Corporation 12.14.7.0 11/30/2017 520 KB (531,968 bytes)
Etdsmbus.sys ELAN Microelectronic Corp. 15.6.4.1 08/29/2016 32 KB (32,344 bytes)
Fsrs64.sys F-Secure Corporation 1.0.133.63 05/24/2021 164 KB (167,752 bytes)
Hmpalert.sys SurfRight B.V. 3.8.1.496 12/14/2020 681 KB (697,712 bytes)
Hpcisss3.sys Hewlett-Packard Company 62.6.0.64 07/15/2014 172 KB (176,528 bytes)
Hppsa.sys Hewlett-Packard Corporation 6.0.0.64 06/13/2014 33 KB (33,680 bytes)
Hpqilo3chif.sys Hewlett-Packard Company 3.10.0.0 11/23/2013 43 KB (43,920 bytes)
Hpqilo3core.sys Hewlett-Packard Company 3.9.0.0 05/22/2013 46 KB (47,384 bytes)
Hpqilo3whea.sys Hewlett-Packard Company 3.0.0.0 02/12/2010 18 KB (18,472 bytes)
Hpsa3.sys Hewlett-Packard Company 62.0.0.64 06/13/2014 168 KB (171,920 bytes)
Iqvw64e.sys Intel Corporation 1.3.0.6 06/23/2014 33 KB (33,640 bytes)
Mxg2hdo64.sys Matrox Graphics Inc. 9.15.1.102 08/07/2013 620 KB (634,632 bytes)
Pdfsd.sys Veritas Technologies LLC 42.3.2348.0 04/23/2021 139 KB (142,616 bytes)
Savonaccess.sys Sophos Limited 3.26.0.0 06/10/2020 211 KB (216,280 bytes)
Sntp.sys Sophos Limited 1.10.1245.0 11/25/2020 232 KB (237,520 bytes)
Sophosed.sys Sophos Limited 2.2.6.731 12/16/2020 1,219 KB (1,247,832 bytes)
Swi_callout.sys Sophos Limited 3.6.0.1 04/14/2019 47 KB (47,760 bytes)
Virtfile.sys Veritas Technologies LLC 2.8.2612.0 12/22/2020 374 KB (383,304 bytes)
Vstor2-x64.sys VMware, Inc. 6.1.2.439 09/07/2020 51 KB (52,576 bytes)
===================
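To shortlist the oldest drivers from a listing like the one above, the date column can be parsed and compared against a reference date. This is a rough sketch under stated assumptions: dates are MM/DD/YYYY as in the report, the 5-year threshold is an arbitrary illustration rather than a support recommendation, and the function names are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional

DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")  # MM/DD/YYYY


def driver_age_years(row: str, as_of: date) -> Optional[float]:
    """Return the driver age in years, or None if the row has no date."""
    match = DATE_RE.search(row)
    if match is None:
        return None
    month, day, year = (int(part) for part in match.groups())
    return (as_of - date(year, month, day)).days / 365.25


def old_drivers(rows: List[str], as_of: date, min_age_years: float = 5.0) -> List[str]:
    """Return the file names of drivers at least min_age_years old."""
    flagged = []
    for row in rows:
        age = driver_age_years(row, as_of)
        if age is not None and age >= min_age_years:
            flagged.append(row.split()[0])  # first column is the file name
    return flagged


rows = [
    "E1r64x64.sys Intel Corporation 12.14.7.0 11/30/2017 520 KB (531,968 bytes)",
    "Fsrs64.sys F-Secure Corporation 1.0.133.63 05/24/2021 164 KB (167,752 bytes)",
    "Hpqilo3whea.sys Hewlett-Packard Company 3.0.0.0 02/12/2010 18 KB (18,472 bytes)",
]
print(old_drivers(rows, as_of=date(2021, 8, 10)))  # ['Hpqilo3whea.sys']
```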
1. Consider implementing the above-mentioned preventive action plans to prevent any performance issues.
2. Create the NMI registry key so that an NMI signal sent from the HP iLO console triggers a memory dump (steps mentioned below).
3. Configure the server for a complete memory dump (steps attached). Restart the server for the changes to take effect.
4. Start the remote PerfMon log from a remote server, leave it running, and wait for the issue to occur (steps mentioned below).
5. Once the issue occurs, wait for about 5-10 minutes, then stop the remote PerfMon from the remote server and obtain the PerfMon logs. Make sure to perform this step before the reboot.
6. Send an NMI signal from the iLO console to crash the server remotely and obtain the Memory.dmp file.
7. The server can then be rebooted.
8. Upload the data to the workspace for analysis.
File Transfer - Case 2108100040001486
Instructions
HKLM\System\CurrentControlSet\Control\CrashControl
Value Name: NMICrashDump
Value Type: REG_DWORD
Value Data: 1
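If the value is to be set from a script rather than through Registry Editor, one possible approach is to build the equivalent reg add command line. The sketch below only assembles and prints the command; the helper name is hypothetical. Actually running it would need an elevated prompt on the affected server, which is assumed here and not shown.

```python
from typing import List


def nmi_crashdump_reg_command() -> List[str]:
    """Build a 'reg add' command that sets NMICrashDump (REG_DWORD) to 1."""
    return [
        "reg", "add",
        r"HKLM\System\CurrentControlSet\Control\CrashControl",
        "/v", "NMICrashDump",  # value name
        "/t", "REG_DWORD",     # value type
        "/d", "1",             # value data
        "/f",                  # overwrite without prompting
    ]


print(" ".join(nmi_crashdump_reg_command()))
```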
Steps to configure a binary circular remote Perfmon log using the logman method:
>> To set up the remote Perfmon, run the following from an administrative command prompt on a remote server:
Note: Please make sure to perform these steps from a remote server located on the same network.
* Replace DOMAIN\USERNAME with your domain name and an account that has admin privileges on the problem machine.
* Replace SERVERNAME with the NetBIOS name of the problem machine.
* You will be asked for the password as soon as you enter the command. The window will not show any characters while you type the password; keep typing and press Enter once done.
Short Interval
=====================================================================================
Logman create counter remote_perfmon-Short -u DOMAIN\USERNAME * -f bincirc -v mmddhhmm -max 1024 -c "\\SERVERNAME\LogicalDisk(*)\*" "\\SERVERNAME\Memory\*" "\\SERVERNAME\Network Interface(*)\*" "\\SERVERNAME\Paging File(*)\*" "\\SERVERNAME\PhysicalDisk(*)\*" "\\SERVERNAME\Process(*)\*" "\\SERVERNAME\Redirector\*" "\\SERVERNAME\Server\*" "\\SERVERNAME\System\*" "\\SERVERNAME\Terminal Services\*" "\\SERVERNAME\Processor(*)\*" "\\SERVERNAME\Cache\*" "\\SERVERNAME\NTDS(*)\*" "\\SERVERNAME\Database(lsass)\*" "\\SERVERNAME\Netlogon(*)\*" "\\SERVERNAME\Processor Information(*)\*" "\\SERVERNAME\Server Work Queues(*)\*" -si 00:00:01
=====================================================================================
Long Interval
=====================================================================================
Logman create counter remote_perfmon-Long -u DOMAIN\USERNAME * -f bincirc -v mmddhhmm -max 1024 -c "\\SERVERNAME\LogicalDisk(*)\*" "\\SERVERNAME\Memory\*" "\\SERVERNAME\Network Interface(*)\*" "\\SERVERNAME\Paging File(*)\*" "\\SERVERNAME\PhysicalDisk(*)\*" "\\SERVERNAME\Process(*)\*" "\\SERVERNAME\Redirector\*" "\\SERVERNAME\Server\*" "\\SERVERNAME\System\*" "\\SERVERNAME\Terminal Services\*" "\\SERVERNAME\Processor(*)\*" "\\SERVERNAME\Cache\*" "\\SERVERNAME\NTDS(*)\*" "\\SERVERNAME\Database(lsass)\*" "\\SERVERNAME\Netlogon(*)\*" "\\SERVERNAME\Processor Information(*)\*" "\\SERVERNAME\Server Work Queues(*)\*" -si 00:01:00
=====================================================================================
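Because the two commands differ only in the collector name and the sample interval, it may be easier to generate them than to edit SERVERNAME by hand in every counter path. The sketch below is a hypothetical helper that rebuilds the same command text from the counter list above. It uses only the switches that already appear in the commands and does not run logman itself.

```python
COUNTER_OBJECTS = [
    "LogicalDisk(*)", "Memory", "Network Interface(*)", "Paging File(*)",
    "PhysicalDisk(*)", "Process(*)", "Redirector", "Server", "System",
    "Terminal Services", "Processor(*)", "Cache", "NTDS(*)", "Database(lsass)",
    "Netlogon(*)", "Processor Information(*)", "Server Work Queues(*)",
]


def build_logman_command(name: str, server: str, user: str, interval: str) -> str:
    """Return the 'Logman create counter' command line as a single string."""
    # Each counter path has the form "\\SERVER\Object\*", quoted because of spaces.
    counters = " ".join(
        r'"\\{}\{}\*"'.format(server, obj) for obj in COUNTER_OBJECTS
    )
    return (
        "Logman create counter {} -u {} * -f bincirc -v mmddhhmm -max 1024 "
        "-c {} -si {}".format(name, user, counters, interval)
    )


short_cmd = build_logman_command(
    "remote_perfmon-Short", "SERVERNAME", r"DOMAIN\USERNAME", "00:00:01"
)
long_cmd = build_logman_command(
    "remote_perfmon-Long", "SERVERNAME", r"DOMAIN\USERNAME", "00:01:00"
)
print(short_cmd)
print(long_cmd)
```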
You can then start the logs with the following commands (use stop in place of start to stop them):
Logman.exe start remote_perfmon-Short
Logman.exe start remote_perfmon-Long
>> Here, we are going to capture a binary circular performance monitor log that will be constantly running and has a maximum size of 1024 MB (1 GB). The reason we are
going to use a binary circular log is that we can record the data continuously to the same log file, overwriting previous record with new data once the log file reaches its
maximum size.
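The overwrite behaviour of a circular log can be illustrated with a small sketch. This is a simplification: the real bincirc log is bounded by file size (1024 MB here), whereas the hypothetical CircularLog class below is bounded by record count.

```python
from collections import deque


class CircularLog:
    """Keeps only the most recent max_records samples, dropping the oldest."""

    def __init__(self, max_records: int) -> None:
        self._records = deque(maxlen=max_records)

    def write(self, sample: str) -> None:
        # When the deque is full, appending silently discards the oldest entry.
        self._records.append(sample)

    def contents(self) -> list:
        return list(self._records)


log = CircularLog(max_records=3)
for sample in ["t1", "t2", "t3", "t4", "t5"]:
    log.write(sample)
print(log.contents())  # ['t3', 't4', 't5']
```

This is why the log should be stopped soon after the hang is confirmed: left running, newer samples eventually overwrite the data from the time of the issue.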
>> The next time the hang occurs, confirm that the issue is happening at that moment (wait for about 5-10 minutes), and then stop the logs from the remote server using the commands below:
Logman.exe stop remote_perfmon-Short
Logman.exe stop remote_perfmon-Long
Notes:
The Perfmon log will be located here: C:\PERFLOGS (on the remote server)
If you reboot the server while the log is being captured, you will need to start the logs again as they will not automatically restart on boot.
----------------------------------------------
Nagabhushan M V
Performance Support Engineer
Windows Reliability | Enterprise Platforms
Office: +1 855-425-1603 Ext 82335
Working Hours: 5:30 AM – 2:30 PM IST (Sunday – Thursday)
Email: nmv@microsoftsupport.com
Hi Nagabhushan,
Please see my comments inline. MS Diagnostic logs have been uploaded to portal.
Thank You.
Regards,
Ashutosh
Hello Ashutosh,
Hope you have been well. My name is Nagabhushan MV. I am the Support Engineer on the Windows Performance Team who will be working with you further on this Service Request, SR number 2108100040001486.
Having reviewed the case notes and recent emails, I understand that the server MUMSQLP01110 became unresponsive on 10/8/2021 at 10:00 am, showing a black screen, and that a hard reboot of the server recovered it from the issue.
For a quick start before setting up a meeting to discuss the issue, could you please take a few minutes to provide the following additional information to the best of your knowledge? It will help us better understand and isolate the issue. We shall schedule a remote session to work further on the issue once we receive the following details:
4. What is the primary role of the server? File Server for End Users
5. What is the Antivirus software installed on the server? Sophos Endpoint Agent
6. How many Servers/ Machines are affected with the issue ? Only One
7. Were there any software or hardware changes made to the system prior to the problem surfacing? No
8. What is the Current Status of the machine? After hard reboot host is up & running since then
9. Does the reboot temporarily fix the issue? yes
10. When it hangs/ shows black screen, what happens if you login via Console..? Black Screen
11. How frequent is the issue? Intermittent
12. Does RDP work? No
13. Is the server pingable at the time of black screen/ Hung issue? Yes
14. Does Ctrl+Alt+Delete work at console screen? NO
15. Are you able to launch Task Manager from the console? NO
16. Does the mouse & keyboard respond? Yes
17. Are you able to connect to file shares (if available) or default shares on the machine at the time of issue? No
18. Does Task Manager show a particular process taking up CPU/ Memory?
19. What is the business impact of this issue? Around 400 business users unable to access the shared folder
20. What are your working hours, and are you comfortable with my work hours being 5:30 AM – 2:30 PM IST Sunday – Thursday? Mon - Fri (9:30 AM - 5:30 PM IST)
Additionally, please also obtain MS Diagnostics results from the affected server to have an overview of the Server environment and driver related details.
MS Diagnostic Logs:
If the device experiencing the issue is Windows 7 or newer, or Windows Server 2008 R2 or newer:
-------------------------------
Nagabhushan M V
Performance Support Engineer
Windows Reliability | Enterprise Platforms
Office: +1 855-425-1603 Ext 82335
Working Hours: 5:30 AM – 2:30 PM IST (Monday – Friday)
Email: nmv@microsoftsupport.com
Issue:
A few days ago, the Customer's internal support team was unable to RDP to one physical Windows Server 2012 R2 server, and end users were unable to access file shares on it, although all of them are within the same network.
Scope:
We will try to provide possible causes for this issue. As the incident occurred in the past, it is difficult to find the root cause without relevant data.
The only data we may have are the Event Viewer logs and the MSDT report, which at times are inconclusive.
Summary:
2. End Users were unable to access the file shares on the same Server
3. End Users, Support Team and the Server are part of the same Network
4. The issue occurred on 10/8/2021 at 10:00 am
5. When checked, the Windows server was frozen and the screen was black
6. Customer had to hard reboot the server at 11:30 am on the same day, after which the issue was resolved.
7. The Customer was able to ping the server at the time of the issue
8. It is not clear whether the ports were listening at the time of the issue because the server could not be accessed
9. Informed the Customer that the issue will be handled by the Performance Team
Next Actions:
The case will be moved to the Performance Team, since issues where the system is frozen are handled by the Performance Team.
Hello Ashutosh,
Hello Kaleem,
Thank You.
Regards,
Ashutosh
"Kaleem Mansoor Shah" ---11-08-2021 01:53:22 PM---Hello Ashutosh, Since , I am on a different call right now and I won’t be available at 02:00 PM,
Hello Ashutosh,
Since I am on a different call right now and won't be available at 02:00 PM, please let me know if we can reschedule the call to 02:30 PM.
Thanks and Regards,
Hello Ashutosh,
Kindly find the meeting invite for 2:00 PM IST today below,
________________________________________________________________________________
Microsoft Teams meeting
Join on your computer or mobile app
Click here to join the meeting
Or call in (audio only)
+1 509-495-1059,,960305351# United States, Spokane
Phone Conference ID: 960 305 351#
Hello Kaleem,
Regards,
Ashutosh
"Kaleem Mansoor Shah" ---11-08-2021 08:41:25 AM---Hello Ashutosh, Thank you for contacting Microsoft.
Hello Ashutosh,
Thank you for contacting Microsoft. My name is Kaleem, and I am the Windows Networking support professional assigned to work with you on SR # 2108100040001486.
This email is to notify you that I have taken ownership of this case. You can always reach out to me using the contact information provided in my signature.
I request you to help me understand the answers to the queries I have mentioned in the case summary below:
Queries:
1. Can you please describe the problem? >> Able to ping machine but unable to access hosted application, shared folder & RDP session
2. When did the issue occur, and do you see any error message? >> RDP session window stuck at "Configuring Remote Session", shared folder access error: Windows cannot access \\MUMSQLP01110
3. How many machines are affected? >> Only One
4. Are the machines Physical or Virtual? >> Physical
5. Were there any changes made to the environment prior to this issue? >> No changes
6. What is the frequency of the issue? Is it intermittent or persistent? >> Intermittent
7. Are the source and destination machines on the same network or different networks? >> Source & destination machines connected on same network
8. How are you trying to access the shares? Run or CMD or Windows Explorer? >> Run & Network mapped drive
9. Are the source and the destination in the same domain? >> Same domain
10. Are the RDP and file share issue occurring on same machine or different machines? >> Same machine
11. What is the UNC path being used? Examples are listed below. >> a & e
a. \\server\share
b. \\contoso\Memes
c. \\IPaddress\sharename
d. \\DomainName\sharename
e. \\ServerFQDN\sharename
Kindly suggest a time slot which would be convenient for you to work on this issue and I will share the meeting invite, accordingly.
My working hours are 7.30am - 4.30 pm IST.
[attachment "Complete Memory Dump.docx" deleted by Ashutosh N Kumar/India/IBM]