WAIT CHAIN (GENERAL)
The Wait Chain pattern is simply a sequence of causal relations between events:
thread A is waiting for an event E that threads B, C or D are supposed
to signal at some time in the future, but they are all waiting for an event F
that thread G will signal as soon as it finishes processing some critical task.
This subsumes various deadlock patterns too, which are causal loops where
thread A is waiting for an event AB that thread B will signal as soon as thread A
signals an event BA that thread B is waiting for:
[Diagram: Thread A and Thread B waiting on each other]
In this context “Event” means any type of synchronization object, critical section,
LPC/RPC reply or data arrival through some IPC channel, and not only a Win32 event
object or kernel _KEVENT.
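The causal relations described above can be modeled as a small wait-for graph. The following Python sketch is only an illustration: it assumes each thread waits for at most one event and each event has a single known signaler (the text allows several, such as B, C or D). The names follow_wait_chain, waits_for and signaled_by are made up for this example.

```python
from typing import Dict, List, Tuple


def follow_wait_chain(
    start: str,
    waits_for: Dict[str, str],
    signaled_by: Dict[str, str],
) -> Tuple[List[str], bool]:
    """Follow thread -> event -> thread links beginning at `start`.

    Returns (chain, is_deadlock). The chain alternates thread and event
    names. is_deadlock is True when the chain comes back to a thread
    that was already visited, i.e. a causal loop.
    """
    chain: List[str] = [start]
    seen = {start}
    thread = start
    while thread in waits_for:
        event = waits_for[thread]
        chain.append(event)
        signaler = signaled_by.get(event)
        if signaler is None:
            # Nobody is known to signal this event; the chain ends here.
            return chain, False
        chain.append(signaler)
        if signaler in seen:
            return chain, True
        seen.add(signaler)
        thread = signaler
    return chain, False


# Simple chain: A waits for E (signaled by B), B waits for F (signaled by G).
chain, deadlock = follow_wait_chain(
    "A", {"A": "E", "B": "F"}, {"E": "B", "F": "G"}
)

# Causal loop: A waits for AB (signaled by B), B waits for BA (signaled by A).
loop, loop_deadlock = follow_wait_chain(
    "A", {"A": "AB", "B": "BA"}, {"AB": "B", "BA": "A"}
)
```

In the first call the chain ends at thread G, which is not waiting for anything; in the second the chain returns to thread A and is reported as a deadlock.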
As the first example of the Wait Chain pattern, I show a process being terminated and
waiting for another thread to finish. In other words, considering thread termination as
an event itself, the main process thread is waiting for the second thread object to be
signaled. The second thread tries to cancel a previous I/O request directed to some
device. However, that IRP is not cancellable and the process hangs. This can be depicted in the
following diagram:
[Diagram: Thread A waits for Thread B (Event A), which waits for Event B]
where Thread A is our main thread waiting for Event A, which is Thread B itself,
waiting for I/O cancellation (Event B). Their stack traces are:
THREAD 8a3178d0 Cid 04bc.01cc Teb: 7ffdf000 Win32Thread: bc1b6e70 WAIT:
(Unknown) KernelMode Non-Alertable
8af2c920 Thread
Not impersonating
DeviceMap e1032530
Owning Process 89ff8d88 Image: processA.exe
Wait Start TickCount 80444 Ticks: 873 (0:00:00:13.640)
Context Switch Count 122 LargeStack
UserTime 00:00:00.015
KernelTime 00:00:00.156
Win32 Start Address 0x010148a4
Start Address 0x77e617f8
Stack Init f3f29000 Current f3f28be8 Base f3f29000 Limit f3f25000 Call 0
Priority 15 BasePriority 13 PriorityDecrement 0
ChildEBP RetAddr
f3f28c00 80833465 nt!KiSwapContext+0x26
f3f28c2c 80829a62 nt!KiSwapThread+0x2e5
f3f28c74 8094c0ea nt!KeWaitForSingleObject+0x346 ; the stack trace with arguments shows the first parameter as 8af2c920
f3f28d0c 8094c63f nt!PspExitThread+0x1f0
f3f28d24 8094c839 nt!PspTerminateThreadByPointer+0x4b
f3f28d54 8088978c nt!NtTerminateProcess+0x125
f3f28d54 7c8285ec nt!KiFastCallEntry+0xfc
THREAD 8af2c920 Cid 04bc.079c Teb: 7ffd7000 Win32Thread: 00000000 WAIT:
(Unknown) KernelMode Non-Alertable
Owning Process 89ff8d88 Image: processA.exe
Wait Start TickCount 81312 Ticks: 5 (0:00:00:00.078)
Context Switch Count 169 LargeStack
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x77da3ea5
Start Address 0x77e617ec
Stack Init f3e09000 Current f3e08bac Base f3e09000 Limit f3e05000 Call 0
Priority 13 BasePriority 13 PriorityDecrement 0
f3e08d4c 7c8285ec nt!KiServiceExit+0x56
By inspecting the IRP we can see the device it was directed to, and see that it has the cancel
bit set but doesn’t have a cancel routine:
0: kd> !irp 8ad26260 1
Irp is active with 5 stacks 4 is current (= 0x8ad2633c)
No Mdl: No System Buffer: Thread 8af2c920: Irp stack trace
MANUAL DUMP (PROCESS)
Now I discuss the Manual Dump pattern as seen from process memory dumps. It is
not possible to reliably identify manual dumps here because a debugger or another
process dumper might have been attached to the process noninvasively, leaving no
traces of intervention, so we can only rely on the following information:
Comment field
Loading Dump File [C:\kktools\userdump8.1\x64\notepad.dmp]
User Mini Dump File with Full Memory: Only application data is available
Comment: 'Userdump generated complete user-mode minidump with Standalone
function on COMPUTER-NAME'
Absence of exceptions
Loading Dump File [C:\UserDumps\notepad.dmp]
User Mini Dump File with Full Memory: Only application data is available
Symbol search path is:
srv*c:\mss*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Vista Version 6000 MP (2 procs) Free x64
Product: WinNt, suite: SingleUserTS
Debug session time: Mon Dec 17 16:31:31.000 2007 (GMT+0)
System Uptime: 0 days 0:45:11.148
Process Uptime: 0 days 0:00:36.000
user32!ZwUserGetMessage+0xa:
00000000`76c8e6aa c3 ret
0:000> ~*kL
0 Id: 1b8.ed4 Suspend: 1 Teb: 000007ff`fffdc000 Unfrozen
Child-SP RetAddr Call Site
Wake debugger exception
Loading Dump File [C:\UserDumps\notepad2.dmp]
User Mini Dump File with Full Memory: Only application data is available
Symbol search path is:
srv*c:\mss*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Vista Version 6000 MP (2 procs) Free x64
Product: WinNt, suite: SingleUserTS
Debug session time: Mon Dec 17 16:35:37.000 2007 (GMT+0)
System Uptime: 0 days 0:49:13.806
Process Uptime: 0 days 0:02:54.000
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(314.1b4): Wake debugger - code 80000007 (first/second chance not available)
user32!ZwUserGetMessage+0xa:
00000000`76c8e6aa c3 ret
Break instruction exception
Loading Dump File [C:\UserDumps\notepad3.dmp]
User Mini Dump File with Full Memory: Only application data is available
Symbol search path is:
srv*c:\mss*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Vista Version 6000 MP (2 procs) Free x64
Product: WinNt, suite: SingleUserTS
Debug session time: Mon Dec 17 16:45:15.000 2007 (GMT+0)
System Uptime: 0 days 0:58:52.699
Process Uptime: 0 days 0:14:20.000
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
ntdll!DbgBreakPoint:
00000000`76ecfdf0 cc int 3
0:001> ~*kL
0 Id: 1b8.ed4 Suspend: 1 Teb: 000007ff`fffdc000 Unfrozen
Child-SP RetAddr Call Site
# 1 Id: 1b8.ec4 Suspend: 1 Teb: 000007ff`fffda000 Unfrozen
Child-SP RetAddr Call Site
00000000`030df798 00000000`76f633e8 ntdll!DbgBreakPoint
00000000`030df7a0 00000000`76d7cdcd ntdll!DbgUiRemoteBreakin+0x38
00000000`030df7d0 00000000`76ecc6e1 kernel32!BaseThreadInitThunk+0xd
00000000`030df800 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
The latter might also be an assertion statement in the code leading to a
process crash, as in the following instance of the Dynamic Memory Corruption pattern
(heap corruption, page 257):
09aef0bc 77fb76aa ntdll!DbgBreakPoint
09aef0c4 77fa65c2 ntdll!RtlpBreakPointHeap+0x26
09aef2bc 77fb5367 ntdll!RtlAllocateHeapSlowly+0x212
09aef340 77fa64f6 ntdll!RtlDebugAllocateHeap+0xcb
09aef540 77fcc9e3 ntdll!RtlAllocateHeapSlowly+0x5a
09aef854 786f1ee4 rpcrt4!I_RpcGetBufferWithObject+0x6e
09aef860 786f1ea4 rpcrt4!I_RpcGetBuffer+0xb
09aef86c 78754762 rpcrt4!NdrGetBuffer+0x2b
09aefab8 796d78b5 rpcrt4!NdrClientCall2+0x3f9
09aefac8 796d7821 advapi32!LsarOpenPolicy2+0x14
09aefb1c 796d8b04 advapi32!LsaOpenPolicy+0xaf
09aefb84 796d8aa9 advapi32!LookupAccountSidInternal+0x63
09aefbac 0aaf5d8b advapi32!LookupAccountSidW+0x1f
WARNING: Stack unwind information not available. Following frames may be wrong.
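The indicators listed in this section can be collected mechanically from saved WinDbg output. The Python sketch below is a rough heuristic rather than a reliable detector: the function name manual_dump_hints and the hint strings are invented for illustration, and the text patterns are taken only from the listings shown above. A breakpoint is treated as a manual-dump hint only when ntdll!DbgUiRemoteBreakin is on the stack, because a breakpoint may also come from an assertion such as the heap corruption case.

```python
import re
from typing import List

# Exception code that appears in the "Wake debugger" listing above.
WAKE_DEBUGGER = 0x80000007


def manual_dump_hints(log: str) -> List[str]:
    """Return the manual-dump indicators found in saved WinDbg output."""
    hints: List[str] = []
    # Comment field written by the dumping tool.
    if re.search(r"^Comment:.*userdump", log, re.IGNORECASE | re.MULTILINE):
        hints.append("comment field mentions userdump")
    # No stored exception at all.
    if "exception of interest" not in log:
        hints.append("no stored exception")
    # Wake debugger exception code.
    match = re.search(r"- code ([0-9a-fA-F]{8})", log)
    if match and int(match.group(1), 16) == WAKE_DEBUGGER:
        hints.append("wake debugger exception")
    # Break instruction injected by a debugger attaching to the process.
    if "ntdll!DbgUiRemoteBreakin" in log:
        hints.append("break instruction from remote break-in")
    return hints


wake_log = (
    "This dump file has an exception of interest stored in it\n"
    "(314.1b4): Wake debugger - code 80000007 (first/second chance not available)\n"
)
heap_log = (
    "This dump file has an exception of interest stored in it\n"
    "09aef0bc 77fb76aa ntdll!DbgBreakPoint\n"
    "09aef0c4 77fa65c2 ntdll!RtlpBreakPointHeap+0x26\n"
)
```

An empty result, as for heap_log, does not prove that the dump was not taken manually; it only means that none of these particular traces were found.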
WAIT CHAIN (CRITICAL SECTIONS)
Here is another example of the Wait Chain pattern (page 481) where the objects
are critical sections.
WinDbg can detect them if we use the !analyze -v -hang command, but it detects only
one chain, and not necessarily the longest or widest one when there are multiple wait chains.
Looking at the threads we see this chain, and we also see that the final thread
is blocked waiting for a socket:
ChildEBP RetAddr Args to Child
0fe2a09c 7c942124 71933a09 00000b50 00000001 ntdll!KiFastSystemCallRet
0fe2a0a0 71933a09 00000b50 00000001 0fe2a0c8 ntdll!NtWaitForSingleObject+0xc
0fe2a0dc 7194576e 00000b50 00000234 00000000 mswsock!SockWaitForSingleObject+0x19d
0fe2a154 71a12679 00000234 0fe2a1b4 00000001 mswsock!WSPRecv+0x203
0fe2a190 62985408 00000234 0fe2a1b4 00000001 WS2_32!WSARecv+0x77
0fe2a1d0 6298326b 00000234 0274ebc6 00000810 component!wait+0x338
If we look at all held critical sections, we see another thread that blocks
more than 125 other threads:
0ff2ffec 00000000 77b9b4bc 060cf9a0 00000000 kernel32!BaseThreadStart+0x34
Searching for any thread waiting for critical section 051e4bd8 gives us:
8 Id: 8d8.924 Suspend: 1 Teb: 7ffd5000 Unfrozen
ChildEBP RetAddr Args to Child
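Finding the widest chain by hand means repeating the "who waits for this critical section" search for every owner. The Python sketch below shows one way to automate that bookkeeping once the owner of each critical section and the section each thread waits for have been collected. The function name blocked_by, the input shapes and the sample identifiers (cs1, cs2, thread numbers 1 to 4) are hypothetical and do not come from the dump above.

```python
from collections import defaultdict
from typing import Dict, Set


def blocked_by(owner: Dict[str, int], waiting: Dict[int, str]) -> Dict[int, Set[int]]:
    """Map each owning thread to every thread it blocks, directly or transitively.

    owner:   critical section address -> thread that holds it
    waiting: thread -> critical section address it is waiting for
    """
    direct: Dict[int, Set[int]] = defaultdict(set)
    for waiter, section in waiting.items():
        holder = owner.get(section)
        if holder is not None and holder != waiter:
            direct[holder].add(waiter)

    def collect(thread: int, seen: Set[int]) -> Set[int]:
        # Depth-first walk; `seen` also protects against deadlock loops.
        for waiter in direct.get(thread, ()):
            if waiter not in seen:
                seen.add(waiter)
                collect(waiter, seen)
        return seen

    return {thread: collect(thread, set()) for thread in list(direct)}


# Hypothetical data: thread 1 owns cs1, thread 2 owns cs2 and waits for cs1,
# threads 3 and 4 wait for cs2.
result = blocked_by({"cs1": 1, "cs2": 2}, {2: "cs1", 3: "cs2", 4: "cs2"})
widest = max(result, key=lambda thread: len(result[thread]))
```

Here thread 1 heads the widest chain because it blocks thread 2 directly and threads 3 and 4 through it.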
PART 4: CRASH DUMP ANALYSIS ANTIPATTERNS
ALIEN COMPONENT
In any domain of activity where patterns exist we can find anti-patterns too. They
are bad solutions to recurrent problems in specific contexts. One of them, which I would like
to introduce briefly, is called Alien Component. In essence, when every technique fails or
we run out of WinDbg commands, we look at some innocent component we have never
seen before or don’t have symbols for, be it some driver or hook. Of course, this
component cannot be the component developed by the company we are working for.
ZIPPOCRICY
Let’s define Zippocricy, a common sin in software support environments
worldwide: someone gets something from a customer in an archived form and, without
checking the contents, forwards it to another person in the support chain. By the
time the evidence gets unzipped somewhere, checked and found corrupt or irrelevant,
the customer has suffered not hours but days.
This happens not only with crash dumps but with any type of problem evidence.
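A quick check before forwarding does not require unpacking the whole archive. The following Python sketch uses the standard zipfile module to confirm that an archive opens, that every member passes its CRC check, and to list what is inside. The function name inspect_archive is invented for this illustration, and the sketch assumes the evidence arrives as a .zip file; other archive formats would need other tools.

```python
import zipfile
from typing import List


def inspect_archive(path: str) -> List[str]:
    """Return the member names of a zip archive, or raise ValueError if it is unusable."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a readable zip archive")
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first corrupt name, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member: {bad_member}")
        return archive.namelist()
```

The returned names can then be compared with what was requested, for example whether the archive contains a .dmp file at all.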
WORD OF MOUTH
Many engineers say, “I didn’t know about this debugging command, let’s use it!”
after a training session or after reading other people’s analysis of crash dumps. A
year later we hear the same phrase from them about another debugging command. In
the meantime they continue to use the same set of commands they know about until
they hear about the next old command that is new to them.
This is a manifestation of the Word of Mouth anti-pattern.
General solution: know your tools and study them proactively.
Example solution: periodically read and re-read WinDbg help.
WRONG DUMP
A customer reports that application.exe crashes and we ask for a dump file. We get a
dump, open it and see that the dump is not from our application.exe. We ask for a print
spooler crash dump and we get an mplayer.exe crash dump. I originally thought about
calling it the Wrong Dump pattern and placing it into the patterns category, but after writing
about Zippocricy (page 494) I clearly see it as an anti-pattern. It is not rocket science to
check the process name in a dump file before sending it for analysis:
Load the user process dump in WinDbg.
Type the command .symfix; .reload; !analyze -v and wait
until WinDbg is no longer busy analyzing. Find PROCESS_NAME: in the output. We get something like:
PROCESS_NAME: spoolsv.exe
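If the !analyze -v output is saved to a text file, the process name check can be scripted. The Python sketch below is a minimal illustration. The function names process_name and is_expected_dump are made up here, and the only thing assumed about the output is the PROCESS_NAME: line shown above.

```python
import re
from typing import Optional


def process_name(analyze_output: str) -> Optional[str]:
    """Extract the value of the PROCESS_NAME: line, or None if it is absent."""
    match = re.search(
        r"^PROCESS_NAME:[ \t]*(.+?)[ \t]*$", analyze_output, re.MULTILINE
    )
    return match.group(1).strip() if match else None


def is_expected_dump(analyze_output: str, expected: str) -> bool:
    """Compare the dumped process name with the one we asked for, ignoring case."""
    name = process_name(analyze_output)
    return name is not None and name.lower() == expected.lower()


sample_output = "FAULTING_IP:\nPROCESS_NAME: spoolsv.exe\nERROR_CODE:\n"
```

With the sample text, a request for the print spooler dump passes the check and a request for any other process does not.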
We can also use dumpchk.exe from Debugging Tools for Windows
(http://support.citrix.com/article/CTX108825).
Another example is when we ask for a complete memory dump but get a
kernel dump or various minidumps. Fortunately, the Citrix DumpCheck Explorer extension
can warn users before they submit a dump file.
FOOLED BY DESCRIPTION
From my observation, an engineer with a software development background opens
a crash dump after glancing at the problem description provided by a customer, or
even without reading it first. Only if the problem is not immediately obvious from the
memory dump file will the engineer read the problem description thoroughly. In
contrast, an engineer with a technical support or system administration background will
thoroughly read the problem description first. In the latter case the description might
influence the direction of analysis.
Here is an example. The description says “slow application start” and we have a
memory dump from a process. An engineer with a technical support background will most
likely look for hang patterns inside the dump. An engineer with experience writing
unmanaged applications in C and C++ will open the memory dump and check the
exception stored in it, and if it is a breakpoint, the suspicion might arise that the memory dump
was taken manually because of the hanging process. Based on the analysis the engineer
might even correct the problem description or add questions that clarify the discrepancy
between what is seen in the dump and what users perceive.
NEED THE CRASH DUMP
This might be the first thought when an engineer gets a stack trace fragment
without symbolic information. It is usually based on the following presupposition:
We need an actual dump file to suggest further troubleshooting steps.
This is not actually true unless it is the first time we have the problem and get a
stack trace for it. Consider the following fragment from a bugcheck kernel dump
where no symbols were applied because the customer didn’t have them:
b90529f8 8085eced nt!KeBugCheckEx+0x1b
b9052a70 8088c798 nt!MmAccessFault+0xb25
b9052a70 bfabd940 nt!_KiTrap0E+0xdc
WARNING: Stack unwind information not available. Following frames may be wrong.
b9052b14 bfabe452 MyDriver+0x27940
We can convert module+offset information into module!function+offset2 using
MAP files, or using the DIA SDK (Debug Interface Access SDK) to query PDB files if we know
the module timestamp. This might be seen as a tedious exercise, but we don’t need to do it
if we keep raw stack trace signatures in some database when doing crash dump analysis.
If we use our own symbol servers, we might want to remove references to them and
reload symbols, then redo the previous stack trace commands to capture the raw traces.
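The module+offset conversion amounts to finding the nearest function that starts at or before the offset. The Python sketch below illustrates the lookup with a sorted table such as one might extract from a MAP file. The function name resolve and the table contents (ProcessRequest, ReadBuffer, Cleanup and their start offsets) are entirely hypothetical; they are not the real layout of MyDriver. The sketch also assumes that the nearest preceding symbol is the containing function, which holds only if the table lists every function.

```python
import bisect
from typing import List, Tuple


def resolve(module: str, offset: int, symbols: List[Tuple[int, str]]) -> str:
    """Convert module+offset into module!function+offset2.

    symbols: (start_offset, function_name) pairs sorted by start_offset.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, offset) - 1
    if index < 0:
        # Offset lies before the first known function; keep the raw form.
        return f"{module}+0x{offset:x}"
    start, name = symbols[index]
    return f"{module}!{name}+0x{offset - start:x}"


# Hypothetical symbol table for illustration only.
table = [(0x27000, "ProcessRequest"), (0x27800, "ReadBuffer"), (0x28000, "Cleanup")]
frame = resolve("MyDriver", 0x27940, table)
```

With this made-up table, MyDriver+0x27940 resolves to MyDriver!ReadBuffer+0x140.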
In this case similar bugcheck crash dumps had been analyzed months earlier,
and engineers had saved stack traces prior to applying symbols. This helped to point to the
solution without requesting the crash dump corresponding to that stack trace.
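A raw stack trace signature database can be as simple as a dictionary keyed by the frame names with the addresses stripped. The Python sketch below shows the idea. The function names signature, remember and lookup, the in-memory dictionary, and the note text are assumptions made for illustration; a real system would use persistent storage and probably looser matching.

```python
from typing import Dict, Optional, Tuple

Signature = Tuple[str, ...]


def signature(stack_text: str) -> Signature:
    """Keep only the call-site column of 'ChildEBP RetAddr CallSite' lines."""
    frames = []
    for line in stack_text.splitlines():
        parts = line.split()
        # Skip warnings and wrapped text; keep lines that look like frames.
        if len(parts) == 3 and "+0x" in parts[2]:
            frames.append(parts[2])
    return tuple(frames)


def remember(db: Dict[Signature, str], stack_text: str, note: str) -> None:
    """Store an analysis note under the raw signature of the stack trace."""
    db[signature(stack_text)] = note


def lookup(db: Dict[Signature, str], stack_text: str) -> Optional[str]:
    """Return the note saved for a matching raw stack trace, if any."""
    return db.get(signature(stack_text))


raw_trace = """b90529f8 8085eced nt!KeBugCheckEx+0x1b
b9052a70 8088c798 nt!MmAccessFault+0xb25
b9052a70 bfabd940 nt!_KiTrap0E+0xdc
WARNING: Stack unwind information not available. Following frames may be wrong.
b9052b14 bfabe452 MyDriver+0x27940"""

database: Dict[Signature, str] = {}
remember(database, raw_trace, "hypothetical note: see earlier analysis")
```

Because stack addresses are dropped, a later trace from another machine with different ChildEBP values still matches the saved entry as long as the modules and offsets are the same.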
BE LANGUAGE
This is about excessive use of “is” and was inspired by Alfred Korzybski’s notion of
how “is” affects our understanding of the world. In the context of technical support, the
use of certain verbs sometimes leads to wrong troubleshooting and debugging paths.
For example, consider the following phrase:
It is our pool tag. It is affected by driver A, driver B and driver C.
Surely driver A, driver B and driver C were not developed by the same company
that introduced the problem pool tag (this smells of Alien Component, page 493). Unless
supported by solid evidence, a better phrase would be:
It is our pool tag. It might have been affected by driver A, driver B or driver C.
I’m not advocating completely eradicating “be” verbs, only being conscious of
their use.