Monday, August 18, 2014

PDC_WATCHDOG_TIMEOUT (14f) debugging

I received a PDC_WATCHDOG_TIMEOUT (14f) crash dump, although I seemed to have misplaced the source. What a shame! Anyway, this bug check is pretty mysterious. There's very little to no documentation on it, and not too many have had it occur on their systems to show up in any sort of web search. With that said, let's do our best to get some info on it!

 PDC_WATCHDOG_TIMEOUT (14f)  
 A system component failed to respond within the allocated time period,  
 preventing the system from exiting connected standby.  
 Arguments:  
 Arg1: 0000000000000002, Client ID of the hung component.  
 Arg2: 0000000000000002, A resiliency client failed to respond.  
 Arg3: fffff801867d3578, Pointer to the resiliency client (pdc!_PDC_RESILIENCY_CLIENT).  
 Arg4: ffffd001b7f61b30, Pointer to a pdc!PDC_14F_TRIAGE structure.  

We can see right away that the cause of the bug check itself is:
A system component failed to respond within the allocated time period, preventing the system from exiting connected standby.
With that said, now is a good time to discuss connected standby. Connected standby is a low-power state (implemented in Windows 8, also in 8.1) that features extremely low power consumption while maintaining a constant internet connection. Here is how to trigger connected standby:

  • Press the system power button.
  • Close the lid or tablet cover, or close the tablet into an attached dock.
  • Select Sleep from the Power button on the Settings charm

The best comparison is a smartphone's power button. When you press the power button on one of today's smartphones, it will transition to a similar state as opposed to entirely shutting down. This way, when you press the power button again, it will start right back up from where you previously left off.

Now that we know how to trigger the connected standby, how does it wake up and transition to its active state again?

  • Press the system power button.
  • Open the lid on a clamshell form-factor system.
  • Open the tablet if it is connected to a portable dock with a keyboard (similar to a lid in a clamshell system).
  • Generate input on an integrated or attached keyboard, mouse, or touchpad.
  • Press the Windows button that is integrated into the system display.

Other than user-related actions, system components, programs, etc, can wake the system from connected standby as well. For example, if the user has an incoming Skype call, the system will immediately awake and create a 25 second time frame to answer the call. If it is not answered, the call is canceled and the system will go back into connected standby.

System components or devices can also wake the core silicon or SoC from connected standby, even though those events may not turn on the display. Nearly all devices connected to a connected standby system are expected to be capable of waking the SoC from its deepest idle power state.

--------------------

Now that we understand connected standby, we now understand that for some reason a specific system component failed to respond during the set time period, therefore the system remained in connected standby when it should have woken up.

First of all, what kind of device is this given the fact that we likely wouldn't see connected standby on a desktop (or maybe even a laptop)?

 0: kd> !sysinfo machineid  
 Machine ID Information [From Smbios 2.8, DMIVersion 39, Size=1106]  
 BiosMajorRelease = 3  
 BiosMinorRelease = 7  
 FirmwareMajorRelease = 32  
 FirmwareMinorRelease = 0  
 BiosVendor = American Megatrends Inc.  
 BiosVersion = 3.07.0150  
 BiosReleaseDate = 05/15/2014  
 SystemManufacturer = Microsoft Corporation  
 SystemProductName = Surface Pro 3  
 SystemFamily = Surface  
 SystemVersion = 1  
 SystemSKU = Surface_Pro_3  
 BaseBoardManufacturer = Microsoft Corporation  
 BaseBoardProduct = Surface Pro 3  
 BaseBoardVersion = 1  

Ah, it's a Surface tablet! It all makes sense now.

As this was a minidump, our call stack was extremely uninformative:

 0: kd> k  
 Child-SP     RetAddr      Call Site  
 ffffd001`b7f61af8 fffff801`867dcd72 nt!KeBugCheckEx  
 ffffd001`b7f61b00 fffff803`712daadb pdc!PdcpResiliencyWatchdog+0xa6  
 ffffd001`b7f61b50 fffff803`71356794 nt!ExpWorkerThread+0x293  
 ffffd001`b7f61c00 fffff803`713e15c6 nt!PspSystemThreadStartup+0x58  
 ffffd001`b7f61c60 00000000`00000000 nt!KiStartSystemThread+0x16  

We can see we're starting a thread which turns out to be a worker thread, and then we call into pdc!PdcpResiliencyWatchdog+0xa6. This implies we failed to complete the resiliency phase in the allotted time period (however long). Usually when you see resiliency phase issues on a device regarding anything in terms of waking from an inactive state (sleep, hibernate, etc), the first thing to look at is network. For example, the D0 IRP for the required network device may not have completed in time due to a 3rd party conflict, etc.

We can further confirm we're likely dealing with a network issue by taking a look at our bucket_id:

 FAILURE_BUCKET_ID: 0x14F_WCM_pdc!PdcpResiliencyWatchdog  

WCM is the Windows Connection Manager, which enables the creation and configuration of connection manager software.

As this is a Surface Tablet, one can imagine it's likely using WiFi. If a Wi-Fi connection is available, the system will wait for the Wi-Fi device only, regardless of whether a mobile broadband (MBB) connection is available. With this said, I took a look at what loaded modules we had to see if any antivirus was installed, firewall, etc. I was essentially looking for anything that could have accidentally interfered with the network upon wake.

Here's what I found:

 0: kd> lmvm MBAMSwissArmy  
 start       end         module name  
 fffff801`8851d000 fffff801`8853e000  MBAMSwissArmy  (deferred)         
   Image path: \??\C:\windows\system32\drivers\MBAMSwissArmy.sys  
   Image name: MBAMSwissArmy.sys  
   Timestamp:    Thu Mar 20 18:12:35 2014  

Malwarebytes Anti-malware driver, listed and loaded.

I asked the user to uninstall Malwarebytes for temporary troubleshooting purposes, and the crashes no longer occurred. I hope the user also contacted Malwarebytes support to work out any possible issues that need to be patched.

I hope to see more of these bug checks in the future, and hopefully with a kernel-dump next time as well so I can go in-depth!

Thanks for reading!

Wednesday, August 13, 2014

Double Fault

I was recently sent a pretty neat kernel-dump by my good friend Jared. I've always wanted to go into double faults, so let's get started! Thanks, Jared : )

 UNEXPECTED_KERNEL_MODE_TRAP (7f)  
 This means a trap occurred in kernel mode, and it's a trap of a kind  
 that the kernel isn't allowed to have/catch (bound trap) or that  
 is always instant death (double fault).  
 Arguments:  
 Arg1: 0000000000000008, EXCEPTION_DOUBLE_FAULT  
 Arg2: 0000000080050033  
 Arg3: 00000000000406f8  
 Arg4: fffff800032aa875  

In our case, the 1st argument was 8, therefore this indicates a double fault occurred. So, what is a double fault, and when/why does one occur?

Double faults occur when an exception cannot be handled by the handler, or when an exception occurs when the CPU is already trying to call an exception handler for a previously thrown exception. In most cases, two exceptions that were thrown at the exact same time are handled separately, however in some cases, you may have a situation occur in which a pagefault occurs, but the exception handler is located in a not-present page, two page faults would occur and neither of them can be handled. This is known as a double fault! Also, double faults can occur (like in this scenario) when the processor cannot properly service an interrupt that is pending.

 4: kd> k  
 Child-SP     RetAddr      Call Site  
 fffff880`009b9de8 fffff800`0328b169 nt!KeBugCheckEx  
 fffff880`009b9df0 fffff800`03289632 nt!KiBugCheckDispatch+0x69  
 fffff880`009b9f30 fffff800`032aa875 nt!KiDoubleFaultAbort+0xb2  <- Uh oh, double fault!
 fffff880`03dccfd0 fffff800`032909ba nt!KiIpiSendRequest+0x305  <- As it is a multiprocessor job, processor #4 sent an inter-processor interrupt to interrupt another processor saying "Hey, we need to flush the TLB."
 fffff880`03dcd090 fffff800`032ec198 nt!KeFlushMultipleRangeTb+0x22a  <- Flushing translation lookaside buffer, this is a multiprocessor job.
 fffff880`03dcd160 fffff800`033935ea nt! ?? ::FNODOBFM::`string'+0x204ce  
 fffff880`03dcd350 fffff800`03394be7 nt!MiEmptyWorkingSet+0x24a  <- Removing as many pages as possible from the working set.
 fffff880`03dcd400 fffff800`0372f371 nt!MiTrimAllSystemPagableMemory+0x218  <- Unmapping all pageable system memory.
 fffff880`03dcd460 fffff800`0372f4cf nt!MmVerifierTrimMemory+0xf1  
 fffff880`03dcd490 fffff800`0372fc24 nt!ViKeRaiseIrqlSanityChecks+0xcf  <- As verifier is enabled, it's doing a sanity check. A sanity check is essentially verifier saying "Okay, what IRQL are we on and are we supposed to be here?"
 fffff880`03dcd4d0 fffff880`018443f5 nt!VerifierKeAcquireSpinLockRaiseToDpc+0x54  <- IRST resetting IRQL to DISPATCH (2) and then acquiring a lock.
 fffff880`03dcd530 fffff880`018222a2 iaStor+0x253f5  <- Intel Rapid Storage Technology
 fffff880`03dcd560 fffff880`01871489 iaStor+0x32a2  <- Intel Rapid Storage Technology

 4: kd> ub nt!KiIpiSendRequest+0x305  
 nt!KiIpiSendRequest+0x2eb:  
 fffff800`032aa85b 5e       pop   rsi  
 fffff800`032aa85c 5d       pop   rbp  
 fffff800`032aa85d c3       ret  
 fffff800`032aa85e 8bc6      mov   eax,esi  
 fffff800`032aa860 e9e2feffff   jmp   nt!KiIpiSendRequest+0x1d7 (fffff800`032aa747)  
 fffff800`032aa865 0fb70db4892100 movzx  ecx,word ptr [nt!KeActiveProcessors (fffff800`034c3220)]  
 fffff800`032aa86c 0fb705af892100 movzx  eax,word ptr [nt!KeActiveProcessors+0x2 (fffff800`034c3222)]  
 fffff800`032aa873 8bfa      mov   edi,edx  

By unassmembling nt!KiIpiSendRequest+0x305 backwards, it looks like there's a check for active processors, and then the attempt to send the IPI.

 4: kd> !ipi  
 IPI State for Processor 0  
   TargetCount     0 PacketBarrier    0 IpiFrozen   2 [Frozen]  
 IPI State for Processor 1  
   TargetCount     0 PacketBarrier    0 IpiFrozen   2 [Frozen]  
 IPI State for Processor 2  
   TargetCount     0 PacketBarrier    0 IpiFrozen   2 [Frozen]  
 IPI State for Processor 3  
   TargetCount     0 PacketBarrier    0 IpiFrozen   2 [Frozen]  
 IPI State for Processor 4  
   TargetCount     0 PacketBarrier    0 IpiFrozen   0 [Running]  
 IPI State for Processor 5  
   TargetCount     0 PacketBarrier    0 IpiFrozen   2 [Frozen]  
 IPI State for Processor 6  
   TargetCount     0 PacketBarrier    0 IpiFrozen   2 [Frozen]  
 IPI State for Processor 7  
   TargetCount     0 PacketBarrier    0 IpiFrozen   2 [Frozen]  

By running !ipi we can check the inter-processor interrupt state for every processor on the box. We can see here that every single processor (except #4) is in a frozen state (idle), therefore obviously our IPI is never going to be serviced, will remain pending, and we're going to double fault.

 4: kd> lmvm iaStor  
 start       end         module name  
 fffff880`0181f000 fffff880`01bc3000  iaStor   (no symbols)        
   Loaded symbol image file: iaStor.sys  
   Image path: \SystemRoot\system32\DRIVERS\iaStor.sys  
   Image name: iaStor.sys  
   Timestamp:    Wed Feb 01 19:15:24 2012  

The IRST driver is dated from early 2012, which is likely the problem since it is a notoriously problematic driver, and it gets worse as it gets older. The newer update would likely solve it, but honestly, I always usually recommend a user safely removes and replaces this driver with the standard MSFT driver if they aren't running a RAID setup. Kaspersky was also present on this system, and antivirus suites don't tend to play nice with this software either.

This post also shows how helpful Driver Verifier is, and how without it in this specific scenario, we likely would have had no idea what was causing this, and may interpret it as a hardware problem.

Thanks for reading!

Saturday, August 9, 2014

MEMORY_CORRUPTION_STRIDE

You know when you have something you really want to write a blog post about, but you don't have a crash dump for it? Debugger problems. Fortunately enough for me, I searched Google for a live crash dump link and found one. Happy days! Thanks to this person from four or so years ago for their crash dump :~)

Let's take a look at our basic bug check information in this case. This time, let's use code boxes. I never use code boxes on my blog, but now it's time!

 SYSTEM_SERVICE_EXCEPTION (3b)  
 An exception happened while executing a system service routine.  
 Arguments:  
 Arg1: 00000000c0000005, Exception code that caused the bugcheck  
 Arg2: fffff80002cc272d, Address of the instruction which caused the bugcheck  
 Arg3: fffff8800a555070, Address of the context record for the exception that caused the bugcheck  
 Arg4: 0000000000000000, zero.  

As with most 0x3B's, our exception was specifically an access violation.

 2: kd> ln fffff80002cc272d  
 (fffff800`02cc2590)  nt!KiDpcInterrupt+0x19d  | (fffff800`02cc2780)  nt!KiDpcInterruptBypass  

The violation in this case specifically occurred in nt!KiDpcInterrupt+0x19d.

 2: kd> .cxr 0xfffff8800a555070;r  
 rax=0000000000000001 rbx=fffffa8006b24b60 rcx=0000000000000000  
 rdx=000001af00000000 rsi=0000000000000000 rdi=0000000000000003  
 rip=fffff80002cc272d rsp=fffff8800a555a50 rbp=fffff8800a555ad0  
  r8=0000000000000000 r9=0000000000000001 r10=0000000000000000  
 r11=0000000000000064 r12=0000000000000000 r13=0000000000000000  
 r14=0000000000000064 r15=000007ff00042020  
 iopl=0     nv up di pl zr na po nc  
 cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b       efl=00010046  
 nt!KiDpcInterrupt+0x19d:  
 fffff800`02cc272d 0fae1f     stmxcsr dword ptr [rdi] ds:002b:00000000`00000003=????????  

On the instruction we faulted on, we failed storing the contents of the MXCSR register within rdi (0000000000000003). We can certainly imagine 00000000`00000003 is completely invalid, so therein lies our problem.

So, why are we hitting a pagefault within a DPC interrupt? Good question! Let's run the following:

 !chkimg -lo 50 -db -v !nt  

!chkimg compares an image with its original copy. More specifically, it does this by comparing the image of an executable file in memory to the copy of the file that resides on a symbol store.

The -lo 50 parameter limits the number of output lines to 50. Not too much and not too little.

The -db parameter displays mismatched areas in a format that is similar to the db debugger command. Therefore, each display line shows the address of the first byte in the line, followed by up to 16 hexadecimal byte values. The byte values are immediately followed by the corresponding ASCII values. All nonprintable characters, such as carriage returns and line feeds, are displayed as periods (.). The mismatched bytes are marked by an asterisk (*).

The -v parameter displays extra verbose information.

!nt is the module, which is of course the kernel.

 2: kd> !chkimg -lo 50 -db -v !nt  
 Searching for module with expression: !nt  
 Will apply relocation fixups to file used for comparison  
 Will ignore NOP/LOCK errors  
 Will ignore patched instructions  
 Image specific ignores will be applied  
 Comparison image path: c:\localsymbols\ntkrnlmp.exe\4A5BC6005dd000\ntkrnlmp.exe  
 No range specified  

Above we can see that as I noted above, it's comparing the kernel image from the crash dump to the latest symbol stored on my local symbol cache. If it wasn't available locally, it'd grab it from the symbol server.

 Scanning section:  .text  
 Size: 1685025  
 Range to scan: fffff80002c06000-fffff80002da1621  
 Total bytes compared: 1685025(100%)  
 Number of errors: 40  

So we have 40 errors specifically in the .text section of the kernel that was scanned.

 fffff80002cc2680 19 b9 01 00 00 00 44 *44 22 c1 fb e8 80 17 f9 *48 ......DD"......H  
 fffff80002cc2690 fa b9 00 00 00 00 44 *45 22 c1 65 48 8b 0c 25 *34 ......DE".eH..%4  
 fffff80002cc26a0 01 00 00 f7 01 00 00 *25 40 74 25 f6 41 02 02 *85 .......%@t%.A...  
 fffff80002cc26b0 0e e8 8a 68 05 00 65 *8b 8b 0c 25 88 01 00 00 *48 ...h..e...%....H  
 ...  
 fffff80002cc2700 8b 55 d8 4c 8b 4d d0 *ba 8b 45 c8 48 8b 55 c0 *00 .U.L.M...E.H.U..  
 fffff80002cc2710 8b 4d b8 48 8b 45 b0 *07 8b e5 48 8b ad d8 00 *89 .M.H.E....H.....  
 fffff80002cc2720 00 48 81 c4 e8 00 00 *ff 0f 01 f8 48 cf 0f ae *1f .H.........H....  
 fffff80002cc2730 ac 0f 28 45 f0 0f 28 *4c 00 0f 28 55 10 0f 28 *40 ..(E..(L..(U..(@  
 ...  
 fffff80002cc2880 24 10 48 89 74 24 18 *38 89 64 24 20 48 8b f9 *00 $.H.t$.8.d$ H...  
 fffff80002cc2890 8b d1 49 8b f0 4c 8b *15 49 83 e9 11 48 83 ea *01 ..I..L..I...H...  
 fffff80002cc28a0 4c 8b da 48 8b ef bb *8b 00 00 00 49 3b f1 0f *48 L..H.......I;..H  
 fffff80002cc28b0 c1 05 00 00 49 3b fb *48 83 b8 05 00 00 8a 06 *e8 ....I;.H........  
 ...  
 fffff80002cc2900 a8 20 0f 85 e6 03 00 *90 8a 56 06 88 57 05 a8 *44 . .......V..W..D  
 fffff80002cc2910 0f 85 69 04 00 00 8a *41 07 88 57 06 a8 80 0f *ec ..i....A..W.....  
 fffff80002cc2920 db 04 00 00 8a 56 08 *f9 57 07 48 83 c6 09 48 *ba .....V..W.H...H.  
 fffff80002cc2930 c7 08 e9 74 ff ff ff *24 3b fd 0f 87 b8 00 00 *05 ...t...$;.......  
 ...  
 fffff80002cc2980 f3 a4 49 8b f4 48 83 *8b 01 a8 02 0f 85 81 00 *83 ..I..H..........  
 fffff80002cc2990 00 8a 56 02 88 57 01 *49 04 0f 85 40 01 00 00 *f2 ..V..W.I...@....  
 fffff80002cc29a0 56 03 88 57 02 a8 08 *3a 85 f1 01 00 00 8a 56 *48 V..W...:......VH  
 fffff80002cc29b0 88 57 03 a8 10 0f 85 *00 02 00 00 8a 56 05 88 *8b .W..........V...  

Assuming I am correct (which I hopefully am), every 8th and 16th bit of each byte are no good (as if it's striding through the data). This is known as a stride corruption pattern.

 MEMORY_CORRUPTOR: STRIDE  

It's a characteristic of address line issues that occur somewhere between going in/out of RAM. Despite the display evidence thus far, we cannot jump to a faulty RAM conclusion as much as we'd like to. Perhaps we'd like to assume that the selector which controls these lines is faulty, so any byte stored in these lines is going to have invalid 8th and 16th bits. This would mean faulty RAM, however, in debugging we must always be sure to check everything before doing something such as outright replacing the RAM, even though we could defend and say that a Memtest would be just fine as well.

A similar memory corruption pattern is misaligned IP (instruction pointer). Not going into that in this blog post, but it's also another one you need to be 100% sure is not a simple buffer overflow as opposed to faulty RAM. Do note that WinDbg is not smarter than we are and assumes that a misaligned IP is a hardware problem.

Enough blabbering, onto what I am trying to get to...

 PROCESS_NAME: MOM.exe  

MOM.exe, what are you doing here? By the way, MOM.exe is AMD/ATI's Catalyst Control Center (CCC) monitoring software. It's not malware, or an actual mother.

</badjoke>

You normally don't see this process involved with a crash too often, and with this said, I did some digging in the modules list to see if any 3rd party software may have caused conflicts.

 2: kd> lmvm rtcore64  
 start       end         module name  
 fffff880`0859b000 fffff880`085a1000  RTCore64  (deferred)         
   Image path: \??\C:\Program Files (x86)\MSI Afterburner\RTCore64.sys  
   Image name: RTCore64.sys  
   Timestamp:    Wed May 25 02:39:12 2005  

Oh my... MSI AB driver from 2005 on an x64 Windows 7 box! The horror.

So, today's lesson summed up is - If you're going to actually use MSI Afterburner (the horror), be sure to keep it up to date so you don't upset mother <\badjoke> and make her crash by causing stride corruption.

Thanks for reading!