Secure Boot failure on ESXi 7.0 U3 - Solved

At the beginning of April, an ice storm hit Quebec and caused power outages lasting several days, particularly in areas with older trees where utility lines are not underground. When I started my ESXi host after 5 days offline, I ran into an error I had never seen before:

The system has found a problem on your machine and cannot continue. UEFI Secure Boot Failed

It seemed like all of the VIBs were now failing UEFI Secure Boot validation.

My first instinct was that, due to the unclean shutdown, data on the boot volume was corrupted and a reinstall would be required. The voice of my VMware TAM was in the back of my mind: I'm running non-HCL hosts with NVMe SSDs that are not on the vSAN HCL, so I should be thankful this works at all. And he's right. And I am.

However, I'm happy to report this had a quick fix: while goofing around in the BIOS to see how I could bypass Secure Boot and load ESXi, I noticed that my system clock had rolled back to 2017! As it turns out, I had been foiled by a $1 CR2032 battery meant to keep time while the board is not connected to power. By setting the clock to the correct date, I was able to boot ESXi normally.

So why did this happen? 

To understand, we need to know a bit about Secure Boot:
Secure boot is a security standard developed by members of the PC industry to help make sure that a device boots using only software that is trusted by the Original Equipment Manufacturer (OEM). When the PC starts, the firmware checks the signature of each piece of boot software, including UEFI firmware drivers (also known as Option ROMs), EFI applications, and the operating system. If the signatures are valid, the PC boots, and the firmware gives control to the operating system.
This means that each component in the boot "chain", from the computer firmware to the OS and drivers, is digitally signed and must validate the signature of the next step before proceeding. In this case, we were able to partially boot, but when it came time to load the VIBs from ESXi 7.0 U3, the signature check failed, causing the error screen shown in the picture.

I haven't inspected the signatures myself, but the most likely explanation is that the start of the certificate's validity period (the "NotBefore" date) was in the future from the perspective of the machine, which believed it was 2017. Correcting the date in the BIOS allowed the certificate validation to pass and ESXi to load. I've since changed the CMOS batteries on all of my homelab devices and hopefully won't have to think about this for another 3-4 years.
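
For anyone who wants to see the "NotBefore" theory in concrete terms, here is a minimal Python sketch of a validity-window check. It only illustrates the general idea and is not how ESXi or the UEFI firmware actually implements it. The function name `check_validity` and every date in it are made up for illustration, since I haven't looked at the real certificates.

```python
from datetime import datetime, timezone


def check_validity(now, not_before, not_after):
    """Classify a clock reading against a certificate validity window."""
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"


# Hypothetical validity window, NOT taken from any real VMware certificate.
NOT_BEFORE = datetime(2019, 1, 1, tzinfo=timezone.utc)
NOT_AFTER = datetime(2029, 1, 1, tzinfo=timezone.utc)

# What the board believed after the CMOS battery died (illustrative date).
rolled_back_clock = datetime(2017, 1, 1, tzinfo=timezone.utc)
# The clock after being fixed in the BIOS (illustrative date).
corrected_clock = datetime(2023, 4, 10, tzinfo=timezone.utc)

print(check_validity(rolled_back_clock, NOT_BEFORE, NOT_AFTER))
print(check_validity(corrected_clock, NOT_BEFORE, NOT_AFTER))
```

With these made-up dates, the rolled-back clock falls before the start of the window, so the check reports the certificate as not yet valid, while the corrected clock lands inside it. A signature can be perfectly intact and still be rejected purely because of what the machine thinks the date is.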
