By a former Azure Core engineer

Article URL: https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion Comments URL: https://news.ycombinator.com/item?id=47616242 Points: 925 # Comments: 409

Inside the complacency and decisions that eroded trust in Azure—from a former Azure Core engineer.

This is the first of a series of articles in which you will learn about what may be one of the silliest, most preventable, and most costly mishaps of the 21st century, where Microsoft all but lost OpenAI, its largest customer, and the trust of the US government.

I joined Azure Core on the dull Monday morning of May 1st, 2023, as a senior member of the Overlake R&D team, the folks behind the Azure Boost offload card and network accelerator.

I wasn’t new to Azure, having run what is likely the longest-running production subscription of this cloud service, which launched in February 2010 as Windows Azure.

Furthermore, I contributed to brainstorming the early Overlake cards in 2020-2021, drafting a proposal for a Host OS <-> Accelerator Card communication protocol and network stack, when all we had was a debugger’s serial connection.

I also served as a Core OS specialist, helping Azure Core engineers diagnose deep OS issues.

I rejoined in 2023 as an Azure expert on day one, having contributed to the development of some of the technologies on which Azure relies and having used the platform for more than a decade, both outside and inside Microsoft at a global scale.

As a returning employee, I skipped the New Employee Orientation and had my Global Security invite for 12 noon to pick up my badge, but my future manager asked if I could come in earlier, as the team had their monthly planning meeting that morning.

I, of course, agreed and arrived a few minutes before 10 am at the entrance of the Studio X building, not far from The Commons on the West Campus in Redmond.

A man showed up in the lobby and opened the door for me.

I followed him to a meeting room through a labyrinth of corridors.

The screen projected a slide where I recognized a number of familiar acronyms, like COM, WMI, perf counters, VHDX, NTFS, ETW, and a dozen others, mixed with new Azure-related ones, in an imbroglio of boxes linked by arrows.

I sat quietly at the back while a man was walking the room through a big porting plan of their current stack to the Overlake accelerator.

As I listened, it was not immediately clear what that series of boxes with Windows user-mode and kernel components had to do with that plan.

After a few minutes, I risked a question: Are you planning to port those Windows features to Overlake?

The answer was yes, or at least they were looking into it.

The dev manager showed some doubt, and the man replied that they could at least “ask a couple of junior devs to look into it.”

The room remained silent for an instant.

I had seen the hardware specs for the SoC on the Overlake card in my previous tenure: the RAM capacity and the power budget, which was just a tiny fraction of the TDP you can expect from a regular server CPU.

Everything was nimble, efficient, and power-savvy, and the team I had joined 10 minutes earlier was seriously considering porting half of Windows to that tiny, fanless, Linux-running chip the size of a fingernail.

That felt like Elon talking about colonizing Mars: just nuke the poles then grow an atmosphere!

Easier said than done, huh?

That entire 122-strong org was knee-deep in impossible ruminations involving porting Windows to Linux to support their existing VM management agents.

The man was a Principal Group Engineering Manager overseeing a chunk of the software running on each Azure node; his boss, a Partner Engineering Manager, was in the room with us, and they really contemplated porting Windows to Linux to support their current software.

At first, I questioned my understanding.

Was that serious?

The rest of the talk left no doubt: the plan was outlined, and the dev leads were tasked with contributing people to the effort.

It was immediately clear to me that this plan would never succeed and that the org needed a lot of help.

That first hour in the new role left me with a strange mix of stupefaction and incredulity.

I later learned that the stack was hitting its scaling limits on a 400-watt Xeon at just a few dozen VMs per node, a far cry from the 1,024-VM limit I knew the hypervisor was capable of. It was also a noisy neighbor, consuming so many resources that it caused jitter observable from customer VMs.

There is no dimension in the universe where this stack would fit on a tiny ARM SoC and scale up by many factors.

It was not going to happen.

I have seen a lot in my decades of industry (and Microsoft) experience, but I had never seen an organization so far from reality.

My day-one problem was therefore not to ramp up on new technology, but rather to convince an entire org, up to my skip-skip-level, that they were on a death march.

Deep down, I knew it was going to be a fierce uphill battle.

As you can imagine, and as you will later learn, it didn’t go well.

I notably spent more than 90 minutes chatting in person with the head of the Linux System Group, a solid scholar with a PhD from INRIA, who was among the folks who hired me on the kernel team years earlier.

His org is responsible for delivering Mariner Linux (now Azure Linux) and the trimmed-down distro running on the Overlake / Azure Boost card.

He kindly answered all my questions, and I learned that they had identified 173 agents (one hundred seventy-three) as candidates for porting to Overlake.

I later researched this further and found that no one at Microsoft, not a single soul, could articulate why up to 173 agents were needed to manage an Azure node, what they all did, how they interacted with one another, what their feature set was, or even why they existed in the first place.

Azure sells VMs, networking, and storage at the core.

Add observability and servicing, and you should be good.

Everything else (SQL, K8s, AI workloads, and whatnot) builds on VMs with xPU, networking, and storage, and the heavy lifting to make the magic happen is done by the good Core OS folks and the hypervisor.

How the Azure folks came up with 173 agents will probably remain a mystery, but it takes a serious amount of misunderstanding to get there, and this is also how disasters are built.

We are still far from the vaporized trillion in market cap, my letters to the CEO, to the Microsoft Board of Directors, and to the Cloud + AI EVP and their total silence, the quasi-loss of OpenAI, the breach of trust with the US government as publicly stated by the Secretary of Defense, the wasted engineering efforts, the Rust mandate, my stint on the OpenAI bare-metal team in Azure Core, the escort sessions from China and elsewhere, and the delayed features publicly implied as shipping since 2023, before the work even began.

If you’re running production workloads on Azure or relying on it for mission-critical systems, this story matters more than you think.

Source: This article was originally published by Hacker News
