Accelerating open source and hardware innovation to meet data and security demands

By Kushagra Vaid, Distinguished Engineer, Microsoft Azure

May 13, 2020

The Open Compute Project (OCP) Global Summit, kicking off virtually on May 12, is where a vibrant and growing community comes together to grow, drive, and support the open hardware ecosystem.

This year's theme for the OCP Global Summit, "Open for all," certainly resonates with us as we continue our open source journey alongside industry partners and the open source community. At Microsoft, we value a community approach to building technology because we believe open source makes computing better for the world.

Each year at the OCP Summit, we contribute new innovations that address the most pressing demands on cloud infrastructure. This year is no different: we continue our endeavor to accelerate open source innovation and to partner with industry leaders and the community to ensure access to the latest hardware security technology.

Accelerating open source innovation

We believe data privacy and security are fundamental to building and maintaining trust in the cloud, and by working alongside partners and the community we can broaden access to the latest hardware security technology. We have been working closely with partners such as Intel, AMD, Broadcom, Nuvoton, Mellanox, and NXP to collaborate on Project Cerberus. Today, we're excited to announce that the Project Cerberus source code and tools (including system processes, platform architecture, reference architecture, and firmware source) are being open sourced, enabling the broader community to collaborate and contribute to the architecture and technology. We look forward to Cerberus-enlightened products coming to market in 2020.
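Project Cerberus is a hardware root of trust that verifies firmware integrity before the platform is allowed to boot. The following is a conceptual sketch of that idea only, not the actual Cerberus API or firmware; all names, images, and measurements below are hypothetical.

```python
import hashlib

# Conceptual illustration of a root-of-trust firmware check (hypothetical,
# not the Project Cerberus implementation): each flash region is hashed and
# compared against a known-good "golden" measurement before boot proceeds.
GOLDEN_MEASUREMENTS = {
    # region name -> expected SHA-256 digest of the trusted firmware image
    "bmc_firmware": hashlib.sha256(b"trusted BMC image").hexdigest(),
    "bios_flash": hashlib.sha256(b"trusted BIOS image").hexdigest(),
}

def measure(image: bytes) -> str:
    """Produce a measurement (SHA-256 digest) of a firmware image."""
    return hashlib.sha256(image).hexdigest()

def attest_platform(images: dict) -> bool:
    """Allow boot only if every flash region matches its golden measurement."""
    for region, expected in GOLDEN_MEASUREMENTS.items():
        if measure(images.get(region, b"")) != expected:
            return False  # mismatch: hold the platform, firmware untrusted
    return True

# Pristine firmware passes attestation; a tampered BMC image is rejected.
assert attest_platform({
    "bmc_firmware": b"trusted BMC image",
    "bios_flash": b"trusted BIOS image",
})
assert not attest_platform({
    "bmc_firmware": b"tampered image",
    "bios_flash": b"trusted BIOS image",
})
```

In an actual root of trust, this verification runs in dedicated hardware that gates the flash and reset lines, so compromised host firmware cannot bypass the check.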

New standards to drive interoperable modularity for AI capabilities and security

Modular Building Block Architecture

Modeled after Microsoft’s Project Olympus, the Modular Building Block Architecture (MBA) initiative supports our priority to provide interoperable innovations. MBA clearly defines interfaces and physical boundaries for independent development and contributions through three stages: 1) base specification for comprehensive architectural definition, 2) design specification, including design implementation and collateral, and 3) product contribution. Examples of MBA include Open Accelerator Infrastructure (OAI) and Datacenter-ready Secure Control Module (DC-SCM).

Open Accelerator Infrastructure (OAI)

After jointly contributing the OCP Accelerator Module (OAM) specification with Facebook and Baidu in March 2019, Microsoft worked with the same partners to form the Open Accelerator Infrastructure (OAI) subproject, which develops an open, interoperable, modular infrastructure supporting emerging accelerators for artificial intelligence (AI) and high-performance computing (HPC). Since inception, OAI contributors have worked through an OCP joint development agreement (OAI-JDA) and have developed the specifications for various modules and three reference systems. Recognizing this interoperability opportunity, several accelerator companies, including Intel/Habana Labs, AMD, NVIDIA, and Xilinx, have announced plans for OAI-compatible OAMs.

Datacenter-ready Secure Control Module (DC-SCM)

Another example that follows Project Olympus' modular tenets is DC-SCM and its interface, DC-SCI. As the "heart of the motherboard," DC-SCM includes all essential elements of a server motherboard except the CPU, memory slots, and I/O slots: the BMC, the root of trust (RoT), system and BMC flash, and the other ancillary components required to deliver a datacenter-compatible secure control module. As a scheme for dividing roles and responsibilities, this approach creates a win-win opportunity for server suppliers as well as hyperscale datacenters.

The DC-SCI offers a standard demarcation point that allows AMD-, Intel-, and ARM64-based servers to be built quickly and easily for OCP partner companies: different OCP partners can share one server baseboard while installing their preferred flavor of DC-SCM (with potentially different BMC, RoT, etc.). As a collaborative example, Microsoft co-presented an earlier draft of the DC-SCM concept with Google at the OCP Regional Summit in Amsterdam in September 2019.

We look forward to contributing the DC-SCM/DC-SCI base specification to OCP, and several ODMs are engaged to integrate DC-SCM into their server solutions over time. The value of DC-SCM and its standard interface has been recognized, and the next effort will be its inclusion in the OAI specification as OAI-SCM.

Driving common media standards to improve serviceability

Following our collaborative efforts to bring M.2 and E1.L NVMe drives into ODM systems, we are introducing E1.S, a new contribution of 1U and 2U chassis designs optimized for Intel Cascade Lake-based and AMD Rome-based systems. We believe this new standard will help catalyze broader adoption across the enterprise by enabling smaller capacity increments and higher IOPS per GB, as well as easing cost constraints.
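To illustrate the IOPS-per-GB point, consider that an SSD's random-read IOPS are largely set by its controller, so spreading capacity across more, smaller drives raises aggregate IOPS per GB. The numbers below are hypothetical, chosen only to show the arithmetic, not measured E1.S figures.

```python
# Hypothetical figures: assume each NVMe drive's controller delivers the
# same random-read IOPS regardless of its capacity.
DRIVE_IOPS = 500_000  # per-drive random-read IOPS (illustrative assumption)

def iops_per_gb(capacity_gb: float) -> float:
    """IOPS density of a single drive of the given capacity."""
    return DRIVE_IOPS / capacity_gb

# A single 8 TB drive vs. a single 2 TB drive: the smaller capacity
# increment yields 4x the IOPS per GB for the same stored data.
large = iops_per_gb(8000)   # 62.5 IOPS/GB
small = iops_per_gb(2000)   # 250.0 IOPS/GB
assert small == 4 * large
```

Under this assumption, building out capacity in smaller E1.S increments lets operators match IOPS density to workload needs rather than over-buying capacity.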

Establishing liquid cooling standards to enable next-gen GPU, HPC, and AI workloads

While many argue that we are reaching the limits of Moore's law, we believe Moore's law can be applied to halving the cost of the whole datacenter campus every two years, not just the chip. This means the network, the hardware, and the datacenter are collectively optimized as a system to enable that goal. We see liquid cooling, and immersion cooling in particular, enabling new architectures that we have not even begun to consider, and we see OCP and its partners as a way to accelerate this development.

While liquid cooling has been used in specific cases such as bitcoin mining, we are not only investing in the solutions and technologies that will power new architectures, but also focusing intensely on the challenges of extending these capabilities to a hyperscale cloud. Cold-plate solutions have been used in supercomputers for years, but they are highly customized and not designed for the serviceability and reliability the cloud requires. Microsoft is collaborating with the OCP ecosystem, particularly Facebook and CoolIT, to establish standards for blind-mate cold-plate solutions for both Project Olympus systems and Open Rack v3. Even though these racks differ, many elements are common:

  • Standardization of blind-mate connections, liquid-cooling manifolds, and system flow rates.
  • Liquid pumps and rear-door heat exchangers.
  • Immersion cooling, which enables significantly simpler system and platform designs not possible with either air or cold-plate cooling; the end-to-end solution consumes the least energy of all cooling methods, and many investigations and collaborations are underway in the OCP Advanced Cooling Solutions subteam.
  • Collaboration with industry, 3M in particular, on fluids.
  • Wiwynn as a strong partner bringing immersion systems to OCP.

We are also engaged in joint collaborations with the OCP Rack and Power team, Facebook, and the broader ecosystem on Open Rack v3 (ORv3) standards.

Microsoft is working with ORv3 Power Shelf suppliers to deliver high-power racks and systems incorporating the learnings and experience of operating a global cloud. ORv3 holds promise for even deeper reuse of OCP gear.

We look forward to connecting virtually at the OCP Global Summit. We have several sessions throughout the summit, which you can find in the OCP Summit schedule.
