Dawn of the Multicore Networking
Off to VMworld, flying my favorite airline Virgin America, blogging, wifi, satellite TV and working; which is just cool when you consider that when I started traveling for business I had a choice of a smoking seat and a newspaper. I recently posted additional SDN thoughts on the Plexxi blog, which was a follow-up to the last post on the SIWDT blog.
The following is the post that I alluded to last month and that I have been writing and revising for a few months. It all started several months ago when I was reading Brad Hedlund’s blog, in which he posted several Hadoop network designs. I am referencing the post by Brad because it made me stop and think about designing networks. I have been talking to a lot of people about where networks are going, how to design networks, blah, blah, just click on the “Networking” tab to the right and you can read more than a year’s worth of postings on the subject. Side note: if you have difficulty falling asleep, reading these posts might serve as a cure.
As a starting point, consider the multicore evolution paradigm from a CPU perspective. In a single core design, the CPU processes all tasks and events, and this includes many of the background system tasks. In 2003 Intel was showing the design of Tejas, which was their next evolution of the single core CPU, with plans to introduce it in late 2004. Tejas was cancelled due to heat caused by extreme power consumption of the core. That was the point of diminishing returns in the land of CPU design. At the time AMD was well down the path of a multicore CPU and Intel soon followed.
From a network design perspective, I would submit that the single core CPU is analogous to the current state of how most networks are designed and deployed. Networks are a single core design in which the traffic flows to a central aggregation layer or spine layer for switching to other parts of the network. Consider the following example:
- 100,000 2x10G Servers
- Over-Subscription Ratio of 1:1
- Need 2,000,000 GbE equivalent = 50,000 x 40 GbE
- Clos would need additional ~100,000 ports
- Largest 40 GbE aggregation switch today is 72 ports
- 96 ports coming soon in 2U, at 1-2 kW
- 100k servers = 1,500 switches
- 1.5-3.0 MW – just for interconnection overhead
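The arithmetic in the list above can be checked with a few lines of Python. This is only a sketch under my reading of the numbers: it assumes the ~1,500 switch figure comes from dividing the 50,000 edge ports plus the ~100,000 additional Clos ports by the 96-port switch, and that the power range is simply switch count times 1-2 kW. The function and parameter names are made up for illustration.

```python
import math


def interconnect_estimate(
    servers=100_000,
    nics_per_server=2,
    nic_gbps=10,
    port_gbps=40,
    extra_clos_ports=100_000,    # rough figure taken from the list above
    ports_per_switch=96,         # the "coming soon" 2U switch
    kw_per_switch=(1.0, 2.0),    # low and high power estimate per switch
):
    """Back-of-the-envelope sizing for a 1:1 oversubscribed fabric."""
    # Total server-facing bandwidth, expressed in GbE equivalents.
    total_gbps = servers * nics_per_server * nic_gbps
    # The same bandwidth expressed as 40 GbE ports.
    edge_ports = total_gbps // port_gbps
    # Assumption: switch count covers edge ports plus the extra Clos ports.
    total_ports = edge_ports + extra_clos_ports
    switches = math.ceil(total_ports / ports_per_switch)
    # Power range in megawatts for the interconnect alone.
    power_mw = tuple(switches * kw / 1000 for kw in kw_per_switch)
    return total_gbps, edge_ports, switches, power_mw


if __name__ == "__main__":
    total_gbps, edge_ports, switches, power_mw = interconnect_estimate()
    print(f"GbE equivalent: {total_gbps:,}")
    print(f"40 GbE ports:   {edge_ports:,}")
    print(f"Switches:       {switches:,}")
    print(f"Power (MW):     {power_mw[0]:.2f} to {power_mw[1]:.2f}")
```

Under these assumptions the switch count lands a little above 1,500 and the power a little above the 1.5-3.0 MW range, which is consistent with the rounded figures in the list.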
This network design results in what I call the +1 problem. The +1 problem is reached when the network requires one additional port beyond the capacity of the core or aggregation layer.
In contemporary leaf/spine network designs, 45 to 55% of the bandwidth deployed is confined to a single rack. Depending on the oversubscription ratio this can be higher, such as 75%. There is nothing strange about this percentage range, as network designs from most network equipment vendors would yield the same results. This has been the basis of the networking design rule: buy the biggest core that you can afford and scale it up to extend the base to encompass as many device connections as possible.
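One way to see where percentages like these could come from is a simple leaf switch model. This is my own simplification, not a vendor formula: assume a leaf has server-facing capacity D and uplink capacity U, with oversubscription ratio r = D/U, and count the server-facing share D/(D+U) = r/(r+1) as the bandwidth confined to the rack. The function name is hypothetical.

```python
def in_rack_share(oversubscription: float) -> float:
    """Share of a leaf switch's deployed bandwidth that faces the rack.

    Assumes the simplified model r = downlink / uplink capacity, so the
    server-facing share of all deployed leaf bandwidth is r / (r + 1).
    """
    if oversubscription <= 0:
        raise ValueError("oversubscription ratio must be positive")
    return oversubscription / (oversubscription + 1.0)


if __name__ == "__main__":
    for ratio in (1, 2, 3):
        print(f"{ratio}:1 oversubscription -> {in_rack_share(ratio):.0%} in rack")
```

In this model a 1:1 design sits at the middle of the 45 to 55% range, and a 3:1 design gives the 75% figure.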
Multicore CPUs come in different configurations. A common configuration is what is termed symmetrical multiprocessing (SMP). In an SMP configuration, CPU cores are treated as equivalent resources that can all work on all tasks, but the operating system manages the assignment of tasks and scheduling. In the networking world, we have provided the same kind of structure by creating work group clusters for Hadoop, HPC and low latency (LL) trading applications. The traditional single core network design that has been in place since IBM rolled out the mainframe has been occasionally augmented over the years with additional networks for mini computers, client/server LANs, AS400s and today for Hadoop, low latency and HPC clusters. Nothing really new here, because eventually it all ties back into and becomes integrated with the single core network design. No real statistical gain or performance improvement is achieved; scaling is still a function of building wider in order to build taller.
Multicore CPU designs offer significant performance benefits when they are deployed in asymmetrical multiprocessing (AMP) applications. Using AMP some tasks are bound to specific cores for processing, thus freeing other cores from overhead functions. That is how I see the future of networking. Networks will be multicore designs in which some cores (i.e. network capacity) will be orchestrated for HPC, priority applications and storage, while other cores will address the needs of more mundane applications on the network.
The future of the network is not more abstraction layers, interconnect protocols, protocols wrapped in protocols and riding the curve of Moore’s Law to build bigger cores and broader bases. That was the switching era. The new era is about multicore networking. We have pretty much proven that multicore processing in CPU design is an excellent evolution. Why do we not have multicore networking? Why buy a single core network design and use all sorts of patches, tools, gum, paperclips, duct tape and widgets to jam all sorts of different applications through it? I think applications and workload clusters should have their own network cores. There could be many network cores. In fact, cores could be dynamic. Some would be narrow at the base, but have high amounts of bisectional bandwidth. Some cores would be broad at the base, but have low bisectional bandwidth. Cores can change; cores can adapt.
I think traditional network developers are trying to solve the correct problems – but under the limitations of the wrong boundary conditions. They are looking for ways to jam more flows, or guarantee flows with various abstractions and protocols into the single core network design. We see examples of this every day. Here is a recent network diagram from Cisco:
When I look at a diagram like this, my first reaction is to ask: What is plan B? Plan B in my mind is a different network. I fail to see why we as the networking community would want to continue to operate and develop on the carcass of the single core network if it is broken. In other words, if the single core network design is broken, nothing we develop on it will fix it. A broken network is a broken network. Let the single core network fade away and start building multicore networks. As always, it is possible that I am wrong and someone is working on the silicon for a 1,000-port terabit Ethernet switch. It is probably a stealth mode startup called Terabit Networks.
* It is all about the network stupid, because it is all about compute. *
** Comments are always welcome in the comments section or in private. **