The Disputation Between Pessimal and Optimal in Networking
It has been well over a year since I last posted and the cause of the writing drought has been work. I have simply been too tired and too busy to write, which is somewhat of a high-quality problem. Taking the time to construct my thoughts into words helps me craft my narrative to prospects and customers. This post is about what I say to prospects and customers every week, a narrative that has changed and evolved over the past five years.
Since I last posted on the blog, a lot has changed in networking. Arista has been on a tear, Dell and EMC merged, Brocade disappeared, Extreme is now the home of Extreme, Foundry and Enterasys, HP resells Arista and Cisco is still Cisco. I rarely get questions about OpenFlow and, to that point, since my last post the ONF has merged with ON.Lab. Four years ago every customer call involved some discussion of OpenStack, but today it is a rarity, mainly a holdover among customers and prospects that have spent the last four years committed to the OpenStack cause. Network virtualization, well, that did not happen on a scale meaningful to anyone. The question every analyst, investor and customer prospect asks is: who are Plexxi’s competitors? The primary competitors I encounter on a regular basis are Cisco and Arista, and I think the reason is that I am often selling fabrics to customers who have a lot of compute and storage and whose needs are always changing and evolving with their business.
Life After Clos
When I meet people outside of work for the first time and they ask me what I do for work, I always hesitate to answer. I try to say something about data center fabrics, but eventually it is easier to say I sell parts of the cloud, you know, the thing your smartphone and computer back up to, and people just nod. When I meet people in a work environment who have a technical background, I answer the question much differently, and this post is a condensed version of a typical networking sales call I have every week, whether it is on the whiteboard, in a video chat or in a Starbucks.
What Plexxi builds is a fabric as a service (FaaS). We take a common switch building block, in the accompanying diagrams a 3.2 Tb/s white box switch (other switch options are possible), and collapse the spine layer into the switch. The switch becomes both the access layer and the spine switching layer. We can build multi-dimensional fabrics, most commonly a one-dimensional (1D) or two-dimensional (2D) torus, that provide an immense number of configurable paths (i.e. bandwidth) between end-points. A torus is the most efficient design approach for computing and storage fabrics. The easiest way to think of a torus design is to lay out a grid (like Manhattan) on a piece of paper, roll it into a tube and connect the ends. This visualization tends to demystify the design concept. The design would also be completely useless if we ran spanning tree and other legacy fabric protocols. Instead, we use math.
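To make the wrap-around idea concrete, here is a minimal sketch in Python (illustrative only, not Plexxi code) that computes each switch's neighbors in a 1D ring or a 2D torus; the 10x10 grid size and the function names are just examples.

```python
# Toy illustration: neighbors of a switch in a torus fabric.
# A 1D torus is a ring; a 2D torus wraps a grid in both directions,
# like rolling the Manhattan street grid into a tube and joining the ends.

def ring_neighbors(i, n):
    """Neighbors of switch i in a 1D torus (ring) of n switches."""
    return [(i - 1) % n, (i + 1) % n]

def torus_neighbors(x, y, cols, rows):
    """Neighbors of switch (x, y) in a 2D torus of cols x rows switches."""
    return [((x - 1) % cols, y), ((x + 1) % cols, y),
            (x, (y - 1) % rows), (x, (y + 1) % rows)]

# Example: a 10x10 grid of switches, one per rack.
print(ring_neighbors(0, 10))          # [9, 1] -- the ring wraps around
print(torus_neighbors(0, 0, 10, 10))  # [(9, 0), (1, 0), (0, 9), (0, 1)]
```

Because the wrap-around gives every switch the same connectivity, there is no special edge or spine position in the fabric, and there are always multiple paths between any two switches.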
A question people often ask is about uplinks to spines, spines of spines and sometimes even super spines. To understand a Plexxi fabric, the first step is to understand that we do not need a discrete network or spine layer to build a fabric. Ethernet switching silicon is so powerful today, and getting more powerful, that building discrete switching layers for fabric designs is really a holdover from an era of networking design that is beginning to fade away.
What is different about Plexxi is that we can build this fabric and understand what is connected to it (i.e. locality). By understanding where devices are located in the fabric, and the multitude of paths between these end points, a Plexxi Controller can optimize the forwarding topologies in the fabric, a process we call fitting, and enable the network to operate as a unified system controllable by the user. See the picture to the left from 2012 about designing and managing networks as a system pool. Without spending a large portion of time on the technical details of a Plexxi fabric, there is a default physical path construct and two controllers. One controller is located on the switch and is responsible for packet forwarding. The second controller is the logical fabric controller, and it is within the logical fabric controller that the multi-commodity flow algorithms developed by Plexxi reside. In a Plexxi fabric the path of packets is never a mystery. The packet forwarding rules are known to the controller on the switch as well as to the fabric controller. Network engineers do not need to have twenty CLI windows open trying to figure out forwarding paths or what spanning tree is doing.
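As a rough sketch of the idea behind fitting (a toy greedy placement I wrote for illustration, not Plexxi's actual multi-commodity flow algorithms; every rack name, link name and demand below is made up), a controller that knows both the topology and the end-points can spread traffic demands across candidate paths instead of letting each hop decide in isolation:

```python
# Toy sketch of "fitting": assign each end-point-to-end-point demand to the
# least-loaded of its candidate fabric paths. A greedy stand-in for the
# multi-commodity flow optimization described above, not the real algorithm.

from collections import defaultdict

# Hypothetical demands: (source rack, destination rack, gigabits per second).
demands = [("rack1", "rack7", 40), ("rack2", "rack7", 25), ("rack1", "rack9", 100)]

# Hypothetical candidate paths per (src, dst) pair, as tuples of fabric links.
candidate_paths = {
    ("rack1", "rack7"): [("r1-r4", "r4-r7"), ("r1-r9", "r9-r7")],
    ("rack2", "rack7"): [("r2-r7",), ("r2-r5", "r5-r7")],
    ("rack1", "rack9"): [("r1-r9",), ("r1-r4", "r4-r9")],
}

link_load = defaultdict(float)  # Gb/s currently placed on each fabric link

def place(demand):
    src, dst, gbps = demand
    # Choose the candidate path whose busiest link is currently least loaded.
    path = min(candidate_paths[(src, dst)],
               key=lambda p: max(link_load[link] for link in p))
    for link in path:
        link_load[link] += gbps
    return path

for d in demands:
    print(d, "->", place(d))
print(dict(link_load))
```

The real controller solves this jointly across all demands and link capacities; the point of the toy is simply that once the end-points and the topology are known, path selection becomes a calculation rather than a per-hop guess.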
When the Plexxi Controller is connected to an application, the system can correlate fabric end-points, topology paths and application workloads. This is the difference between a Plexxi network and a legacy network. In a legacy network, the fabric is really unaware of workloads, and network devices self-register and operate independently of each other. A switch or a router advertises routes, adjacencies, etc. and then forwards packets to the next hop, never really aware of whether that next hop is the best path option. We call that a pessimal solution and it is a fundamental tenet of legacy network design.
In the past it was very difficult to solve for system-wide correlation, and this gave rise to a concept found only in networking: that maximally random equals best optimized. Networking is the only engineering discipline in which you find technical people espousing that the most random is the best optimized. You would never expect a storage person to tell you that best practice is to scatter data across randomly chosen LUNs or drives. A compute person would not tell you that randomly scattering VMs and app workloads across randomly chosen CPUs is a best practice. Construction engineers do not randomly scatter load-bearing beams around the frame of a building. The FAA does not send planes along random routes hoping one turns out to be the best path to the destination. Why do network people believe that random is best optimized?
Because in the past we did not have both ends of the equation. Without that knowledge, the best approach was to make forwarding topologies as random as possible, since we could not compute the best answer; the best answer was always simply the next hop. Now that we know the end points and can act on them as a system, why would we choose random rather than calculated? This is the disputation between pessimal and optimal in networking. In a world in which we can calculate the best answer, we choose to use that approach in a Plexxi fabric, and that is what makes Plexxi different from a legacy network.
Another analogy I use when presenting is to ask people about their commute home. Do they know their destination? Do they know their starting point? Most people answer in the affirmative. Do they know something about the constraints on their commute? Maybe they are in a rush today. Maybe they need to run errands on the way home. Maybe they prefer to avoid freeways. If you use Waze or Google Maps and have ever experienced the rerouting effect, that is the traffic controller feeding its inputs into multi-commodity flow-style algorithms and adjusting your route home based on what it knows about your needs and the surrounding conditions. That is what modern networking with a controller architecture for the fabric does as well.
Building Switching Fabrics Small and Large
Legacy networking is built on a demand curve that is deflationary in terms of bandwidth, and it has leaned on a network design that has remained relatively unchanged for the past 20-30 years. Over time the cost of bandwidth has declined (i.e. deflationary), driven primarily by Moore’s Law, and this trend has given the networking industry a free design pass for decades. Continuous improvement in switching silicon performance has been the design crutch that we have all benefitted from. At each inflection point where the pessimal network design has faced a moment of challenge, the answer has been a bandwidth upgrade: 1M to 10M, 10M to 100M, 100M to 1G, 1G to 10G, 10G to 40G, and on to 25G and 100G.
Money increases in value relative to capacity: I can buy more capacity if I spend more money, but surplus and scarcity are two sides of the same coin. The dilemma for the network buyer is that capacity is scarce (i.e. inflexible, static, unusable) relative to money, and money is scarce relative to capacity. It is for this reason that network engineers are at an inflection point in designing the next generation of compute and storage network fabrics, and really need modern tools to build fabrics that are adaptable to services.
Fabric as a Service
When I talk to prospects about building networks, the design philosophy we advocate is to switch where you can and route where you must. It is a relatively simple process, and I will walk through how to build a fabric as a service (FaaS) from four switches to a pod of ten rows by ten racks (10×10) using a simple Tomahawk-based white box switch. Often I hear network people tell me that they want to build a Clos network, or about their plan to upgrade to a next-generation Clos network. I have even had prospects describe the building of spines and super-spines and spines of spines, all of which are design techniques to overcome legacy technology limitations.
Starting with Four Switches
When designing data center fabrics there are all sorts of choices that affect the design. I tend to start with questions: What is the speed of the server NICs, and how many NICs per server? How many servers per rack? I want a count of the client ports for servers and storage. Will the servers be cross-racked? How far away are the routers? How much exit bandwidth is required from the row and the pod, and will the routers support 40G or 100G? What kind of apps will run in the pod, and are there workload requirements to consider between racks, rows and the exits?
There are also many physical design aspects that matter. Are the rows uniformly spaced? I have seen data center designs in which there are non-uniform gaps between racks and rows. From a Plexxi perspective, we have many design options for how the LightRail fabric can be wired. We can locate our passive patch panels in a distributed or centralized design. I tend to prefer to centralize the wiring plant for ease of configuration and expansion, but it is really up to the team based on the needs of the pod and the physical limitations.
For illustration purposes, I am going to use the same picture format showing the design and various specifications. To start with, this is an ODM-designed and manufactured Ethernet switch. It is a 32×100G Tomahawk-based white box. You could take this switch and install Cumulus Linux or Dell OS10 and it would act like a switch with the OS of your choice. I tell prospects that if they do not like the Plexxi network, they are free to choose an OS from any of the four or five network vendors offering a switch OS, as the hardware will never be wasted. To build a Plexxi network, we install the Plexxi OS using ONIE and then install the Controller.
The four switch design shown in the picture has the fabric built into the switch. This is the part I was referring to when I stated that we turn uplink ports into fabric ports and can dispense with the need for a spine or core switch. We can build a high-density fabric based on 10G, 25G or 100G fabric links that performs better without the need to design a spine layer. The bill of materials (BOM) for this design is four S3e switches and one Pod Switch Interconnect (PSI), which is a passive shuffle box. For such a small design, we only need a single LightRail dimension, so the fabric is 2.4 Tb/s (600G x 4 S3e switches). The oversubscription ratio (OSR) of this design is 1.73:1 and it supports 10G, 25G, 40G, 50G and 100G clients.
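As a quick sanity check on those numbers (my own back-of-the-envelope arithmetic from the figures above, not an official spec sheet):

```python
# Four-switch design, single LightRail dimension (my arithmetic, not a spec).
fabric_per_switch_gbps = 600      # stated fabric capacity per S3e switch
switches = 4

fabric_total_tbps = fabric_per_switch_gbps * switches / 1000
print(fabric_total_tbps, "Tb/s of fabric")    # 2.4 Tb/s, as stated

# If OSR here means client-facing bandwidth divided by fabric bandwidth,
# the stated 1.73:1 works out to roughly this much client capacity per switch:
osr = 1.73
print(round(osr * fabric_per_switch_gbps), "Gb/s of client capacity per switch")  # ~1038
```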
From Four Switches to a Row
To create a row of ten switches, we need to add six switches. For illustration purposes, I showed the network drawing as two rows of five switches, but that was a PowerPoint constraint, not a technical one. A ten-switch row nicely supports 40 servers per rack with 2x10G or 2x25G NICs per server, with plenty of ports left over to connect to storage and external routers and switches. The LightRail fabric in the row would provide 6 Tb/s of bandwidth for the ten racks, which is a perfect place to put your compute- and traffic-intensive applications. The OSR of this design is still 1.73:1, which is the happy result of the linear scale property of the fabric design.
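The row-level numbers follow the same arithmetic; again, a back-of-the-envelope sketch using the figures quoted above:

```python
# Ten-switch row (my arithmetic, using the figures from the text above).
switches = 10
fabric_per_switch_gbps = 600
print(switches * fabric_per_switch_gbps / 1000, "Tb/s of row fabric")  # 6.0 Tb/s

# Server attach per rack: 40 servers x 2 NICs at 25G (or 10G).
servers, nics, nic_speed_gbps = 40, 2, 25
print(servers * nics * nic_speed_gbps / 1000, "Tb/s of server attach per rack")  # 2.0 Tb/s

# Both client and fabric capacity grow linearly with the switch count,
# which is why the OSR stays at 1.73:1 as the row scales.
```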
5×10 Half Pod
Earlier in this post I mentioned that a Plexxi fabric can be one- or two-dimensional. Most of our customers who build fabrics of 20 switches or fewer typically build a single-dimension fabric for cost reasons. For performance benefits, a second fabric dimension can be added at any time, but for most designs the duality between surplus and scarcity becomes compelling as the fabric grows beyond 20 switches. Fifty switches, or a half pod, is a good place for a multi-dimensional fabric. In this design we have simply added four additional rows of ten switches and a second LightRail dimension to the fabric. The actual fabric-wiring schema is not drawn accurately in this diagram. The X dimension is not a collection of five LightRail fabrics; it is really a single fabric dimension. The same is true for the Y dimension, and we typically design the fabric with an offset between the X and Y fabric dimensions. Enough about wiring complex multi-dimensional fabrics. If you have a deep interest in how the fabric can be wired up and how the forwarding topologies can be manipulated, you can request a technical briefing.
10×10 Full Pod
Here is the full 10×10 pod design with 100 switches: ten rows of ten, supporting upwards of 4,000 servers in a single pod design. The same switching element is used throughout the four designs. We can even change the fabric-side optics to use 100G rather than 25G. When using 100G fabric-side optics, the overall OSR becomes 1:1 and the LightRail fabric size increases from 120 Tb/s to 160 Tb/s in the 10×10 pod design.
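The full-pod figures check out the same way; in the sketch below the per-switch split between client and fabric capacity is inferred from the totals stated above, so treat it as my arithmetic rather than a spec:

```python
# 10x10 pod: 100 switches, two LightRail dimensions (my arithmetic).
switches = 100

# With 25G fabric-side optics: 600G per dimension, two dimensions per switch.
print(2 * 600 * switches / 1000, "Tb/s of pod fabric")   # 120 Tb/s, as stated

# The stated 160 Tb/s with 100G fabric-side optics means 1.6 Tb/s of fabric
# per switch. On a 3.2 Tb/s Tomahawk that leaves 1.6 Tb/s for clients,
# which is why the OSR becomes 1:1.
fabric_per_switch_tbps = 160 / switches          # 1.6
client_per_switch_tbps = 3.2 - fabric_per_switch_tbps
print(client_per_switch_tbps / fabric_per_switch_tbps)   # 1.0 -> OSR of 1:1
```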
When I speak with customers and prospects about building new networks (and that is my day job), I inquire about the new applications, storage and compute needs that are driving technology decisions within the organization. Often people tell me about HCI, CI, Kubernetes, VDI, etc. When a customer tells me about all these exciting technologies they plan to introduce to the organization, I am then confused when they state that they plan to deploy these bold technology choices on the same old network. It makes very little practical sense to make a bold compute/storage decision in favor of some variation of HCI/CI and then run your new compute/storage solution on network technology from 20 years ago. This is like buying a Ferrari, driving to a Tuscan hill town and trying to drive it up the narrow cobblestone road to the piazza for gelato. I am sure you will get your Ferrari to the piazza and the gelato will be delicious, but is that really the best use of a Ferrari? Personally, I would like to buy a Ferrari and take it out for a long drive on the autostrada.
/wrk