Dr. Strangelove: Or How I Learned to Stop Worrying and Love SDN
If you have not seen the movie Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, you should. It is an important point of cultural reference in our contemporary history. I have been thinking about all this SDN stuff and various technical and business strategies over the past few weeks. Today, a colleague made reference to the movie Dr. Strangelove in a passing conversation about network design. It occurred to me that there are a lot of humorous parallels between the movie and networking. This is a blog, and I think of it as a place between unfinished thoughts and longer form content.
I started writing about the OPEX problem of networking last year. Here is a link to a post on OPEX, but I should state that the founder of Plexxi was thinking about this problem in 2009-2010 when he founded the company, and if you look through the old Nicira presentations you can see they were highlighting the same problem set. See this presentation I did in January 2013, which is really a presentation on the origins of SDN and what it means. For this post, I am going to pretend that SDN does not exist. In fact, if networking did not exist at all and a group of us were put into a room and asked to create a network, I doubt that we would come up with anything remotely resembling what we have today. This might be a good exercise for IT architects to go through. How would you design a network today, forgetting the limitations of the last 20-30 years of networking? That is what we did at Plexxi and continue to do each day. That is the point of this post. If I were to design a system called a network to connect compute, storage and users, I would want that network to have a number of characteristics.
1. Network Must Express What is Important: Solving the internet and client/server challenges was about solving the problem of connectivity. Today, we are not solving the problem of connectivity; we are solving the problem of utilization and correlation. We need to correlate the utilization of the network with the workloads that the network is tasked to support. We want the network to be orchestrateable by the application and the developer of the application. That means a developer can say that application A-B-C requires a set of characteristics from the network, such as low latency, jitter sensitivity, bandwidth, hop count, or path (what is called service chaining). A network then has the ability to take these requirements from many applications and calculate a topology that best reflects the needs of the applications.
2. Dynamic Network: In order to make the network orchestrateable, we need the network to be dynamic. Dynamic networks can be built in the wireless or optical domains. Physically wiring a network reinforces the hardwired, physically limited design of the network. When we wire up the conventional leaf/spine, switched network design, we lock in the network design requirements on day 1. If the workload requirements of the network change after the first day, we have no ability to change how the network is configured.
3. Purpose Built for Automation: We want the network to be built for automation. We do not want to be configuring network elements at the port or the device level through CLIs. We want to compute topologies and script the network configuration.
4. Single Interface: We need to have centralized interface points that allow the network to be orchestrated. This is not a single point, but neither is it every network element. We want orchestration from the top down to the device, not from the device up.
5. API Driven: We want the network interface to be API driven. We want an open interface that allows software developers and server administrators (i.e. DevOps people) to orchestrate the network based on the needs of their applications and services. If we want to reduce OPEX, we need to remove the network as a silo that lags behind the dynamic nature of the other elements of IT.
6. Simplification: A network can do two things: connect and disconnect devices. Simplified physical layer cabling via an advanced dynamic interconnect translates into lower power, cooling and administrative costs. The Plexxi topology is guaranteed to remain consistent as you grow the network in size, in contrast to leaf/spine, where budgets and build timelines frequently do not allow this to happen.
7. Scale: We need a network that can scale concurrently with the evolutionary cycles occurring in compute and storage. The scaling problem is multi-dimensional.
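To make points 1 and 5 a little more concrete, here is a minimal Python sketch of applications expressing what they need and a controller turning those needs into a topology. This is not the Plexxi API or the Plexxi algorithm. The names AppRequirement and plan_direct_links, the per-switch port budget, and the greedy ranking are all hypothetical, invented only to illustrate the idea of computing a topology from workload requirements instead of wiring it in on day 1.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AppRequirement:
    """Hypothetical record of what one application asks of the network."""
    app: str
    src: str                  # switch where one side of the workload attaches
    dst: str                  # switch where the other side attaches
    bandwidth_gbps: float
    latency_sensitive: bool = False


def plan_direct_links(requirements, ports_per_switch):
    """Greedily choose which switch pairs get a direct link.

    Pairs carrying latency-sensitive traffic are served first, then pairs
    with the most requested bandwidth. Each direct link consumes one
    fabric port on both switches (an illustrative assumption).
    """
    demand = Counter()
    sensitive = set()
    for r in requirements:
        pair = tuple(sorted((r.src, r.dst)))
        demand[pair] += r.bandwidth_gbps
        if r.latency_sensitive:
            sensitive.add(pair)

    # Sort key: sensitive pairs first, then descending bandwidth, then name.
    ranked = sorted(demand, key=lambda p: (p not in sensitive, -demand[p], p))

    used = Counter()
    links = []
    for a, b in ranked:
        if a == b:
            continue  # both ends on the same switch, no fabric link needed
        if used[a] < ports_per_switch and used[b] < ports_per_switch:
            links.append((a, b))
            used[a] += 1
            used[b] += 1
    return links


requirements = [
    AppRequirement("A", "s1", "s2", bandwidth_gbps=10),
    AppRequirement("B", "s2", "s3", bandwidth_gbps=40),
    AppRequirement("C", "s1", "s3", bandwidth_gbps=1, latency_sensitive=True),
    AppRequirement("D", "s2", "s1", bandwidth_gbps=5),
]

print(plan_direct_links(requirements, ports_per_switch=2))
# [('s1', 's3'), ('s2', 's3'), ('s1', 's2')]
```

When the requirements change, the same function can simply be run again and the result pushed to the fabric. That only helps if the interconnect is dynamic, which is why point 2 matters.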
When building a network with the Plexxi architecture, there are four powerful advantages:
- The controller architecture (centralized points of control and administration)
- Simplified cabling (no need to engineer a massive structured cabling plant)
- Linear cost as the network grows (no core switch upgrades)
- The use of optics provides enormous benefits at scale in terms of power and cooling
For illustrative purposes, let us look at the number of servers, switches and cables required for a full fat tree network design built from k-port switches:
- Supports k^3/4 servers
- Requires 5k^2/4 switches
- Requires k^3/2 cables
Using a 64 port leaf (distribution layer) ToR switch would result in a fat tree design supporting 65,536 ports, with 5,120 switches and 131,072 cables. In a Plexxi network, the same design providing 65,536 ports would result in 1,366 switches and 2,732 cables. The reduction in the number of switches and cables is a direct result of the use of optics and a controller (i.e. SDN) architecture. These advantages result in significant operational cost reductions.
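For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes the standard three-tier fat tree counts for k-port switches: k^3/4 servers, 5k^2/4 switches (k^2 edge and aggregation switches plus k^2/4 core switches), and k^3/2 switch-to-switch cables. The function name fat_tree_counts is mine. The Plexxi figures are taken directly from the numbers quoted above, not derived.

```python
def fat_tree_counts(k):
    """Return (servers, switches, cables) for a three-tier fat tree of k-port switches.

    Cables counts switch-to-switch links only; server attachment cables
    are not included.
    """
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even port count")
    servers = k**3 // 4
    switches = 5 * k**2 // 4   # k^2 edge + aggregation, k^2/4 core
    cables = k**3 // 2         # k^3/4 edge-to-aggregation + k^3/4 aggregation-to-core
    return servers, switches, cables


servers, switches, cables = fat_tree_counts(64)
print(servers, switches, cables)   # 65536 5120 131072

# Plexxi figures as quoted in the text for the same 65,536 ports.
plexxi_switches, plexxi_cables = 1366, 2732
print(round(switches / plexxi_switches, 1), round(cables / plexxi_cables, 1))
```

By these numbers the fat tree needs roughly 3.7 times as many switches and about 48 times as many switch-to-switch cables.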
Customers can decide to build a Plexxi network. A Plexxi network is built to deliver a long term OPEX reduction, and we can do it at scale. We have an API driven controller. We have a dynamic interconnect. Our standards-based, merchant-silicon-based Ethernet switch was purpose built for automation from the controller. A Plexxi network is simple. It is an optical fabric orchestrateable via the controller. It is built to correlate the workloads expressed to it from the user applications via the API. In legacy networks, we spend a lot of time determining state, letting legacy protocols, QoS and firewalls figure out the topology. In a Plexxi network, we figure out what is important, calculate the topology using that information, and then program the network to configure 100% of the bandwidth. We do not want to have a network that is 20% utilized with one person per 100 devices. We think we can do a lot better than those metrics.
/wrk
The movie does a good job of illustrating the cost of dogma, which is why it works for networking.