Compute Conundrum for Vendors
It is Saturday, April 28, 2012, and I have been involved in an ongoing discussion with a number of friends on the significance of Urs Hölzle’s presentation at ONS 2012. I do not care that Google has internally built servers and network switches. Google is a unique, single market. The following conclusion to me is incorrect: Google built internal servers and switches and uses OpenFlow; therefore, the companies that build servers and switches will come under enormous pressure from the network DIY movement and will most likely go out of business.
I think the following conclusion is correct: Google built a network that is adaptable to the compute requirements of their business.
Stepping back for a moment, you can find coverage of Google’s presentation here and here. I wish Google were open enough to post its presentations, as I think this would benefit the technology community at large; but for a company that wants to “organize the world’s information and make it universally accessible and useful,” this apparently does not include its own information. I was in the audience for Urs’s presentation, and while there are many points worth discussing, I want to focus on these, which I am paraphrasing:
- Cost per bit does not naturally decrease with size
- WAN bandwidth should be managed as a resource pool: place all applications on it and manage the pool based on the requirements of those applications
- Do not rely on the network elements to manage themselves: when an element fails, it takes too long to re-compute a working topology, and the process is not deterministic; it sometimes succeeds in finding a new network state and sometimes it does not
- Centralized traffic engineering utilizing SDN allows for dynamic, flow-based control that is deterministic in behavior, which mitigates the need to overprovision the network
- Managing the network as a resource pool for the needs of the applications allows for the pre-computation and continuous computation of optimal network topologies
- New server platforms (e.g. Romley) have a lot of capacity and should view the network as a resource pool for their applications
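The resource-pool idea in the bullets above can be sketched in a few lines. This is a hypothetical illustration, not a description of Google’s actual system; the application names, priorities, and bandwidth numbers are all made up. The point is only that a central allocator with global knowledge can divide a WAN pool among applications deterministically.

```python
# Illustrative sketch (hypothetical, not Google's system): treating WAN
# bandwidth as a single resource pool and dividing it among applications
# by priority. The allocation is deterministic: same inputs, same result.

def allocate(pool_gbps, demands):
    """Deterministically divide a bandwidth pool among application demands.

    demands: list of (app_name, priority, requested_gbps). Higher priority
    is served first; ties break by name so the result is reproducible.
    Returns {app_name: granted_gbps}.
    """
    grants = {}
    remaining = pool_gbps
    for app, _prio, want in sorted(demands, key=lambda d: (-d[1], d[0])):
        granted = min(want, remaining)
        grants[app] = granted
        remaining -= granted
    return grants

# Example: a 100 Gbps pool shared by three hypothetical applications.
demands = [("copy-jobs", 1, 60), ("user-traffic", 3, 50), ("index-sync", 2, 40)]
print(allocate(100, demands))
# {'user-traffic': 50, 'index-sync': 40, 'copy-jobs': 10}
```

Because the controller sees every demand at once, low-priority bulk traffic soaks up whatever is left rather than forcing the pool to be overprovisioned for the worst case.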
What this Means for Vendors:
When I wrote a few months ago that the network was like the F-4 Phantom II, I made a reference to pushing complexity to the edge of the network, and I think a portion of that post is worth repeating after the Google presentation at ONS. “We need a new network and we need to start with two design principles. The first is to be guided by the principle of end-to-end arguments in system design and push complexity to the edge; the second is to accept that the network does only one of two actions: connect and disconnect. All the protocols and techniques I listed in the third paragraph (which is about 1 bps of all the stuff out there) were created because, as networking people, we failed to heed the lessons of the aforementioned principles. I have posted about this before here and here, and this post is an extension of those thoughts, because I am continually surprised that people think the network is more important than the application and the compute point, and that the way to fix the network is to add more stuff to make it work better.
I think this is just crazy talk from people who are buried so deep in the networking pit that they do not realize they are still using Geocities and are wondering where everyone has gone. There is a new network rising, and instead of connecting a device to all devices and then using 500 tools, protocols and devices to break, shape, compress, balance and route those connections between devices, we are going to have a network that connects the compute elements as needed. We are not going to build a network to do all things; we are going to build a network that facilitates the applications at the compute point, thus pushing complexity to the edge. I think of it as the F-15, not the F-4, and with this new network we will need fewer consultants to explain how it works.”
Go forward a few months and the compute conundrum is becoming visible to vendors. Like an iceberg on a dark, still night a hundred years ago, the question we will be able to answer in the future is: which vendors avoided the collision? Companies do not go out of business overnight, but technology shifts and missed product cycles hasten their decline.
The shift and conundrum I am writing about is the presentation Google gave at ONS. For the people who manage networks, what they heard from Google is that it is possible, and one can argue advantageous, to manage flows across the network based on the requirements of the compute element. I am not describing a variation of traffic engineering in the network element using protocols, QoS, compression, or priority queuing; those are tools that live in the network and are not aware of the compute state. What I am describing is what Urs described in his presentation: centralized traffic engineering utilizing SDN allows for dynamic, flow-based control that is deterministic in behavior, and the process flows from the compute element. Therein lies the conundrum for many vendors, and an OpenFlow spigot inserted as an aftermarket add-on is not a solution to it.
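To make the pre-compute and continuous-compute idea concrete, here is a toy sketch, assuming a central controller that sees the whole topology: it computes a primary path and also pre-computes a backup path for every possible single-link failure, so failover becomes a deterministic table lookup instead of a distributed re-convergence whose outcome is uncertain. The switch names and topology are invented for illustration.

```python
# Hypothetical sketch of centralized path pre-computation. A controller
# with global topology knowledge pre-computes the answer to every
# single-link failure ahead of time; a failed link then maps directly to
# a ready-made path, deterministically.
from collections import deque

def shortest_path(links, src, dst):
    """BFS shortest path over an undirected link list; None if disconnected.
    Neighbors are visited in sorted order so the result is deterministic."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = seen[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in seen:
                seen[nxt] = node
                queue.append(nxt)
    return None

def precompute(links, src, dst):
    """Return (primary path, {failed_link: pre-computed backup path})."""
    primary = shortest_path(links, src, dst)
    backups = {link: shortest_path([l for l in links if l != link], src, dst)
               for link in links}
    return primary, backups

# A toy four-switch topology (names are made up).
links = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]
primary, backups = precompute(links, "a", "d")
print(primary)              # ['a', 'b', 'd']
print(backups[("a", "b")])  # ['a', 'c', 'd']
```

In a real controller the "continuous compute" part would re-run this as demands and topology change; the sketch only shows why a failure handled from a pre-computed table is deterministic where hop-by-hop re-convergence is not.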
The evolution of the network will result in two vendor groups. Group 1, which will be the larger group, has aging, legacy control planes of its own development that fail to actively participate in the dynamic centralized traffic engineering function. Group 1 will slowly be relegated to a passive position in terms of route calculation in the network. Group 2 will be a smaller group, but in the valuable position of participating in the calculation of the optimal network configuration because this group has knowledge of the compute element. As always, my assumptions and hypotheses could be incorrect, and it is possible that I have no idea what I am writing about.
* It is all about the network stupid, because it is all about compute. *
** Comments are always welcome in the comments section or in private. **