Are You Building a Company or Managing an Engineering Project?
Earlier today I read this post titled “SDN is Not a Technology, It’s A Use Case.” Shortly after, I found myself in a conversation with one of our lead algorithmic developers. We were discussing recent developments in the deployment of photonics inside the data center and papers we had read from Google researchers. At Plexxi, we have already begun thinking about what our product architecture will look like in 3-5 years. In that conversation with the algorithmic developer, it occurred to me that we sometimes become so immersed in what we are doing on a daily, weekly, and quarterly basis that we lose track of whether we are working on a project or building a company.
At Plexxi, we are building a company. This is probably the point at which I depart the room in conversations about SDN. I also find it ironic that a person at a standards body would claim that a collection of technologies, in this case SDN, still does nothing. Plexxi is my fifth startup. Over the years I have learned that engineering projects do not equate to building startups. I like to build companies, and for me innovative engineering projects (i.e. ones that carry some level of risk, wherein risk correlates to value) are really the proofs along the way to building the company. I have a notebook (now on Evernote) in which I keep a list of insights learned through all my startup stops. One of those lessons is that startups need to cycle through their product development efforts quickly. I call this fast failure.
What we are doing, almost 2.5 years into Plexxi, is bringing to market the early engineering proofs of our product set, which we believe provide the foundation for building photonically agile networks at hyper scale; in time we will get to that level. We are presently in the stage of deploying our product set to customers. We are tweaking the product set, prioritizing new features, and improving scale and capabilities. Over the past 3-6 months the number of projects being undertaken by our development teams feels as if it has gone hyperbolic. We are doing a lot, and that is being driven by market needs. The successful balance is somewhere at the intersection of way too much to get done, way too many legacy features, and so many future features that no one can find a use case. We have to balance the needs of the business today with the long term vision. This is all normal, and I would say it is a red flag for any startup not to be in this state at some point.
When NASA decided to go to the moon, they did not solve the problem of going to the moon as the first engineering objective – it was a long term objective. They divided the work into many steps, what I call engineering proofs, over eight years. I remember competing against Cisco products in the late 1980s. I feel confident in saying that the products Cisco is building today were not on their 1988 engineering plan of record. What Cisco was doing in 1988/1989 was building out the engineering proofs along the way to what they are today: IGRP, RIP2, IOS, new hardware platforms, etc. A huge, underappreciated advantage that Cisco had over Wellfleet, Proteon, CrossComm, Retix, ACC and others was that they were able to (i) cycle through the engineering proofs faster, (ii) get those proofs to market and (iii) acquire real operational experience with those products/proofs. They repaired and improved their products because they had experience with the unexpected before all the others came to market with routers and switches. Networking is not simple.
One of the architectural choices Plexxi made two years ago was to build a federated co-controller architecture. The details can be found on our website, but the short summary is that Plexxi chose to put a high performance controller on the switch element for forwarding state functions. Our central controller was designed for topology and algorithmic calculations. As network guys, we knew that central controller dependency for forwarding decisions would have negative long term performance and fault domain implications. Over this past weekend we had a central controller issue at a customer with a production network. The network switches lost connectivity with the central controller. I was a bit surprised to learn of this issue when I came into the office on Monday, because at no time did I receive an alert about a customer trouble ticket. There was a reason I did not see a trouble ticket – no Severity 1 trouble ticket was created. The network did not fail and the customer did not have an outage; the central controller was out of reach for a period of time, but the co-controllers in the switch elements kept functioning using their stored forwarding topologies. In a Plexxi network, we store active forwarding topologies and backup topologies for each in-flight MAC address in each switch element. The use of a controller for packet forwarding decisions is often why we are considered an SDN company, although we view SDN as a technique that was chosen because it was the best tool at the time to build a network differently.
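To make that failure mode concrete, here is a minimal sketch in Python of the idea as I have described it; the class and method names are my own illustration, not Plexxi code. The point is that the switch-resident co-controller answers forwarding lookups from locally stored active and backup topologies, so losing the central controller does not stop the data plane:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TopologyEntry:
    """Per-MAC forwarding state calculated by the central controller."""
    active_path: str   # current egress path identifier
    backup_path: str   # precomputed alternate for fast failover

class CoController:
    """Hypothetical switch-resident co-controller (illustrative only).

    Forwarding decisions are served from locally stored topologies,
    so losing the central controller does not take down forwarding.
    """

    def __init__(self) -> None:
        self.topologies: Dict[str, TopologyEntry] = {}
        self.central_reachable: bool = True

    def sync_from_central(self, updates: Dict[str, TopologyEntry]) -> None:
        # Only happens while the central controller is reachable.
        self.topologies.update(updates)

    def forward(self, mac: str, active_link_up: bool = True) -> Optional[str]:
        entry = self.topologies.get(mac)
        if entry is not None:
            # A stored topology still answers, controller or no controller.
            return entry.active_path if active_link_up else entry.backup_path
        # Unknown MAC and no central controller to ask: fall back to
        # traditional learn/flood behavior, like an old school switch.
        return None
```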
Some people may think that SDN still does nothing. For us, architectural decisions taken almost two years ago were proven operationally sound and correct. Engineering proofs along the way are the foundational elements of building a company. Our end goal is to build a big company by cycling through the engineering proofs that, over the long term, enable us to build networks differently and realize the true vision of the company.
As always, I could be wrong, as I hacked this post together in between meetings.
/wrk
Can you explain how, over the course of an entire weekend, no MAC addresses aged out and were flushed?
I understand being able to maintain the existing topology for some finite amount of time, as well as not being able to learn/distribute new MAC addresses during a controller outage, but over an entire weekend?
The co-controllers on the switch elements did the learning and forwarding. The central controller exists for optimization, algorithmic fitting, API access, etc. The switch elements have default forwarding mechanisms just like an old school network. When the central controller is present, we can execute better fitting of workloads; if it is not present, we use one of the many prior topologies calculated by the central controller when it was present, or we default to traditional old school networking. When the central controller comes back, it begins to analyze, inventory and calculate topologies. BTW…astute question from you.
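For readers who want that fallback order spelled out, here is a hedged sketch of the decision chain as described above; the function and parameter names are illustrative, not Plexxi's API:

```python
def choose_forwarding(mac, central, prior_topologies, default_learn):
    """Illustrative fallback chain for a forwarding decision.

    mac              -- the MAC address being forwarded
    central          -- live central controller handle, or None if unreachable
    prior_topologies -- topologies the central controller calculated earlier
    default_learn    -- traditional learn/flood forwarding, the last resort
    """
    if central is not None:
        return central.best_fit(mac)   # optimized fitting of workloads
    if mac in prior_topologies:
        return prior_topologies[mac]   # reuse a prior calculated topology
    return default_learn(mac)          # behave like a classic switch
```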