Lions, Tigers and Bears Oh My… The Journey of Building a Next-generation SOA Data Services Platform
In 2010, the 50-lane Beijing-Hong Kong-Macau Expressway experienced a massive traffic jam that lasted for ten days and stretched for over 100 kilometers. Some drivers were stuck on the expressway for five days.
Even though this is an extreme case, sitting in traffic for hours is a painful experience. The dread people feel before, during, and even after the commute is something we all know well. While sitting in traffic, most of us have probably said out loud, “I pay taxes, why don’t they just build another lane or two?” or “Where is the mass transit system they promised?” or “I need to find a job that is closer to home.” Then, when the miracle does come and construction begins, we experience more gridlock and congestion and ask ourselves, “Will this be worth it?” or tell ourselves, “When they do complete it, there will just be more traffic, so we will do this all over again.”
This example may hit home with many companies that reach a crossroads when their applications, systems or infrastructure hit the proverbial wall or cliff. Most cringe when someone proposes “we have to rebuild.” The corner offices want to squeeze as much blood from the turnip as possible, and the engineers would rather cut off their own arm than refactor 5- to 10-year-old code to squeeze out more performance, especially when its creators are no longer with the company. So what does one do in this situation? There is no easy answer, and it comes down to risk tolerance.
Scalemageddon
Several years ago, we faced this same crossroads. zvelo’s systems and AI algorithms were performing well and with the highest accuracy in the industry. Because we had measured our capacity constantly for years, we knew that our cliff could be coming sooner rather than later, particularly in light of the increasing demand we were experiencing. We had to weigh the risks of increasing capacity at a strategic level as well as a technical level. zvelo’s 50-lane highway was not yet at the point of a massive shutdown; however, it wasn’t a matter of if, but when.
The internet continues to grow, and our partners needed more features, which resulted in higher volumes. Categorizing all the top-level websites is not necessarily difficult; however, according to Netcraft (http://news.netcraft.com/archives/category/web-server-survey/), there are over 1 billion websites on the web today, with millions more added per month. With the addition of new gTLDs, as well as Internet of Things devices that may also have web-server-type capabilities, this trend shows no sign of slowing. Even if it plateaus, other challenges remain, such as how to keep categories fresh for dynamic content or data that changes daily. Popular social content, malicious payloads and inappropriate content are now served deeper within each and every website. Understanding what content is being served using only the URL or the domain name itself has become much more complicated.
When to refactor, rewrite, or rebuild – oh my!
Technically speaking, when one is in this position it typically comes down to three choices: refactor the existing code, rewrite the existing code, or forget both and rebuild. None of the three is appealing, just as if someone asked you, “Do you want to fight a Lion, a Tiger, or a Bear?” Which one would you pick? At a technical level it doesn’t really matter which of the three “R’s” you face, because each one is big and has sharp claws and sharp teeth. The decision involves many factors; however, the first question that must be answered is, “What is the risk tolerance of the company?”
This is the fundamental question the leaders of a company must ask and answer. It requires commitment and also honest realization of what the company is trying to achieve together. It must be baked into the company’s goal/vision and culture. It can’t be taken lightly, nor should it be “The Quest of El Dorado” or its “White Whale”. Only when the leadership of the company makes the decision and is committed to the decision, can the next steps of the journey begin.
Our Journey
Our decision was to build the new road next to the old road. We had to rebuild; it was a necessity given what we were facing in the near future. The future we saw was one in which more and more websites, and the subpages of those websites, would carry more content than ever before.
Volume at large scale was our biggest challenge, and our cards were dealt when we decided to go into the business of providing web categorization and malicious web detection services. Refactoring or rewriting just didn’t make a lot of sense, especially given the newer technology and cost-effective cloud infrastructure options that are more abundant today. Trying to bolt on fixes or cover the cracks in the foundation was only going to delay the inevitable. We had to address the foundation, and it had to be sooner rather than later.
Once the decision was made to rebuild and requirements were gathered for our new platform, we had three main questions we had to answer:
- What is the architecture? Onsite or Cloud? Virtual or Physical? Vertical or Horizontal?
- What should be our programming language of choice? What will help us achieve our goals (requirements), and what will fit the problem we are trying to solve effectively now, not at some point in the future?
- Do we have the right people? Either our existing people had to learn the new language and architecture, or we had to find others who already knew them.
Microservices style, the new black
I am sure most have heard the term microservices. If not, in simple terms it means taking the UNIX philosophy and applying it to applications: build short, simple, clear, modular and extendable code. Since our main challenge was volume and the need to scale, the microservice architecture style was a no-brainer. We were going to build services that each did specific processing and communicated with other services through lightweight mechanisms. This style opens up the ability to scale horizontally and to fix or enhance specific services without changing the entire system, as you would have to in a monolithic style. It would have been presumptuous of us to think that adopting the microservice style would make all of our problems go away. We didn’t think that way, and we understood from the beginning what we would be in for during our journey.
GO West
Go (golang) is an open source programming language created at Google in 2007 and released publicly in 2009. At the time we started our journey, not many had heard of Go. When we interviewed new engineers, many said, “never heard of it but I am going to look into it now.” We looked at many languages, such as Java and C++, before selecting Go. We did a bake-off against a laundry list of criteria we wanted in our next language. We also built small proofs of concept (“PoCs”) and looked at integrating other technologies. Once we finished these PoCs, Go was the clear winner for us.
With any language there are pros and cons; however, the team favored Go for a number of reasons. Because it is a statically typed, garbage-collected, natively compiled language with concurrency built in at the language level, it was a perfect choice for what we wanted to achieve. Once you learn the “Go Way” (i.e. the simple and minimalistic approach), you appreciate the language even more. The team could spend a lot of time producing blogs on the ins and outs of Go, but those stories are for another time.
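As a small taste of what concurrency at the language level looks like, here is a hedged sketch of a worker pool built from goroutines, a channel and a sync.WaitGroup. The categorize function is a hypothetical stand-in for a real classifier, and categorizeAll is a name invented for this example; neither reflects our actual services.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// categorize is a hypothetical stand-in for a real classifier.
func categorize(url string) string {
	if strings.Contains(url, "news") {
		return "news"
	}
	return "uncategorized"
}

// categorizeAll fans the URLs out to a fixed number of worker goroutines
// and collects the results in a map keyed by URL.
func categorizeAll(urls []string, workers int) map[string]string {
	if workers < 1 {
		workers = 1 // at least one worker must drain the channel
	}

	jobs := make(chan string)
	results := make(map[string]string, len(urls))
	var mu sync.Mutex // guards results, which the workers share
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				c := categorize(u)
				mu.Lock()
				results[u] = c
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // lets each worker's range loop finish
	wg.Wait()
	return results
}

func main() {
	urls := []string{"news.example.com", "shop.example.com"}
	res := categorizeAll(urls, 4)
	for _, u := range urls {
		fmt.Printf("%s -> %s\n", u, res[u])
	}
}
```

Goroutines and channels are part of the language itself and sync ships with the standard library, so a pattern like this needs no third-party threading framework.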
Ship it! – Containers
We needed a way to deploy to development, integration, stage and production environments easily. Our mantra was that our microservices should run anywhere and not be bound to an OS or to the chore of patching it. Compared to virtual machines, containers were a perfect choice. Deploying to different environments was a necessity, and we also had to deploy globally and quickly. We tried to avoid cloud vendor lock-in as much as possible; however, we eventually had to tie ourselves to a particular cloud vendor to make everything work smoothly.
Docker doesn’t really need an introduction, since it is now all over the place and has mass appeal among the Development and DevOps communities. With Docker, the days of creating multiple dev environments for various developers are over. Each person can easily run on their laptop the same Docker images that run on a virtual cloud server. Eventually, we added Kubernetes for container management, deployment and scaling. Docker and Kubernetes gave us what we needed: simplicity, consistency and the ability to deploy rapidly.
We the People
When traveling to a new land, the difficulty is finding expert guides. Finding Go experts was no different. At the beginning of our journey, we had trouble finding anyone who had heard of Go, let alone knew it. We quickly adjusted our practices: instead of trying to find experts, we recognized that we had to grow them. We knew we needed to find people who had similar skills (since Go is similar to C) or the ability to pick up new languages easily thanks to strong foundations already in place. It would take some time for everyone to get up to speed; however, as long as we focused on hiring the right people with the right mindset, we had better chances than waiting for Go experts to magically appear.
The Results
The above journey transformed a team that would take months to add features to a monolithic system, or to provide temporary fixes to keep the boat afloat, into one that deploys services to any number of environments in days or a week. Once a service has been tested, we are able to deploy it to multiple datacenters around the globe in minutes, with confidence that everything is consistent. With Go, Docker, and Kubernetes we are able to scale our systems horizontally to meet demand. We played to our strengths and matched the tools to those strengths accordingly.
The journey was difficult and may not be for everyone. It took a tremendous amount of work in the beginning to achieve the results, and oftentimes many were discouraged by the speed of progress. However, because every leader on the team was committed to making it work, it did. As in the highway analogy, it takes a while to build the new road, and it also takes time to ensure that everything around the road is up to specification so that traffic can flow smoothly.
The first journey has been completed; however, with leading-edge technology, the next journey is right around the corner.