In this episode of Semaphore Uncut, we talk to Vuk Gojnic, Squad Lead for container and cloud-native engine at Deutsche Telekom Technik. Vuk describes how his internal engineering team is bringing cloud-native infrastructure to the 200,000-person telecom giant. We discuss a number of technical challenges, how to measure the success of an internal engineering team, and the difficulty of hiring and training cloud-native specialists.
- Cloud-native in telcos — not straightforward
- Key principles for cluster management: declarative, immutable
- Leveraging existing IaaS using Cluster API
- Using Flux/GitOps Toolkit
- How to succeed with internal engineering
- Large scale, small team using GitOps
- Partnerships and education to access cloud-native talent
- The road ahead
Listen to our entire conversation above, and check out my favorite parts in the episode highlights!
Like this episode? Be sure to leave a ⭐️⭐️⭐️⭐️⭐️ review on the podcast player of your choice and share it with your friends.
Darko (00:02): Hello, and welcome to Semaphore Uncut, a podcast for developers about building great products. Today, I’m excited to welcome Vuk Gojnic. Vuk, thank you so much for joining us. Please feel free to go ahead and introduce yourself.
Vuk: My name is Vuk Gojnic. I’m currently working at Deutsche Telekom in Germany. I’ve been in the industry for more than 20 years. My educational background is telecommunications, but my private background is as a programmer: I started coding in high school.
Around 2006, ’07, I decided I wanted to orient myself towards business. Then I did all sorts of things, mostly on a managerial and executive level, leading to my previous position in the Deutsche Telekom group. Two years ago, I thought, okay, this was all interesting, nice, but I felt a lack of hands-on work.
And this is how I ventured into my latest role as a so-called squad lead, specifically squad lead for container and cloud-native engine. It’s essentially our internal managed Kubernetes offering in Deutsche Telekom Technik.
Cloud-native in telcos — not straightforward
Darko (02:19): About Kubernetes and what you have been building with your team in Deutsche Telekom — can you give us an overview? What are the goals of your team, and how are you supporting the rest of the organization?
Vuk: We build and provide infrastructure and a platform for other teams, the application and developer teams, to deploy and run their cloud-native applications. It all began at the beginning of last year, when we started getting internal demands and requirements. Some teams had already started building and needed a platform to deploy their cloud-native applications. There was nothing ready at that time. The infrastructure of the day was mostly virtualized infrastructure, virtual machines. It’s important to emphasize that we, as a telco, run mostly on-prem, especially in the unit where I work, which is the network technology unit.
So that’s very much dependent on geography, on the core locations, edge locations, and so on. From that perspective, we had a lot of virtual infrastructure. You could get virtual machines and deploy many things in them, but there was essentially no cloud-native infrastructure. So, our task was to establish that.
We realized that the best way would be to build on the readiness of Kubernetes for production. But the use case that we had to cover is not a straightforward one where you have one, two, three big data centers and need to deploy a couple of big clusters. Our use case is actually based on a multi-cluster philosophy. Those Kubernetes clusters could run in core locations, which are bigger data centers, and in edge and far edge locations.
Key principles for cluster management: declarative, immutable
And when I say far edge locations, these are really the local nodes, for example, for fiber-optic broadband access, fiber to the home, or the remote mobile radio base stations, or the aggregator sites for a couple of base stations in the 5G scenario. None of the solutions really satisfied that use case in an efficient, lightweight, out-of-the-box manner. That’s why we decided to build an engine around a couple of principles: declarative, immutable, and GitOps-managed. This is what we released a couple of months ago. But we still have a lot of work ahead of us.
Darko (06:09): So, it is in production in some capacity right now, right?
Vuk: Yes, we essentially started with this immutable, declarative concept built on Kubernetes managing Kubernetes clusters. We have a notion of a management cluster. We rely heavily on a project called Cluster API, which is part of the SIG Cluster Lifecycle group of the Kubernetes project. This means that you get the management cluster, and in that management cluster you deploy certain custom resources. In that cluster you also deploy certain Cluster API infrastructure providers, which are essentially the modules that can talk to specific infrastructures.
Leveraging existing IaaS using Cluster API
We started with a VMware-based, or vSphere-based, module. That means that we can define the clusters in a manifest, in Git, and deploy that construct into a management cluster. The infrastructure provider will then talk via APIs with a remote vSphere platform, create machines, create nodes, configure networks, everything. Then what we call a tenant cluster or workload cluster would run. This is the first step we implemented, simply because the VMware vSphere infrastructure was there. That’s the Infrastructure-as-a-Service platform that has been in production for many years. We integrated with that one, and this is what’s in production.
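To make "define the clusters in a manifest" concrete, here is a minimal editorial sketch, not the team's actual configuration. It builds a Cluster API-style `Cluster` object as a plain Python dictionary and prints it as JSON. The API versions shown, the cluster name `edge-site-01`, the namespace `tenants`, and the pod CIDR are all assumptions chosen for illustration; a real setup would also need the referenced control plane and vSphere infrastructure objects.

```python
import json


def workload_cluster_manifest(name, namespace, pod_cidr):
    """Return a Cluster API-style `Cluster` object as a plain dict.

    The API versions below are assumptions; they differ between
    Cluster API releases and are not taken from the interview.
    """
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterNetwork": {"pods": {"cidrBlocks": [pod_cidr]}},
            # Reference to the object that describes the control plane.
            "controlPlaneRef": {
                "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
                "kind": "KubeadmControlPlane",
                "name": f"{name}-control-plane",
            },
            # Reference to the infrastructure-specific object, here vSphere.
            "infrastructureRef": {
                "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
                "kind": "VSphereCluster",
                "name": name,
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical cluster name, namespace, and pod network range.
    manifest = workload_cluster_manifest("edge-site-01", "tenants", "192.168.0.0/16")
    print(json.dumps(manifest, indent=2))
```

The point of the sketch is that the whole cluster is described as data. Nothing in it says how to create machines or networks; that work is left to the infrastructure provider running in the management cluster.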
We have a couple of clusters already running. The first release of our engine happened sometime around May. Since May, we have had a number of internal customers running and deploying their applications. They’re also dealing with a steep learning curve when it comes to cloud-native principles and how they differ from the virtualized world. That’s why a lot of applications are in a half-baked stage until they really come to production.
Darko (08:52): One detail, so you essentially have one more layer of abstraction? In those management clusters, are those things implemented as CRDs, or is it something more complex?
Using Flux/GitOps Toolkit
Vuk: First of all, the whole concept and architecture of Cluster API is based on a set of CRDs that abstract the definitions of the different aspects of the cluster and cluster creation. It gives us the possibility to define the entire cluster, the entire infrastructure, in a manifest, in Git. We use the GitOps agent Flux, and we are now switching to GitOps Toolkit, or Flux 2.0, which sits in the cluster and observes a certain Git repository.
When we commit into that repository, Flux in the management cluster will pick it up and apply those custom resources to the management cluster. They will then trigger the controllers behind them, which will kick in and start the workflows to create the tenant cluster. When the tenant cluster is bootstrapped, it will also be bootstrapped with Flux inside. It will be pointed to another Git repo, or a directory, that is dedicated to that cluster, and it will load various things like Prometheus, Grafana, CSI storage drivers, and all the other things that we also provide as a platform to our internal tenants.
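The flow described above follows a reconcile pattern: compare the state declared in Git with the state actually running, then act on the difference. The following toy sketch illustrates that idea only. It is not how Flux or the Cluster API controllers are implemented, and the names `plan`, `reconcile`, and `Action`, as well as the example cluster names, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str  # "create", "update", or "delete"
    name: str  # name of the object the action applies to


def plan(desired, actual):
    """Compare desired state (from Git) with actual state (in the cluster)."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            actions.append(Action("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


def reconcile(desired, actual):
    """Apply the plan to a copy of `actual` and return the converged state."""
    state = dict(actual)
    for action in plan(desired, actual):
        if action.verb == "delete":
            del state[action.name]
        else:
            state[action.name] = desired[action.name]
    return state


if __name__ == "__main__":
    # Hypothetical cluster definitions: what Git declares vs. what is running.
    git_repo = {"cluster-a": {"workers": 3}, "cluster-b": {"workers": 2}}
    running = {"cluster-a": {"workers": 1}, "cluster-c": {"workers": 5}}
    for action in plan(git_repo, running):
        print(action.verb, action.name)
    assert reconcile(git_repo, running) == git_repo
```

Running the same reconcile step again after convergence produces an empty plan, which is what allows such a loop to run continuously without operator involvement.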
How to succeed with internal engineering
Darko (10:36): You mentioned that your organization is 200,000 people, so the customer base will eventually be very big. We’d be interested to hear from you how launching that internal product works. How do you measure your success? They are not paying you, so you cannot measure by revenue. By adoption? How?
Vuk: I adopted one important metric for such cases, when it comes to internal infrastructure within the company. It is the number of unprompted thank-yous you get from customers when you help them run their things better, faster, cheaper, or even help them solve their problems.
You can take two approaches to launching a product. You can go top-down: do a lot of planning, a lot of PowerPoint, a lot of selling internally, and then launch it. But I have seen that this way usually has a lot of pitfalls. You miss some aspects, you miss that critical early feedback from internal customers, you miss the chance to adapt things in an agile way, and so on. So what we are doing is essentially a bottom-up approach. We are releasing our platform, or previews of our platform, in relatively fast cycles. We started last year in March with a small release. Then, in June or July, we had another small release, an MVP release of our current platform. Every couple of months we bring something out, get customers, collect important feedback, train our feedback loops, and adapt things.
For that, it was important for us to have a competent team on the ground who can actually act upon that feedback, rather than acting as a requirements engineering team that specifies things and then hands them off to some partner or supplier.
Large scale, small team using GitOps
Darko (16:48): And in terms of technical challenges during these two years that you have been molding Kubernetes and preparing it for your needs, you mentioned previously that GitOps was one of the practices that you adopted. Can you maybe give us more insights into those technical challenges you faced and how you resolved them?
Vuk: One thing was clear to us from the start: we need to manage a large number of clusters spread across multiple infrastructures. Another constraint was that we had to do it with a limited number of people. It became clear that we needed an architecture that would allow us not to manage these clusters actively. If you have to run something like this with 10 or 12 people, you need an architecture that hands most of the tasks, and most of the responsibility for managing all of that, to the engine you are building. This was the first challenge, at the crossroads between technical and organizational. We spent some time researching and looking for the best formula to do that, and we got a couple of inspirations.
When we connected Cluster API and GitOps, this gave us the idea that this could be on autopilot. This could be something that can be the foundation for us to build a platform that can manage itself.
Partnerships and education to access cloud-native talent
The second challenge was how to get access to enough talent to put that together. This is where we naturally partnered with Weaveworks, a company that is one of the thought leaders in GitOps practice. Essentially, they coined the term and the concept of GitOps. We connected with them, partnered, and put together a team of fantastic, brilliant young people, together with people from Weaveworks, who brought their experience with their products and open source projects. This was a way of leveraging partners to solve that challenge, because access to talent in cloud-native technology is very limited. I have to say openly, there is a real war for that talent, and you need to leverage different approaches.
Darko (22:36): It seems that quite a burden of educating fell on your team. What do you see as maybe a progression in that area? Were you able to scale that education program, or would you need to spawn another team?
Vuk: Actually, that’s an excellent question. We have not yet found a good formula, and it’s an aspect that takes a lot of time. We know people who hold Kubernetes onboarding sessions twice a month, repeating more or less the same stories over and over to new hires. At the moment, we don’t have a structured approach. We mostly educate people by doing: when they face an issue, we come and try to explain the background. That takes a lot of time, and how to scale it is the real question. On the other hand, there are some very innovative approaches being developed in the market.
The road ahead
Darko (25:01): A period of two or three years is ahead of you to spread this infrastructure across the company. And I really hope, as it progresses more, we’ll be able to hear a conference talk from you with how it went on a large scale.
Vuk: Yeah, we are actually also aiming to contribute several things back to the community. That could be a good occasion for us to share more details. We are not there yet, but we are very much driven by the desire to contribute back.
Darko (25:55): Thank you so much for this talk. It was very interesting discussing these topics, and I hope it will be interesting for our listeners also.
Vuk: Thank you as well. I’m looking forward to following up on progress in the industry overall.
Originally published at https://semaphoreci.com on February 2, 2021.