The rise of microservices, cloud platforms, containerization, and distributed systems has made computer systems incredibly powerful, but also increasingly intricate. Ensuring high availability in these complex environments requires innovative solutions. Enter Coroot, an open-source observability tool designed to simplify monitoring and prevent chaos in production. Coroot co-founders Peter Zaitsev and Nikolay Sivko shed light on their vision: empowering engineers to manage this complexity, maintain system uptime, and swiftly resolve issues.
Edited transcription
What is Coroot?
“We have a lot of open source and commercial tools that can help you in gathering telemetry data, building your dashboards, alerts, and so on, and we decided to look at the problem from different angles,” says Nikolay Sivko, CEO and co-founder of Coroot, an open-source observability tool. As systems are now more complex than ever (consisting of “microservices, architectures, hundreds or even thousands of services”), Coroot “aims to help engineers to manage this chaos in the production environment,” Nikolay explains.
To this end, Coroot aims to provide not a flood of data, but just enough insight to solve problems without overwhelming the user. It focuses on delivering actionable answers to user questions: what is happening within the system, which errors are occurring, and what their root causes are. In other words, Coroot’s bet is on actionability rather than comprehensiveness. “Being able to solve 95% of the problem is much better than having information to solve 100% of the problem, but being overwhelmed and not being able to do that,” says Coroot co-founder Peter Zaitsev.
“The different approach here is working not with telemetry data directly, but with a model of a system,” says Nikolay. Coroot relies on fine-grained metrics, logs, and traces to build a comprehensive system model that takes into account all components working together, including containers, underlying nodes, network communication, and latency. “Analyzing such a model allows us to provide you with insights,” says Nikolay.
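To make the idea of analyzing a model rather than raw telemetry more concrete, here is a minimal sketch. It is not Coroot's implementation: the `Component` class, the `root_causes` function, and the sample services are hypothetical names chosen for illustration, and the rule it applies (an unhealthy component whose dependencies are all healthy is a likely origin) is a deliberately simple assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical node in a system model (service, database, host...)."""
    name: str
    healthy: bool = True
    depends_on: list = field(default_factory=list)  # names of dependencies


def root_causes(model: dict, start: str) -> list:
    """Walk the dependency graph from `start` and return unhealthy components
    whose own dependencies are all healthy (assumed to be likely origins)."""
    found, seen, stack = [], set(), [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        comp = model[name]
        if not comp.healthy and all(model[d].healthy for d in comp.depends_on):
            found.append(name)
        stack.extend(comp.depends_on)
    return sorted(found)


# Illustrative model: the frontend is failing, but the problem starts in postgres.
model = {
    "frontend": Component("frontend", healthy=False, depends_on=["orders"]),
    "orders": Component("orders", healthy=False, depends_on=["postgres", "node-1"]),
    "postgres": Component("postgres", healthy=False, depends_on=["node-1"]),
    "node-1": Component("node-1", healthy=True),
}

print(root_causes(model, "frontend"))  # prints ['postgres']
```

The point of the sketch is that three unhealthy services collapse into a single answer once their relationships are part of the data being analyzed.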
The Power of eBPF
A core strength of Coroot is its ease of deployment, compared with other observability tools that require a complex installation and configuration for each specific use case. This is possible thanks to eBPF (Extended Berkeley Packet Filter). As Nikolay explains, “eBPF allows you to inject your small programs into any particular place of the kernel so we can instrument network connections, capture application-level queries, requests, and so on.”
eBPF works without changes to the kernel’s own code. Its programs typically store captured data in a fixed-size memory area (such as a ring buffer) within the kernel. A separate user-space program then retrieves and processes this data. Because the buffer is bounded, a user-space program that can’t keep up leads to dropped events rather than unbounded memory growth or a system crash.
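The buffering pattern described above can be sketched in a few lines of plain Python. This is a toy simulation, not eBPF code: `BoundedEventBuffer` and `simulate` are hypothetical names, and the sketch assumes that new events are simply dropped (and counted) when the buffer is full.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BoundedEventBuffer:
    """Toy stand-in for a fixed-size kernel buffer (illustration only)."""
    capacity: int
    dropped: int = 0
    _events: deque = field(default_factory=deque)

    def push(self, event: str) -> bool:
        """Producer side: store the event, or drop it when the buffer is full."""
        if len(self._events) >= self.capacity:
            self.dropped += 1
            return False
        self._events.append(event)
        return True

    def drain(self) -> list:
        """Consumer side: take everything collected so far and free the space."""
        events = list(self._events)
        self._events.clear()
        return events


def simulate() -> tuple:
    """Produce 10 events into a 4-slot buffer before the consumer runs once."""
    buf = BoundedEventBuffer(capacity=4)
    for i in range(10):
        buf.push(f"event-{i}")
    return buf.drain(), buf.dropped


received, dropped = simulate()
print(received, dropped)
```

With a slow consumer, memory use stays capped at four events; the cost is lost data, which the `dropped` counter makes visible.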
What’s more, eBPF collects data with minimal performance impact, allowing Coroot to gather essential data with little setup and provide valuable initial insights “out of the box.” eBPF programs are also verified before they are loaded, so the kernel rejects programs with problems such as infinite loops that could hang or crash the system.
Embracing OpenTelemetry
At the same time, Coroot uses OpenTelemetry, a vendor-neutral framework that helps users avoid lock-in with telemetry tools. OpenTelemetry standardizes how telemetry data (metrics, logs, traces) is collected and sent to various systems, so once an application is instrumented, adding or switching backends does not require further code changes. According to Nikolay, implementing OpenTelemetry “is a good shift on the market because you can easily replicate your telemetry data flow into many systems and you don’t need to change your code anymore; once instrumented, it can be supported by many vendors, including Coroot.”
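The "instrument once, send anywhere" idea can be illustrated with a small sketch. It deliberately does not use the OpenTelemetry API: `Pipeline`, `add_exporter`, `record`, and the two in-memory backends are hypothetical names, standing in for instrumentation and for whatever vendors receive the data.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]
Exporter = Callable[[Span], None]


class Pipeline:
    """Hypothetical telemetry pipeline: one instrumentation point, many backends."""

    def __init__(self) -> None:
        self._exporters: List[Exporter] = []

    def add_exporter(self, exporter: Exporter) -> None:
        self._exporters.append(exporter)

    def record(self, name: str, duration_ms: float) -> None:
        span: Span = {"name": name, "duration_ms": duration_ms}
        for export in self._exporters:
            export(span)  # the same data is replicated to every backend


# Two in-memory lists stand in for two different vendors' backends.
backend_a: List[Span] = []
backend_b: List[Span] = []

pipeline = Pipeline()
pipeline.add_exporter(backend_a.append)


def handle_checkout() -> None:
    # Application code: instrumented once, unaware of which backends exist.
    pipeline.record("checkout", 12.5)


handle_checkout()
pipeline.add_exporter(backend_b.append)  # adding a backend is configuration only
handle_checkout()
```

Note that `handle_checkout` is never edited when the second backend is added; only the pipeline configuration changes.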
Failure-Driven Development (FDD)
Coroot uses a reactive development approach, where encountered failures inform future development: “We use failure-driven development because we love to reproduce failures and then build a product that can detect it,” says Nikolay. “We produce various scenarios of failure and build products that highlight the most time-consuming piece of your code and the root cause of errors from distributed cases,” he adds.
On the other hand, the development team places no particular emphasis on unit tests, which are written only when a situation calls for them. Because of the complex interactions between different Coroot components (node agent, exporters, core engine), integration testing is the primary focus.
Community and development
“Now the majority of code is written by the team because we still are implementing our vision,” says Nikolay. This vision is still evolving but prioritizes features that directly address user needs. In this regard, Coroot plans to expand its deep-dive capabilities beyond Postgres, including highlighting resource-intensive queries in other databases and queue systems.
Still, while the core development is internal, Coroot welcomes contributions for smaller features like webhooks, plugins for specific databases, or integrations with other cloud providers. As for future features, Nikolay explains, “We are focusing now on simplifying the installation process because we want to provide users with a solution that can suggest to you what else you need to instrument.”
The bottom line
Trying Coroot is free and requires minimal setup. There’s a SaaS version available for those who prefer a cloud-based option and the possibility of self-hosting for those comfortable with Kubernetes. For a quick and easy setup, Nikolay and Peter recommend installing Coroot directly within your Kubernetes cluster using Helm.
The initial Coroot setup provides immediate value by offering insights into your entire system, including control plane components and all services, without additional configuration.
Additionally, Coroot welcomes feedback to refine its product vision, and users can engage with the team through various channels.
While Coroot initially targeted Kubernetes environments, it has broadened its reach. The goal is to support various infrastructure components, including legacy databases, virtual machines, and managed services like AWS RDS and ElastiCache: “We aim to support all the components to provide users with 100% coverage of the infrastructure because it’s super important to be able to see the big picture,” says Nikolay.
Originally published at https://semaphoreci.com on June 4, 2024.