Being an engine
Hello, I’m Cola from MeetsMore, a new member of the SRE team in the foundation department. I'm a home-lab enthusiast passionate about managing infrastructure and services, and my journey here started with this sentence:
“Building the foundation for the next decade.”
Our mission aligns closely with the principles outlined in the SRE Book. We're dedicated to enhancing the developer experience and making their processes more efficient and less cumbersome.
I'd like to share some insights about the role of bots in our daily operations and my perspective on efficiency, focusing on two key areas:
- “True North”: Our primary goal is to expedite the delivery of code to create tangible user value as swiftly as possible.
- “Less waste”: The emphasis is on reducing waste, not just on cutting budget. If investing $1 brings back $10, why not invest more?
Simplify with technologies
In our workflow, numerous pull requests are merged into the main branch daily. However, we only consider a delivery complete once the release PR is successfully merged. This poses a question: Who is responsible for creating this crucial release PR? Typically, a team member undertakes this task, which can potentially stall the value stream. Addressing this bottleneck was my initial assignment.
To remove this bottleneck, I developed a bot that creates the release PR by calling the GitHub API from an AWS Lambda function. With this automation, the release PR no longer depends on a team member opening it by hand.
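As a rough illustration of that idea, here is a minimal sketch of a Lambda handler that opens a PR through GitHub's REST endpoint `POST /repos/{owner}/{repo}/pulls`. The `release` base branch, the PR title, and the environment variable names are assumptions made for this example, not our actual configuration.

```python
import json
import os
import urllib.request

GITHUB_API = "https://api.github.com"


def build_release_pr_request(owner, repo, token,
                             head="main", base="release", title="Release"):
    """Build the HTTP request that opens a PR from `head` into `base`.

    The branch names and title are illustrative defaults (assumptions).
    """
    body = {"title": title, "head": head, "base": base}
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{owner}/{repo}/pulls",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def lambda_handler(event, context):
    # Environment variable names are hypothetical.
    req = build_release_pr_request(
        os.environ["GITHUB_OWNER"],
        os.environ["GITHUB_REPO"],
        os.environ["GITHUB_TOKEN"],
    )
    # urlopen raises HTTPError on a non-2xx response, for example when
    # there are no new commits between the two branches.
    with urllib.request.urlopen(req) as resp:
        return {"statusCode": resp.status, "body": json.load(resp)["html_url"]}
```

Keeping the request construction separate from the handler makes the logic easy to check locally without calling GitHub.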
Now we have two scheduled tasks that create the release PR every workday. Initially, this system functioned smoothly, but we soon encountered a hiccup: occasionally, a PR would be held up by pending approvals from certain reviewers. To mitigate this, we introduced a release notifier bot in our Slack channel that reminds reviewers to complete their reviews promptly.
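The reminder step could look roughly like the sketch below. It assumes the bot has already fetched the PR's requested reviewers from GitHub (the `requested_reviewers` endpoint returns a `users` list whose entries carry a `login`), and it uses a hypothetical mapping from GitHub logins to Slack user IDs. The message wording is also invented for illustration.

```python
from typing import Dict, List, Optional


def pending_reviewer_logins(requested_reviewers: dict) -> List[str]:
    """Extract GitHub logins of reviewers who have not reviewed yet."""
    return [user["login"] for user in requested_reviewers.get("users", [])]


def build_reminder(pr_url: str, logins: List[str],
                   slack_ids: Dict[str, str]) -> Optional[str]:
    """Return the Slack reminder text, or None when nobody is pending.

    `slack_ids` maps GitHub login -> Slack user ID (hypothetical mapping).
    Logins without a mapping are shown as plain text instead of a mention.
    """
    if not logins:
        return None
    mentions = " ".join(
        f"<@{slack_ids[login]}>" if login in slack_ids else login
        for login in logins
    )
    return f"Release PR is waiting for review: {pr_url}\nPending reviewers: {mentions}"
```

Returning `None` when no reviewer is pending lets the caller skip posting, so the channel only gets a message when someone actually needs a nudge.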
Forgetting who's on-call is a frequent occurrence in our team. Typically, we either get notified by the last on-call person or have to manually check the schedule when an alert pops up. To streamline this process, we developed an on-call scheduler bot. This bot manages and updates the current on-call group based on the PagerDuty schedule.
With this implementation, team members or our alert bot can simply tag the designated group name to notify the on-call person. This ensures that incidents are addressed promptly and efficiently.
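To make the idea concrete, here is a minimal sketch of the sync step, assuming the group is a Slack user group and the on-call data comes from PagerDuty's `GET /oncalls` response, where each entry has a `user` object with an `id`. The mapping from PagerDuty user IDs to Slack user IDs is hypothetical. The resulting payload has the shape expected by Slack's `usergroups.users.update` method, which takes a comma-separated `users` string.

```python
from typing import Dict, List


def oncall_slack_ids(oncalls_response: dict,
                     pd_to_slack: Dict[str, str]) -> List[str]:
    """Translate PagerDuty on-call entries into Slack user IDs.

    `pd_to_slack` maps PagerDuty user ID -> Slack user ID (hypothetical).
    Duplicates are dropped while keeping the original order, and users
    without a mapping are skipped.
    """
    seen = set()
    result = []
    for entry in oncalls_response.get("oncalls", []):
        slack_id = pd_to_slack.get(entry["user"]["id"])
        if slack_id and slack_id not in seen:
            seen.add(slack_id)
            result.append(slack_id)
    return result


def build_usergroup_update(usergroup_id: str, slack_ids: List[str]) -> dict:
    """Payload for Slack's usergroups.users.update method."""
    return {"usergroup": usergroup_id, "users": ",".join(slack_ids)}
```

Because the update replaces the whole member list, running it on every schedule change keeps the group equal to the current on-call rotation.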
The final bot I'd like to discuss plays a crucial role in mitigating alert storms. Our communication channel is integrated with the Datadog bot, which bombarded us with separate notifications for alerts, warnings, and recoveries. This made it difficult to track which alerts were resolved and which were still pending. Here is the solution we devised:
- Integration with Datadog: We began by adding a webhook in Datadog that sends data whenever a recovery event is triggered. This webhook points to the URL of our Lambda function.
- Lambda Function Logic: Within this function, the logic is straightforward yet effective. When a recovery message is received, it parses the content, then utilizes the Slack API to retrieve the corresponding alert or warning message in the channel. We use regex to match and extract the original message's timestamp for reference.
- Slack Interaction: The bot then replies to the recovered alert in a Slack thread, marking it with a green check to indicate resolution.
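The matching logic above can be sketched as follows. This is an illustration under assumptions, not our exact code: it assumes the channel messages (as returned newest-first by Slack's `conversations.history`) contain a Datadog monitor URL of the form `/monitors/<id>` in their `text`, and that recovery posts contain the word "Recovered". The reply text is also invented.

```python
import re
from typing import List, Optional

# Assumed message format: the monitor URL appears in the message text.
MONITOR_ID_RE = re.compile(r"/monitors/(\d+)")


def extract_monitor_id(text: str) -> Optional[str]:
    """Pull the monitor ID out of a message, if one is present."""
    match = MONITOR_ID_RE.search(text)
    return match.group(1) if match else None


def find_alert_ts(messages: List[dict], monitor_id: str) -> Optional[str]:
    """Return the `ts` of the most recent alert or warning for a monitor.

    `messages` is expected newest-first. Recovery posts are skipped so the
    bot replies to the original alert rather than to another recovery.
    """
    for msg in messages:
        text = msg.get("text", "")
        if "Recovered" in text:
            continue
        if extract_monitor_id(text) == monitor_id:
            return msg["ts"]
    return None


def build_thread_reply(channel: str, alert_ts: str) -> dict:
    """Payload for chat.postMessage that replies inside the alert's thread."""
    return {"channel": channel, "thread_ts": alert_ts, "text": "Recovered"}


def build_check_reaction(channel: str, alert_ts: str) -> dict:
    """Payload for reactions.add that marks the alert with a green check."""
    return {"channel": channel, "timestamp": alert_ts, "name": "white_check_mark"}
```

If no matching alert is found, `find_alert_ts` returns `None`, and the handler can fall back to posting the recovery as a normal message.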
An additional advantage of this system is the significant reduction in channel noise, which not only declutters our communication but also serves as an educational tool for new team members. They can more easily understand how to manage alerts, locate debugging information, and so forth.
Most of our bots are powered by serverless Lambda functions, but there are areas ripe for improvement. These include optimizing runtime duration, enhancing status management within Lambda, and potentially utilizing the ARM64 architecture.
A notable example is one of our Lambda functions that currently takes over 10 seconds to execute. By leveraging DynamoDB or S3 for managing the status of Lambda functions, I foresee a potential reduction in runtime by up to 50%.
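One possible reading of this idea, sketched below, is to persist state between invocations so a function can skip repeated slow lookups. The example is hypothetical: it caches the Slack timestamp of an alert under its monitor ID, and uses an in-memory class as a stand-in for a DynamoDB table or S3 object so the sketch runs anywhere. It does not demonstrate the 50% figure, which is only an estimate.

```python
from typing import Callable, Dict, Optional


class InMemoryStateStore:
    """Stand-in for an external store such as DynamoDB or S3 (assumption).

    A real Lambda would need an external store, because in-process memory
    is not guaranteed to survive between invocations.
    """

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._items.get(key)

    def put(self, key: str, value: str) -> None:
        self._items[key] = value


def lookup_alert_ts(store: InMemoryStateStore, monitor_id: str,
                    slow_lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the stored timestamp, calling `slow_lookup` only on a miss."""
    ts = store.get(monitor_id)
    if ts is None:
        ts = slow_lookup(monitor_id)  # e.g. scanning channel history
        if ts is not None:
            store.put(monitor_id, ts)
    return ts
```

The saving comes from replacing a scan over external API results with a single key lookup on every call after the first.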
Furthermore, the management of these Lambda functions merits attention. We currently use CDKTF (CDK for Terraform) to manage all our infrastructure, including Lambda functions, in a GitOps manner. This approach keeps infrastructure changes in the same review workflow as our code and makes them easier to apply.
A bot serves as a gateway to a service, allowing us to interact seamlessly within Slack to accomplish tasks efficiently. Currently, our workspace utilizes 175 bots, and this number is only set to grow. I anticipate a continued increase as we integrate more bots designed to automate repetitive tasks and provide self-service options for engineers, further enhancing our productivity and convenience.