The Forge¶

TLDR¶

Launched an initiative to improve internal workflows for engineers. Its deliverables included centralised alarms for infrastructure drift and failing deployments, command-line tooling for consuming messages from SQS queues and Kafka topics, and centralised feature flag reporting.

My role¶

Technical Lead of the Platform team, which was the team best placed to implement a change like this.

Context¶

The following is taken directly from the announcement page for the initiative.

🚀 Roadmap¶

The goal of The Forge is simple:

Enhance engineering efficiency at ER by reducing friction, improving developer experience, and equipping product teams with the tools they need to focus on building great features.

To kick off the initiative, team RPD will roll out the following. The software built next as part of The Forge will be decided in the collaborative sessions mentioned above.

What	Topic(s)	Batch
Alarms for continuously failing ECS deployments	cicd, monitoring	#1
Alarms for CloudFormation stacks going out of sync with code	cicd, monitoring	#1
Tooling to interactively poll for messages in our Kafka cluster	dev-tool, observability	#1
Tooling to interactively poll SQS queues	dev-tool, observability	#1

Batch #1¶

Alarms for continuously failing ECS deployments¶

Currently, we don't receive immediate notifications if a faulty configuration causes ECS to repeatedly spawn new (failing) tasks. Ideally, teams should be alerted as soon as an ECS task (whether newly created or existing) fails. While individual alarms can be set up using CloudWatch, The Forge will provide a centralized solution that requires no additional effort from any product team.
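To make the idea of "continuously failing" concrete, here is a minimal Python sketch, not the actual Forge implementation. It assumes the alarm logic receives EventBridge "ECS Task State Change" events and counts failed tasks per task group within a time window. The class name, the threshold, and the window length are illustrative assumptions.

```python
from collections import defaultdict, deque


def is_failed_task(event: dict) -> bool:
    """Return True if an ECS task state change event describes a failed task."""
    if event.get("detail-type") != "ECS Task State Change":
        return False
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return False
    # A task that never started counts as a failure.
    if detail.get("stopCode") == "TaskFailedToStart":
        return True
    # Otherwise, treat any container with a non-zero exit code as a failure.
    return any(
        container.get("exitCode") not in (None, 0)
        for container in detail.get("containers", [])
    )


class RepeatedFailureDetector:
    """Hypothetical helper: flags a task group once it fails repeatedly."""

    def __init__(self, threshold: int = 3, window_seconds: float = 600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # task group -> failure timestamps

    def observe(self, event: dict, now: float) -> bool:
        """Record the event; return True if the group should raise an alarm."""
        if not is_failed_task(event):
            return False
        group = event["detail"].get("group", "unknown")
        timestamps = self._failures[group]
        timestamps.append(now)
        # Forget failures that fell out of the sliding window.
        while timestamps and now - timestamps[0] > self.window_seconds:
            timestamps.popleft()
        return len(timestamps) >= self.threshold
```

Keeping the detection logic free of AWS calls means it can be exercised with hand-written events. How the alarm is delivered to the owning team is left out of this sketch.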

Alarms for CloudFormation stacks going out of sync with code¶

Most of our AWS infrastructure is provisioned using CloudFormation, with the template files committed to the source code of individual services. However, there is no automated mechanism to ensure these template files remain in sync with the actual stacks. This can lead to situations where an engineer unintentionally breaks a CloudFormation stack update because the template file was outdated.

The ideal solution is to treat template files as the single source of truth for CloudFormation updates. However, since this requires work from all teams, The Forge will first introduce a safeguard that lowers the probability of such failures: alarms will notify teams whenever their stacks fall out of sync.
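One way such a sync check could work is sketched below in Python, under stated assumptions. It compares the template committed to the repository with the template body of the deployed stack. How the deployed body is fetched (for example via CloudFormation's GetTemplate API) is left out, and both function names are hypothetical. JSON templates are compared structurally. YAML templates are compared as text, because CloudFormation's short-form tags such as `!Ref` need a custom YAML loader.

```python
import json


def normalize_template(body: str) -> str:
    """Reduce a template body to a form that ignores cosmetic differences."""
    try:
        # JSON templates: ignore key order and whitespace.
        return json.dumps(json.loads(body), sort_keys=True)
    except json.JSONDecodeError:
        # Assumed YAML: ignore trailing whitespace and blank lines only.
        lines = (line.rstrip() for line in body.splitlines())
        return "\n".join(line for line in lines if line.strip())


def templates_in_sync(committed: str, deployed: str) -> bool:
    """Return True if the committed template matches the deployed one."""
    return normalize_template(committed) == normalize_template(deployed)
```

The text comparison for YAML is deliberately conservative: a changed comment is reported as out of sync, which errs on the side of a false alarm over a missed one.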

Tooling to interactively poll for messages in our Kafka cluster¶

A large part of system-to-system communication at ER happens via protobuf-encoded Kafka messages. While efficient, the binary encoding makes it hard to read the contents of these messages, which engineers need to do every now and then. The Forge will provide pre-configured tooling that lets all engineers inspect these messages interactively in a human-readable format.
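To illustrate why the binary encoding is hard to read, and what a tool has to do to make it readable, here is a small schema-less decoder for the protobuf wire format in Python. It is a sketch, not the tooling The Forge shipped, which would more plausibly decode with the real message schemas to recover field names. Without a schema, only field numbers and raw values can be recovered. Groups (wire types 3 and 4) are not handled.

```python
import struct
from typing import Any, List, Tuple


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_raw(data: bytes) -> List[Tuple[int, Any]]:
    """Decode a protobuf payload into (field_number, raw_value) pairs."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit, little-endian
            value = struct.unpack_from("<Q", data, pos)[0]
            pos += 8
        elif wire_type == 2:  # length-delimited: string, bytes, nested message
            length, pos = read_varint(data, pos)
            if pos + length > len(data):
                raise ValueError("truncated length-delimited field")
            value = data[pos:pos + length]
            pos += length
        elif wire_type == 5:  # fixed 32-bit, little-endian
            value = struct.unpack_from("<I", data, pos)[0]
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, value))
    return fields
```

For example, `decode_raw(bytes.fromhex("089601"))` returns `[(1, 150)]`: field 1 holds the varint 150, but whether that is an order ID or a quantity is only known with the schema.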

Tooling to interactively poll SQS queues¶

Plenty of systems at ER still share events via SNS topics and SQS queues. As with the Kafka messages mentioned above, engineers need to inspect the contents of these SQS messages every now and then. The Forge will give all engineers access to pre-configured tooling for easy, interactive inspection of these SQS messages.
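A rough Python sketch of what peeking at an SQS queue could look like follows. It is an illustration under assumptions, not the actual tool. The function names are hypothetical, the client is expected to behave like a boto3 SQS client, and messages are assumed to arrive either as plain text or JSON, or wrapped in the standard SNS notification envelope.

```python
import json
from typing import Any, List


def unwrap_sns_envelope(body: str) -> Any:
    """Return the payload of a message body, unwrapping an SNS envelope if present."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return body  # plain-text message, nothing to unwrap
    if (
        isinstance(parsed, dict)
        and parsed.get("Type") == "Notification"
        and "Message" in parsed
    ):
        inner = parsed["Message"]
        try:
            return json.loads(inner)
        except (json.JSONDecodeError, TypeError):
            return inner  # the SNS message itself was not JSON
    return parsed


def peek_messages(
    sqs_client: Any, queue_url: str, max_messages: int = 10, wait_seconds: int = 5
) -> List[Any]:
    """Fetch a batch of messages without hiding them from other consumers."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=wait_seconds,  # long polling
        VisibilityTimeout=0,  # messages stay visible; nothing is deleted
    )
    return [unwrap_sns_envelope(m["Body"]) for m in response.get("Messages", [])]
```

With boto3 installed, the client would be created with `boto3.client("sqs")`. Note that each receive still increments a message's receive count, so peeking at a queue that has a dead-letter redrive policy can move messages closer to the dead-letter queue.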

[Image: The Forge landing page]

Here are some of the dashboards that were created:

[Images: service versions (1, 2), infra checks (1, 2), toggles]
