Our team manages a decade-old system consisting of nine custom microservices. Because we had recently been spending more time reacting to problem reports, we decided to articulate a common set of system health assertions that span the system end to end. Based on those assertions, the team developed a set of automated reports that validate system health on a daily basis.
These daily system health reports have created accountability and focused team members on the most important operational issues. They have also inspired the creation of several tools to resolve and prevent operational problems.
Our Product Manager, Technical Lead, and Software Development Team Manager will share the story of this transformational initiative from their individual perspectives.