Senior Incident Manager
monday.com
- R&D
- Tel-Aviv, Israel
Description
We are monday.com, a global software company transforming how businesses run. Our product suite can adapt to the needs of diverse industries and use cases within one powerful platform, empowering ~270,000 customers worldwide to reimagine how work gets done, drive greater efficiency, and scale like never before.
With over 2,800 employees worldwide, we grow by prioritizing transparency and knowledge sharing. We care about the impact you make, not the hours you clock, so we encourage initiative, ownership, and fresh thinking. We back our people with flexible work, wellness and mental health support, and a work environment built on collaboration.
The Opportunity: Build Our Command Center
At monday.com, we are scaling rapidly, and with that growth comes the complexity of maintaining world-class reliability for our customers. We're looking for a strategic, process-oriented leader to transform our incident management practice.
This isn't just about fighting fires; it's about building a world-class fire department.
You will be our "Fire Chief" - the strategic owner of the entire incident response program. Your mission is to evolve our response from a reactive, chaotic process into a calm, structured, and highly effective practice that minimizes customer impact and maximizes our learning from every incident.
You are the calm in the storm, the coach for our commanders, and the driving force behind our continuous improvement.
About The Role
- Own the Program: Own and evolve our end-to-end incident management framework, including all associated policies, processes, and tooling for our 600+ person engineering organization.
- Train Our Responders: Develop, manage, and lead a comprehensive training program for our rotational, on-call Incident Commanders. You will be responsible for ensuring our commanders are confident, capable, and ready to lead under pressure.
- Champion Blameless Learning: Drive a healthy, blameless post-mortem culture. You will facilitate post-incident reviews for major incidents, ensuring root causes are identified and that actionable, high-quality follow-up items are tracked to completion.
- Drive with Data: Define, track, and report on key reliability metrics (MTTR, MTTA, incident frequency, etc.). Use this data to identify trends, pinpoint systemic risks, and advocate for strategic reliability initiatives.
- Refine Communication: Partner with our technical and corporate communications teams to refine and execute our internal and external communication strategies during incidents, including the effective use of our public status page.
- Improve Readiness: Proactively improve our operational readiness by designing and facilitating "Game Day" drills, chaos engineering experiments, and other readiness exercises.
- Manage the Toolchain: Own the administration and optimization of our incident management toolchain (e.g., PagerDuty, Incident.io, Statuspage).
- Be the Strategic Leader: During major incidents, you will act as a strategic advisor and coach to the on-duty Incident Commander, ensuring the process is followed and removing organizational roadblocks.
Requirements
- Experienced Hand: 5+ years of experience in a relevant field such as Site Reliability Engineering (SRE), Technical Program Management (TPM), DevOps, or a dedicated Incident Management role.
- Proven in the Trenches: You have direct, hands-on experience managing and participating in major technical incidents for a large-scale SaaS or cloud-based platform.
- A Natural Leader and Coach: You have experience leading under pressure and a passion for training and mentoring others. You lead with influence, not just authority.
- Process-Driven: You excel at creating, documenting, and implementing scalable processes that reduce cognitive load for teams in crisis.
- Calm and Communicative: You possess exceptional communication and interpersonal skills and have a proven ability to remain calm, focused, and effective in high-pressure situations.
- Culturally Savvy: You are deeply committed to fostering a blameless, learning-oriented culture and understand the human factors involved in incident response.
- Technically Credible: You have sufficient technical depth to understand complex distributed systems and facilitate deep technical conversations between Subject Matter Experts without needing to be the expert yourself.
Bonus Points
- Experience managing or contributing to a formal Change Management / Change Enablement process.
- Experience building an Incident Commander training program from the ground up.
- Familiarity with modern incident management automation tools like Incident.io, FireHydrant, or Blameless.