We are looking for a skilled and driven Technical Software/Support Engineer (Operations) to join our team. In this role, you will drive our operations and incident management initiatives, ensuring our systems remain robust, scalable, and resilient. You will work closely with cross-functional teams to identify operational gaps and implement solutions that enable seamless deployment, observability, and maintenance of our system.
Key Responsibilities
Incident Management & Response (60%)
Lead or contribute to incident response efforts during critical system outages and performance degradations
Develop and maintain incident response procedures, runbooks, and escalation protocols
Conduct thorough post-incident reviews and drive implementation of preventive measures
Coordinate cross-functional teams during high-severity incidents
Build and maintain incident management tooling and automation
Manage stakeholder expectations
System Operations & Reliability (20%)
Design, implement, and maintain monitoring, alerting, and observability across our system
Develop automation tools to reduce manual operational overhead
Ensure system SLAs and SLOs are met consistently
Software Development (10%)
Build internal tools, APIs, and platforms to improve operational efficiency
Create dashboards and reporting systems for operational metrics
Collaboration & Process Improvement (10%)
Partner with development teams to improve system reliability and operability
Establish and refine operational processes and best practices
Mentor team members on incident response and operational procedures
Participate in on-call rotation and provide operational leadership during incidents
Drive continuous improvement initiatives based on operational data and feedback
Required Qualifications
Technical Skills
5+ years of software engineering experience with a focus on operations
Proficiency in at least one programming language (Python, Java/Kotlin, TypeScript or similar)
Experience with modern web application technologies and tools such as PostgreSQL, Kotlin, and AWS
Knowledge of CI/CD pipelines and deployment automation
Experience with AWS and container technologies (Docker, Kubernetes)
Understanding of monitoring and observability tools (Prometheus, Grafana, ELK stack, or similar)
Experience with APM tools (New Relic, Datadog, AppDynamics)
Experience with infrastructure-as-code tools (Terraform, Ansible, CloudFormation)
Background in DevOps or Site Reliability Engineering practices
Experience with log aggregation and analysis tools
Understanding of security operations and compliance requirements
Ability to contribute to system architecture decisions with operational considerations in mind
Operational Experience
Proven experience in incident management and response procedures
Experience with on-call responsibilities and escalation processes
Understanding of system reliability concepts (SLAs, SLOs)
Knowledge of networking, security, and database administration concepts
Experience with configuration management and deployment strategies
Soft Skills
Excellent problem-solving and analytical thinking abilities
Strong communication skills for technical and non-technical audiences
Ability to work effectively under pressure during incident situations
Collaborative mindset when working with cross-functional teams
Detail-oriented approach to documentation and process improvement