Software Development Engineer
GoDaddy
About the role
Location Details:
At GoDaddy the future of work looks different for each team. Some teams work in the office full-time, others have a hybrid arrangement (they work remotely some days and in the office some days) and some work entirely remotely.
This is a remote position, so you’ll work from home. You may occasionally visit a GoDaddy office to meet with your team for events or meetings.
Join Our Team
GoDaddy is looking for a Software Development Engineer / Site Reliability Engineer to join our Monitoring and Observability team. In this combined SDE+SRE role, you'll design and build scalable software solutions while also owning the reliability, performance, and availability of systems serving millions of customers worldwide. You'll focus on developing high-quality applications and platforms that enable proactive monitoring, deep insights, and rapid troubleshooting. You'll also go a step further by operating those platforms, responding to incidents, and driving continuous reliability improvements across cloud and on-prem environments.
What you’ll get to do...
Software Development and Platform Engineering
Design, develop, and maintain scalable observability and monitoring platforms using Python and modern software engineering practices, including systems for metrics, logging, tracing, and visualization (e.g., the LGTM stack of Loki, Grafana, Tempo, and Mimir, as well as Prometheus, Icinga 2, Site24x7, and BigPanda).
Build and enhance production-grade software services, APIs, and tooling that improve system visibility, reliability, and developer experience.
Collaborate with cross-functional teams to define requirements, architect solutions, and deliver robust, maintainable code.
Develop automation and self-service tools to streamline workflows and improve engineering productivity.
Implement and evolve infrastructure-as-code and configuration management using tools such as Terraform, Ansible, Puppet, or Chef.
Manage and troubleshoot containerized workloads across Docker, Kubernetes (including EKS), Amazon ECS, and Fargate, ensuring configuration consistency and operational reliability.
Contribute to system design, code reviews, testing strategies, and performance optimization for large-scale distributed systems.
Support and enhance CI/CD pipelines, ensuring efficient, high-quality software delivery.
Observability and Monitoring
Implement SLIs, SLOs, and error budgets to define and track service health and reliability targets, balancing reliability with feature velocity.
Build and maintain dashboards and alerting that provide actionable insights and minimize alert fatigue; champion SLO-based alerting and noise reduction.
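As a rough illustration of the SLI, SLO, and error-budget arithmetic behind these responsibilities, here is a minimal Python sketch. The function names, the 99.9% target, and the request counts are hypothetical examples, not taken from GoDaddy's tooling.

```python
def availability_sli(total_events: int, bad_events: int) -> float:
    """SLI: fraction of events that were good over the measurement window."""
    if total_events <= 0:
        return 1.0  # no traffic means nothing failed
    return 1.0 - bad_events / total_events


def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    slo_target is the SLO as a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # nothing served, so no budget has been spent
    # The budget is the number of bad events the SLO tolerates in this window.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    sli = availability_sli(1_000_000, 250)
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"SLI: {sli:.5f}, error budget remaining: {remaining:.1%}")
```

A team balancing reliability with feature velocity might slow releases as the remaining budget approaches zero and alert on how quickly it is being spent rather than on individual failures.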
Reliability and Incident Response
Respond to automated alerts and production incidents, participating in on-call rotations supporting global operations.
Partner with engineering teams to resolve availability, performance, and security issues.
Lead blameless postmortems and root cause analysis (RCA), converting findings into dura