Pfizer’s mission to deliver breakthroughs that change patients’ lives is rooted in our commitment to science and innovation. Within Discovery, Preclinical, and Translational Solutions (DP&TS), we accelerate the journey from target identification to clinical translation by leveraging advanced digital technologies, AI, and data-driven insights.
We’re building a forward-thinking platform engineering team dedicated to delivering secure, scalable, and resilient infrastructure. As a Site Reliability/Operations Engineering Lead, you’ll play a pivotal role in ensuring the reliability, performance, and operational excellence of our cloud-native platforms.
This role is perfect for a high-caliber, well-rounded generalist who thrives in dynamic environments, takes initiative, and enjoys solving complex problems across infrastructure, automation, and observability. You’ll be joining a team that values curiosity, collaboration, and continuous learning. While we expect you to take ownership and solve meaningful problems, you’ll be supported by a friendly, inclusive environment with clear goals, strong mentorship, and a culture of shared success. We believe in setting our team up to thrive—not just deliver.
ROLE RESPONSIBILITIES
Each day brings a mix of routine and challenge. You’ll start by reviewing system health and platform metrics, ensuring everything is running smoothly. From there, you’ll collaborate with teammates to prioritize work—whether it’s deploying infrastructure, refining automation, or addressing a new technical challenge. Some days you’ll be deep in Terraform modules or Kubernetes tuning; other days you’ll be troubleshooting unexpected issues or helping a colleague solve a complex problem. You’ll have space for focused work, but also plenty of opportunities to engage with others, share ideas, and contribute to solutions that drive real value. The environment is fast-paced but supportive, and your contributions will directly impact the reliability, scalability, and security of the platform.
Key Responsibilities
- Ensure high availability and performance of cloud infrastructure and services (AWS, Azure)
- Build and maintain monitoring, alerting, and observability systems (e.g., Prometheus, Grafana, ELK)
- Automate operational tasks using Terraform, Ansible, and scripting languages
- Manage incident response, root cause analysis, and postmortems
- Collaborate on CI/CD pipelines and deployment strategies using GitHub Actions
- Maintain and improve containerization and container orchestration platforms (Docker, Kubernetes)
- Administer systems, databases, and networks with a focus on reliability and security
- Implement and enforce security and compliance best practices
- Continuously evaluate and integrate tools to improve operational efficiency
- Lead and grow a high-performing team of reliability and operations engineers
BASIC QUALIFICATIONS
PREFERRED QUALIFICATIONS