Ref: PP000-38889
Job description / Role
Full Time
Saudi Arabia
Any Nationality
IT - Software & Web Development
IT, Software & Internet Services
Description
We are looking for a senior site reliability engineer (SRE) to help design, scale, and secure our rapidly growing platform infrastructure.
You will work across all critical systems — from customer-facing applications and APIs to internal platforms and data services — ensuring availability, performance, and cost efficiency at scale.
You’ll be hands-on with Kubernetes, observability, GitOps, automation, and cloud infrastructure, while partnering closely with application, platform, and data teams to deliver a highly reliable and self-healing environment.
This role is ideal for an engineer who thrives on complex distributed systems, loves to automate everything, and can balance speed, stability, and cost-efficiency in production.
Responsibilities
Platform & Infrastructure Reliability
- Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS/AKS/GKE) clusters.
- Build self-healing, auto-scaling systems that minimize toil and manual intervention.
- Optimize networking, ingress/egress traffic control, and service mesh for secure and performant communication.
- Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) in Kubernetes environments.
- Own backup, disaster recovery, replication, and failover strategies to meet RPO/RTO targets for critical data services.
- Optimize storage performance and cost through multi-tier strategies, hot/cold data separation, and S3/offloading lifecycle policies.
- Troubleshoot and recover Kubernetes persistent volumes confidently during incidents (StorageClasses, CSI drivers, PVC issues).
- Secure and scale object storage platforms (e.g., MinIO/S3-compatible) and integrate with workloads for high-throughput data pipelines.
- Work with block storage (EBS/io2/gp3) and shared file systems (EFS, NFS) to balance performance, resiliency, and cost.
Automation & Delivery
- Champion GitOps and CI/CD best practices (ArgoCD, Flux, GitHub Actions).
- Build automation for infrastructure provisioning and upgrades using Terraform, Helm, and Kubernetes Operators.
- Reduce release risk through progressive delivery strategies (blue/green, canary, spot instance rolling updates).
Observability & Incident Response
- Own the monitoring and alerting stack (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch).
- Lead incident management and postmortems to prevent recurrence.
- Provide real-time visibility into system health, performance, and cost metrics.
Security & Compliance
- Implement least-privilege IAM policies, secure service-to-service communication, and network ACLs/firewalls.
- Enforce Kubernetes RBAC, secret management, and secure image supply chain.
- Participate in audit readiness and compliance efforts.
Performance & Cost Optimization
- Analyze and tune system performance under scale (CPU/memory/IO).
- Partner with product and platform teams to right-size clusters, databases, and storage tiers.
- Introduce cost visibility dashboards for engineering leadership.
Preferred Qualifications
- Experience managing mission-critical systems at scale (high traffic, multi-region).
- Proven cost optimization in cloud/Kubernetes environments.
- Familiarity with service mesh (Istio, Linkerd) or advanced networking/egress control.
- Experience with data platform components (Airflow, Debezium, ClickHouse, etc.) is a plus but not required.
- Strong communication and teamwork skills; able to collaborate across engineering, DevOps, security, and product teams.
Requirements
- 8+ years in SRE/DevOps/Infrastructure engineering roles.
- Deep Kubernetes expertise (multi-cluster, Helm chart development, advanced networking).
- Strong experience with GitOps workflows using ArgoCD/Flux.
- Expertise with AWS (preferred) or Azure/GCP, plus Infrastructure-as-Code (Terraform, Pulumi, CloudFormation).
- Advanced knowledge of SQL & NoSQL databases (MySQL/Aurora, PostgreSQL, MongoDB, Redis).
- Scripting/automation skills in Python, Bash, or Go.
- Solid background in monitoring/observability (Prometheus, Grafana, Loki, ELK/OpenSearch, VictoriaMetrics).
- Experience with CI/CD at scale and managing production incidents.
- Experience with streaming/messaging (Kafka, RabbitMQ, or similar).
Benefits
- Comprehensive training and development programs.
- Performance-based bonus incentives.
- Flexible work-from-home options.