What are the responsibilities and job description for the Sr. Site Reliability Engineer-Production Engineering position at Addepar?
The Role
We are looking for a stunning new colleague to help drive the transformation of Addepar’s Production Engineering team towards a platform that enables high-level declarative infrastructure orchestration and operations. This platform closely integrates our Compute, Network, and Storage control planes, allowing us to evolve blueprints of highly efficient, fast-to-iterate services tailored to various product areas within the company while abstracting the nuances of the underlying infrastructure away from our developers.
The ideal candidate will play a leading role in implementing and maintaining the administration of Addepar's production infrastructure. You will combine experience leading innovative solutions across functional teams with hands-on development experience in AWS/cloud, Linux/Unix, networking, scripting, containerisation, Kubernetes, Terraform, information security, debugging, and monitoring/observability to design, deploy, monitor, and automate all operational aspects of Addepar's platform.
Must-have Skills
- Recent and proficient experience with Java, Python, Go, or similar.
- Recent and proficient experience with Terraform and infrastructure as code (IaC).
- Experience building & operating highly reliable distributed systems in a cloud environment.
- Passion for technology, pragmatic thinking, and the ability to jump into an ambiguous area and break down complex problems.
What You’ll Do
- Use Kubernetes (k8s) to maintain and operationalise container infrastructure.
- Design, build, and maintain automated CI/CD pipelines using Jenkins, ArgoCD, AWS CodeBuild/CodePipeline, GitHub Actions, or similar.
- Deploy and maintain Kubernetes and related technologies as part of application deployments to various clusters.
- Use Terraform to develop, operationalise, and evangelise infrastructure as code for scaling the Addepar platform across regions.
- Operationalise and evangelise application and infrastructure upgrades/patches.
- Gain deep application-level knowledge to communicate infrastructure requirements and constraints to developers, QA, and management, and implement dashboards for cost and inventory management.
- Monitor and troubleshoot our infrastructure and application stack using logging and monitoring tools.
- Collaborate with cross-functional teams to identify and resolve application or infrastructure issues.
- Work with engineering and operations teams to improve, document, and establish processes and broadly improve the operability and security of our systems.
- Participate in the on-call rotation and contribute to resolving incidents.
- As a Senior Engineer, you will be expected to mentor more junior engineers as well as serve as a contributor to the engineering culture of the SRE team.
Who You Are
- Ideally, you'll have a Bachelor's or graduate degree in Computer Science or a related field.
- Extensive experience in the SRE/DevOps/Systems Engineer field.
- Cloud infrastructure fundamentals (we use AWS).
- Strong programming/scripting skills in various common languages (we use Python [boto3], Bash, and general UNIX tools; Java is a plus).
- Broad and deep experience with any applied aspect of UNIX/BSD/Linux internals (we use Ubuntu).
- Containerisation experience with k8s (we use kOps, EKS, and ECS).
- Networking fundamentals, including IPv4 and IPv6 (AWS VPC is a plus).
- Demonstrable experience with infrastructure-as-code tools such as Terraform
- Experience with monitoring and alerting tools such as Prometheus, Grafana, Sentry, Sumo Logic, or AWS cloud-native tools.
- Good interpersonal skills to collaborate with multi-functional teams
- Demonstrable experience writing systems automation tooling is a plus (if you have open source code to share, we're happy to discuss it).
- Experience administering large-scale databases such as Aurora MySQL and MongoDB is a plus.
- Experience upgrading and patching vendor tools is a plus.
- Exposure to industry practices in financial services is a plus