What are the responsibilities and job description for the Site Reliability Engineer position at Tentek, Inc.?
Hybrid: some sites require 2 days onsite, others 3 days, and others 4 days, depending on location. Candidates may sit in Anaheim, CA; Glendale/Burbank, CA; Orlando, FL; or Seattle, WA.
W2 candidates only - No 1099 or C2C candidates please!
The role will support/cover chaos testing for R project to increase coverage of the DX portfolio and roll out resiliency mitigations.
Top skill sets needed: Build/Release, Unix System Administration, Infrastructure as Code (Terraform, Helm, or Chef), experience launching products in a variety of hosting solutions, including public cloud (Google, AWS, Azure, Salesforce) and private cloud systems, experience with chaos testing and relevant software (Gremlin, FIS), Golang or Python, Node.js, Java, CI/CD (Jenkins or GitLab), networking basics, OAuth2, etc.
Manager Call Notes:
The following is additional info on this position from a qualifying call with the manager:
The work locations candidates may reside in are Orlando (FL), Seattle (WA), Glendale (CA) and Anaheim (CA).
• This is a hybrid work schedule ranging from 2 to 4 days onsite per week, depending on the policy of the work location.
Some sites require 2 days, some 3 days and Amy believed that Glendale may require 4 days/week onsite.
• Candidate should have an operational background and good communication skills.
• The candidate will be working with both product and application teams.
• Chaos testing experience is highly preferred but not mandatory.
Experience with Gremlin or FIS is a plus, and experience with other chaos testing software is acceptable; what matters most is that the candidate understands the concept of chaos testing.
• Preferred programming languages are Golang or Python, and the manager would like someone close to intermediate level.
• All other key requirements are typical SRE skills: CI/CD, Kubernetes, Docker, Terraform, Build & Release, cloud proficiency with AWS, Azure, or GCP (AWS preferred), ECS, monitoring tools (Splunk, AppDynamics, Grafana, Prometheus, etc.), Helm, Chef.
• Candidate may be required to be on call on a rotating basis.
EXTERNAL JOB DESCRIPTION:
Our Mission statement
● Reduce/Eliminate Guest Impacting Incidents/Outages across the Guest Experience portfolio
● Allow the product teams to focus on development and enhancement of our Products
Qualities we are looking for:
● You like working with clients - you will work with customers/product engineering to gather requirements. You like hearing stories.
● You have a passion for improvement - you have passion for improving processes (e.g. through less code, fewer manual steps, fewer systems, improving velocity).
● You are law-abiding but an agent of change - you will advocate compliance with known standards and engage engineers to improve upon processes
● You are a team player - you mentor others and contribute support documentation; here, heroes work at enriching the team
● You can multitask - you are action oriented and capable of working on concurrent projects
● You have a developer mindset and are comfortable writing code
● You have an operations mindset and some experience maintaining production systems
Expectations
In this job you are:
● Responsible for creating breakdown of tasks to meet project objectives
● Responsible for on-time ticket and task completion
● Responsible for turning strategy into multiple project objectives
● Responsible for sharing your work/experiences with the greater org
You will:
● Create/maintain/improve/troubleshoot SDLC pipelines
● Create/maintain/improve/troubleshoot monitoring technologies
● Create/maintain/improve/troubleshoot infrastructure technologies (cloud and on prem)
● Create/maintain/improve documentation on the technologies that the team builds
● Shadow operation and engineering team members in their areas of subject matter expertise
Basic Qualifications
● Have expert Build/Release skills - you will work with product development teams across the enterprise to test in code delivery SDLC pipelines
● Have expert monitoring skills - you will work on ensuring that the monitoring tools are up and effective at alerting on guest-facing issues.
● Have expert team communication skills - you will work to ensure that the larger team understands and approves of your solutions.
● Have expert technical fundamentals - you must have expert level command of Unix System Administration duties
● Have experience in the public cloud - you are proficient with launching products in a variety of hosting solutions, including public (Google, AWS, Azure, Salesforce) and private cloud systems.
● Have experience in Infrastructure as Code (IaC) - you subscribe to an Infrastructure as Code mindset (Terraform, Helm, Chef)
● Experience with chaos testing and relevant software (Gremlin, FIS)
● Pursuing a degree in Computer Science or related technical experience and authorized to work in the U.S. without requiring sponsorship now or in the future.
Preferred Qualifications
● Previous internship or large scale project experience
● Experienced with at least one of the following languages: Golang or Python
● Familiarity with Node.js, Java
● Have worked with CI/CD tooling such as Jenkins or GitLab
● Preferred experience with alerting and monitoring tools such as AppDynamics and Splunk
● Familiarity with:
○ SDLC Build and Release processes
○ Building docker images
○ Container orchestration: Kubernetes and ECS
● Proficiency with one of the following cloud providers: AWS, Google, Microsoft
● Proficiency with:
○ Terraform, Helm, or Chef
○ Networking basics (routing, firewalls, AWS security groups)
○ Troubleshooting / analysis of applications: Splunk, AppDynamics, Grafana, etc.
○ OS performance troubleshooting and ability to install and configure operating system packages
● Familiarity with:
○ OAuth2
○ Security principles on patching, compliance, change control process
Required Education
● Pursuing a degree in Computer Science or related tech