What are the responsibilities and job description for the Site Reliability Engineer (SRE) position at BCforward?
BCFORWARD IS LOOKING FOR
Role: SRE (Site Reliability Engineer)
Worksite Address - 880 Powder Mill Road, #1, Wilmington, DE 19803
A Site Reliability Engineer (SRE) is responsible for ensuring the stability, performance, and availability of critical systems and applications. The role does this by actively monitoring systems, automating processes, identifying potential issues, and responding rapidly to incidents, bridging the gap between development and operations through reliability engineering practices and robust system design.
About Candidate :
SRE with some DevOps and scripting experience
Hands-on AWS experience
Spark, Python, and SQL (PySpark)
Experience with any monitoring tool (Dynatrace, etc.)
Programming background: must understand the application and what it does
Key responsibilities :
Monitoring and Alerting :
Implement comprehensive monitoring systems to detect system anomalies, performance degradations, and potential failures, triggering timely alerts to relevant teams.
Incident Management :
Lead incident response efforts by quickly diagnosing the root cause of issues, coordinating with development teams to implement fixes, and conducting post-incident reviews to prevent recurrence.
Automation :
Develop and maintain automation scripts and tools to streamline repetitive tasks like system provisioning, scaling, deployment, and configuration management, reducing manual intervention.
Capacity Planning :
Proactively assess system capacity needs, identifying potential bottlenecks and scaling infrastructure to support growing demand.
System Reliability Improvement :
Analyze system behavior, identify areas for improvement, and implement changes to enhance overall system reliability and resilience.
System Design and Architecture :
Collaborate with development teams to design and architect systems with reliability and scalability in mind, including best practices for fault tolerance and redundancy.
Performance Optimization :
Analyze application performance metrics to identify areas for optimization and implement changes to improve system responsiveness.
Knowledge Sharing :
Document technical knowledge, best practices, and incident learnings in a shared knowledge base for team collaboration.
B. Skillset
a. Must-have skills: hands-on experience with AWS services, Spark, Python, SQL
b. Preferred / Ideal to have
i. Scheduling tools experience (Control-M, Autosys, or any other scheduling tool)
ii. Monitoring tools experience (Grafana, Dynatrace, or any other monitoring tool)
c. Good to have: Spark, Shell/Perl scripting, and/or Java
C. Experience
a. At least 5 years of experience in AWS, Big Data, Spark, SQL
b. At least 2-3 years in Shell Scripting
About Project :
Managing data PL migration from another platform to the cloud.
Scheduling experience
Moving data to cloud
Ability to work on multiple tasks at any given time
Rotating weekend work (remote)
Day-to-day rotating start time (need to be flexible). Shift 1: 9 to 5/6; Shift 2 (remote): 12/1 to 9 pm EST