What are the responsibilities and job description for the Site Reliability Engineer position at F2Onsite?
Remote Site Reliability Engineer
Job Duties
1. Monitoring and Alert Response
2. Traffic Management Responsibilities
3. Infrastructure Management
4. Vendor Support/Escalation
5. Change Management
6. Runbook Management and Updates
7. Incident Management
8. In-House SRE Projects
9. Knowledge Sharing and Mentorship
10. Weekly, Monthly and Yearly Reports
Monitoring and Alert Response
- Alert Monitoring
- System Health Checks
- Ticket Management: Track new SRE tickets, resolve existing ones, and follow up to ensure closure.
Traffic Management Responsibilities
- Assisting Cross-Functional Teams During Maintenance
- Mitigating Production Issues
- Facilitating Scheduled Hardware Maintenance: Proactively adjust traffic flow to accommodate planned hardware maintenance, minimizing potential disruptions.
Infrastructure Management
- On-Prem Server Management
- Collaborative Infrastructure Tasks:
  - Work closely with Network Engineering on major hardware and infrastructure changes and upgrades.
  - Coordinate with the Dev/DevOps and Platform teams during on-premises to cloud migrations.
  - Troubleshoot deployment issues during the CI/CD deployment process.
Vendor Support/Escalation
- Coordinate with vendors such as:
  - Dell: hardware-related issues.
  - ISPs (CenturyLink, Lumen, Zayo, Level3): data center internet-related issues.
  - VMware: on-premises virtualization support.
  - GCP: Google Cloud support.
Change Management
- Change Monitoring
- Critical Change Support
- Continuously monitor PagerDuty alerts, perform initial triage, and escalate major issues using predefined SRE runbooks and SOPs.
- Monitor and respond to alerts triggered over Slack and email.
- Acknowledge partner or vendor maintenance alerts and plan accordingly.
- Actively track system health using tools such as Nagios, Pingdom, Grafana, Prometheus, QAMC, and Splunk.
- Collaborate with various teams (ProdOps) to safely redirect traffic away from the data center during maintenance activities, ensuring uninterrupted service.
- In the event of production problems, promptly reroute traffic from the affected data center to maintain service continuity.
- Manage and configure on-premises servers, including onboarding and configuring new hardware.
- Onboard servers to monitoring tools and set up alerts.
- Review all production change tickets to ensure proper procedures are followed.
- Prevent unauthorized production changes.
- Work closely with cross-functional teams (NetEng, DevOps, Dev, and Platform) during production changes.
- Help execute critical changes during maintenance windows, ensuring minimal disruption.
- Monitor and validate the impact of changes post-deployment.
- Maintain and improve existing SRE runbooks by adding new troubleshooting steps and solutions.
- Ensure SRE tasks have clear and detailed documentation for consistency.
- Incident Response:
  - Act as the first responder during outages and provide updates in the Incident Management Slack channel.
  - Offer timely updates to stakeholders during ongoing incidents.
- Incident Documentation:
  - Record incident details, actions taken, and outcomes in SRE incident tracking tickets.
- Incident Resolution and RCA:
  - Conduct root cause analysis (RCA) for incidents and document findings.
  - Lead incident bridge calls and coordinate with stakeholders for resolution.
- Post-Incident Management:
  - Conduct retrospectives/postmortems to evaluate incident handling and identify areas for improvement.
  - Document incident timelines, resolution steps, and follow-up actions.
  - Ensure completion of action items to prevent recurrence.
- Participate in weekly project sync-ups to plan and execute initiatives for system scalability and reliability.
- Optimize existing tools and applications.
- Conduct POCs and onboard new tools to enhance capabilities.
- Automate repetitive tasks to improve efficiency and reduce manual effort.
- Contribute to in-house SRE projects, for example a Slack bot written in Python that collects data from Google Cloud Platform (GCP).
- Create and update detailed SOPs, runbooks, and troubleshooting guides.
- Train and mentor new SRE team members to enhance their technical skills.
- Share insights and lessons during team meetings and knowledge-sharing sessions.
- Deliver monthly, yearly, and postmortem reports on time, supporting transparency and continuous improvement in incident management.