What are the responsibilities and job description for the Monitoring and Alerting Engineer position at Talent Groups?
Job Title: Monitoring and Alerting Engineer
Location: Fort Worth, TX
Job Type: Contract (12 months)
Experience Required: 10 years
About the Role: We are seeking a highly skilled Monitoring and Alerting Engineer with over 10 years of experience to manage and optimize the monitoring and alerting systems for our IT infrastructure. This role focuses on ensuring the availability, reliability, and performance of critical systems and applications by proactively identifying and addressing potential issues before they impact business operations.
Key Responsibilities:
- System Monitoring: Implement and manage monitoring solutions for the performance, health, and availability of IT systems, networks, and applications.
- Alert Management: Configure and maintain alerting systems so that issues trigger timely notifications.
- Incident Response: Collaborate with support teams to resolve incidents and outages swiftly.
- Root Cause Analysis: Investigate incidents, determine the root cause, and implement corrective actions.
- Optimization: Use data analysis to identify opportunities for system optimization and performance improvements.
- Tool Evaluation: Evaluate, recommend, and integrate monitoring and alerting tools to improve system efficiency.
- Documentation & Reporting: Maintain thorough documentation, including configurations, incident reports, and performance metrics.
- Collaboration: Work closely with internal IT teams and external vendors to ensure seamless operations.
Skills & Qualifications:
- Proficiency with monitoring tools (e.g., Dynatrace, Datadog, CloudWatch, Splunk)
- Strong understanding of IT infrastructure (servers, networks, cloud environments)
- Experience with incident, problem, and change management processes
- Strong troubleshooting and analytical skills
- Effective communication and collaboration with various IT teams
- Familiarity with ITIL best practices and service management frameworks
Performance Expectations:
- Ability to work in a 24/7 environment with on-call support as needed.
- Lead event resolution processes for mission-critical IT and Telecom systems.
- Monitor systems for performance issues and optimization opportunities.
- Participate in major incident response and escalate critical events when necessary.
- Conduct root cause analysis and identify chronic system issues.
- Collaborate with senior management to address critical business-impacting events.
Qualifications:
- Hands-on experience with tools like Dynatrace, AppMon, Zabbix, SCOM, Datadog, CloudWatch, X-Ray, and Splunk.
- Self-motivated and capable of managing critical incidents in a 24/7 environment.
- Experience managing high-priority system outages and interacting with cross-functional teams.
- Availability for after-hours support on a rotational basis.
Preferred Qualifications:
- Bachelor’s degree in Computer Science, Information Systems, or a related field.
- Expertise in distributed systems, systems administration, and scripting/programming (Python, Node.js, Ruby, Perl, Bash).
- ServiceNow experience.
- Strong written and verbal communication skills.