Incident Management Specialist (System Analyst)

Datum Software, Inc.
Reston, VA | Full Time
POSTED ON 2/4/2025
AVAILABLE BEFORE 4/4/2025

Job Details

Incident Management Specialist, AWS Infrastructure
Location: Reston, VA (Hybrid)

Summary

Experienced IT professional specializing in incident management and application triage within a 24/7/365 environment. Skilled in troubleshooting, diagnosing, and resolving production incidents in AWS infrastructure, with expertise in performance monitoring, root cause analysis, and incident resolution. Proven ability to work effectively with cross-functional teams, manage incident status and impact, and lead technical triage calls. Strong communication and relationship management skills, with the ability to deliver clear updates to both technical and non-technical stakeholders.

What We Are Seeking

We are seeking a candidate with expert-level knowledge of AWS infrastructure and services, particularly in the context of application triage and incident management. This role focuses on transaction tracing, log analysis, and incident resolution using the AWS Management Console and other monitoring tools. We are not looking for candidates whose AWS experience is solely in development or deployment, but rather those with hands-on experience diagnosing and resolving incidents in a cloud environment.

Key Responsibilities

  • Incident Management: Lead and manage IT production incidents to resolution, ensuring minimal downtime and effective communication of incident status, impact, and resolution.
  • AWS Expertise: Apply hands-on experience with Amazon Web Services (AWS), including EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route53, ECS, Lambda, S3, CloudWatch, CloudTrail, and WAF.
  • Cloud Monitoring: Build and leverage tools for monitoring and troubleshooting system resources in AWS, using platforms like Dynatrace, Splunk, SolarWinds, and MoogSoft.
  • Root Cause Analysis: Perform detailed transaction-level monitoring and troubleshooting of AWS infrastructure, including web, database, storage, and network layers.
  • Incident Triage: Lead technical incident triage calls, analyze system performance, and resolve incidents swiftly using monitoring tools and diagnostics.
  • Process Improvement: Proactively identify opportunities to improve operational processes, implement recommendations, and contribute to postmortem analysis for continuous improvement.
  • Collaboration: Work closely with other technical teams to influence incident resolution and share insights during follow-up calls and root cause analysis.
  • Stakeholder Communication: Provide timely updates and detailed reports on incident status and post-resolution metrics to senior leadership.

Core Skills & Technologies

  • AWS Services: EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route53, ECS, Lambda, S3, CloudWatch, CloudTrail, WAF
  • Incident Management: Hands-on management of IT incidents, triage, and resolution
  • Monitoring Tools: Dynatrace, Splunk, SolarWinds, MoogSoft, ExtraHop, Catchpoint
  • Root Cause Analysis: Incident troubleshooting, transaction tracing, and diagnostics
  • Cloud Infrastructure: Performance engineering, resource monitoring, and cloud operations
  • Communication: Strong written and verbal skills, including executive-level reporting and cross-functional collaboration
  • Technical Areas: AWS, Unix/Linux servers, Wintel servers, networks, databases (Oracle, MS SQL), SAN, virtualization

Qualifications:

  • Manage and resolve complex incidents within AWS infrastructure, providing timely updates to stakeholders and ensuring minimal production downtime.
  • Lead incident triage calls, analyzing application and infrastructure health using AWS and third-party monitoring tools (e.g., Dynatrace, Splunk).
  • Collaborate with cross-functional teams to diagnose root causes and implement corrective actions, ensuring a quick resolution for high-priority incidents.
  • Design and improve incident management processes, proactively recommending changes to minimize recurring issues and enhance system stability.
  • Conduct postmortem analysis for critical incidents, documenting root cause, corrective actions, and lessons learned to improve future performance.
  • Prior experience as an AWS Cloud Operations Specialist, including:
      ◦ Providing hands-on support for AWS-based applications, including incident monitoring, root cause analysis, and performance troubleshooting.
      ◦ Implementing tools and dashboards for monitoring AWS infrastructure performance, improving incident detection and response times.
      ◦ Working closely with development and operations teams to resolve complex production issues, ensuring timely and effective solutions.
      ◦ Supporting the transition of legacy systems to AWS, optimizing application performance and operational efficiency.
  • Bachelor's degree in Information Technology
  • AWS Certified Solutions Architect - Associate
  • AWS Certified DevOps Engineer - Professional (optional)
  • Certified Incident Management Professional (optional)

Preferred Skills & Experience:

  • Experience with Service-Oriented Architecture (SOA) and middleware management in UNIX/Linux environments.
  • Prior experience in the financial industry or with high-transaction-volume applications.
  • Familiarity with OpenTelemetry and advanced transaction monitoring tools.

Employers have access to artificial intelligence language tools ("AI") that help generate and enhance job descriptions, and AI may have been used to create this description. The position description has been reviewed for accuracy, and Dice believes it correctly reflects the job opportunity.
