The HPC System Administrator will manage day-to-day operations of HPC systems, ensuring stability, security, and performance. This role includes system monitoring, patching, user account management, job queue oversight, and incident resolution to support NSCC's supercomputing environment.
Roles and Responsibilities
System Operations & Maintenance
Administer HPC compute nodes, storage systems, and internal networks.
Monitor system health using tools such as Grafana, Prometheus, and custom scripts.
Apply patches, updates, and configuration changes to ensure stability.
User & Job Management
Manage user accounts, access controls, and authentication mechanisms.
Monitor job queues and assist users with job submission and scheduling issues.
Implement and enforce resource allocation policies.
Incident Response & Troubleshooting
Respond to system alerts and user-reported issues.
Document incidents, resolutions, and preventive measures.
Collaborate with engineers on escalated issues.
Security & Compliance
Perform regular security checks and vulnerability assessments.
Ensure compliance with organizational and regulatory security policies.
Documentation & Reporting
Maintain system operation logs and configuration documentation.
Generate reports on system usage, performance, and incidents.