Senior/Staff Site Reliability Engineer, Infrastructure

Singapore, Singapore

Job Description


OKX will be prioritising applicants who have a current right to work in Singapore, and do not require OKX's sponsorship of a visa.

Who We Are

At OKX, we believe the future will be reshaped by technology. Founded in 2017, we are revolutionising world systems through our cutting-edge digital asset exchange, Web3 portal and blockchain ecosystems. We reshape the financial ecosystem by offering some of the most diverse and sophisticated products, solutions, and trading tools on the market. Trusted by more than 50 million users in over 180 countries globally, OKX empowers every individual to explore the world of Web3. With our extensive range of products and services, and unwavering commitment to innovation, OKX envisions a world of financial access backed by blockchain and the power of decentralized finance.

We are innovative in the way we think, work, and in the products we create. We are also socially responsible by actively participating and encouraging employees to take part in various public welfare activities. With more than 3,000 employees around the world, we believe embracing diversity and inclusion will spark the creation of long-term value for the industry. Come Build the Future with Us now!

About the Team

The Service Reliability Engineering team envisions ensuring service stability as one of the company's core competitive advantages. By building end-to-end, chain-level risk management capabilities, we aim to achieve sustainable, automated identification and analysis of stability risks, transitioning from "reactive governance" to "proactive governance". This approach allows us to preemptively address more stability issues, improving user experience.

What You'll Be Doing

  • Ensure stability and optimize big data platforms (Alibaba Cloud DataWorks, AWS EMR, AWS Databricks, Spark, Flink) and data warehouses (MaxCompute, Hologres, Hive, ClickHouse, StarRocks, etc.).
  • Deeply understand the architecture and principles of middleware (Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway, etc.), ensuring high performance and usability.
  • Effectively optimize existing runtime environments (KVM, Docker, K8S, JVM, etc.) to ensure efficient resource utilization and stable service operation.
  • Comprehend network architecture and security, providing guidance on infrastructure stability based on network architecture and security layers, ensuring secure, stable, and efficient network communications.
  • Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.
  • Participate in rapid response and troubleshooting of system failures, continuously optimize monitoring strategies to reduce system downtime and ensure service continuity and stability.
  • Drive infrastructure automation and intelligence to improve SRE work efficiency and quality.
  • Collaborate closely with development teams, providing technical support and advice on infrastructure to jointly promote continuous product improvement and innovation.

What We Look For In You

  • Bachelor's degree or above in Computer Science or related field, with 8+ years of experience in large-scale internet or cloud computing platform development/SRE/operations.
  • In-depth understanding of big data platforms, data warehouses, middleware, runtime environments, and network technology principles and architectures, with rich practical experience and troubleshooting skills.
  • Proficient in Linux system management and optimization, familiar with scripting languages such as Shell/Python, able to write automation tools and scripts.
  • Familiar with container and cloud-native technologies like KVM, Docker, and K8S, including their architectures and principles, with extensive experience in handling common issues and failures.
  • Familiar with network protocols such as TCP/UDP/QUIC, proficient in using network commands like tcpdump, traceroute, and netstat, and tools like Wireshark, with rich practical experience in troubleshooting common network issues.
  • Rich experience with Alibaba Cloud and AWS cloud products, from architecture to usage, with extensive practice in dealing with common issues and failures.
  • Experience in building service governance systems, architecture optimization, stability assurance, capacity management, activity support, and chaos engineering is preferred.
  • Strong sense of responsibility and team spirit, with excellent problem-solving and analytical skills.

Perks & Benefits

  • Competitive total compensation package
  • L&D programs and education subsidy for employees' growth and development
  • Various team building programs and company events
  • Wellness and meal allowances
  • Comprehensive healthcare schemes for employees and dependants
  • More that we would love to tell you about along the way!

OKX



Job Detail

  • Job Id
    JD1463440
  • Industry
    Not mentioned
  • Total Positions
    1
  • Job Type
    Full Time
  • Salary
    Not mentioned
  • Employment Status
    Permanent
  • Job Location
    Singapore, Singapore
  • Education
    Not mentioned