Role: GPU Software Engineer
Location: San Jose, CA – Onsite
Duration: 12+ Months/Contract-to-Hire
Pay Rate: $90 to $100
Overview:
We are seeking an experienced GPU Software Engineer for a 12-month milestone-based engagement supporting a cutting-edge GPU software integration project. The consultant will work on AMD GPU platforms, drive AI stack development, contribute to open-source projects, and deliver performance benchmarking and integration reports across a structured set of monthly deliverables. This is a highly technical, hands-on role requiring deep expertise in GPU software stacks, ROCm, AI frameworks, and systems-level integration.
Manager’s Note: The client has confirmed flexibility on AMD-specific experience and is open to strong GPU software engineers from the NVIDIA/CUDA ecosystem, provided they have solid GPU architecture fundamentals and can ramp up on ROCm. Exposure to the ROCm ecosystem is considered a key factor for success, and onboarding support will be provided for MI210 deliverables. Open-source contributions to SGLang will be coordinated through Samsung channels; clarification is still pending on ramp-up timing before milestone tracking begins.
Position Details:
- Project Title: GPU SW Integration for Samsung Cognos
- Engagement Type: Contract / Milestone-Based (12 Months)
- Client Environment: AMD MI210 GPU, CXL Memory, NVMe Gen6, ROCm Stack
- Delivery Tools: Confluence, Jira, GitHub/GitLab (client-provided)
Key Responsibilities:
- Design and develop GPU software modules aligned with project milestones.
- Perform systems integration and end-to-end testing of AI stack software modules.
- Validate AMD Infinity Bridge and AIS on MI210 GPU hardware.
- Conduct functional and performance benchmarking (pSLC Firmware, CXL, ROCm).
- Implement and validate SGLang changes for L3 to L1 memory transfer optimization.
- Develop and contribute CaMa module changes to the ROCm software stack.
- Collaborate with the SGLang open-source community and contribute code to its public GitHub repository.
- Develop CaMa module for ROCm over Infinity Fabric/Ethernet.
- Perform E2E performance benchmarking and publish formal benchmarking reports.
- Integrate CaMa changes into the Cognos AI stack and publish integration documentation.
- Scope UALink support for CaMa and publish an investigation/feasibility document.
- Maintain all documentation, code, and status updates in Confluence, Jira, and GitHub/GitLab.
Required Skills and Qualifications:
GPU Software and Hardware
- Hands-on experience with AMD GPU platforms, specifically MI210.
- Proficiency with AMD ROCm software stack including kernel libraries and drivers.
- Experience with AMD Infinity Bridge / Infinity Fabric architecture.
- Familiarity with CXL (Compute Express Link) memory integration.
- Experience with NVMe storage and GPUDirect Storage (GDS).
AI Frameworks and Software Stack
- Experience with SGLang or similar LLM inference frameworks.
- Familiarity with AI stack installation and end-to-end workload benchmarking.
- Knowledge of GPU memory hierarchy (HBM, L1/L3 cache) and data transfer optimization.
- Proficiency in GPU kernel programming and library management (e.g., GDS, CaMa).
Programming and Tools
- Strong proficiency in C/C++ and Python for GPU/systems-level development.
- Experience with open-source contribution workflows (GitHub, pull requests, code reviews).
- Familiarity with Jira and Confluence for project management and documentation.
- Experience with pSLC firmware validation and performance benchmarking methodologies.
Soft Skills
- Ability to work independently and deliver against defined monthly milestones.
- Strong written communication skills for publishing technical reports and documentation.
- Collaborative mindset; ability to work with third-party teams (AMD, SGLang community).
Preferred Qualifications:
- Prior experience with Samsung Cognos AI stack or similar enterprise AI platforms.
- Familiarity with UALink protocol and its GPU interconnect applications.
- Prior open-source contributions to ROCm, SGLang, or similar GPU frameworks.
- Experience presenting benchmarking results to semiconductor partners (AMD, NVIDIA, etc.).


