Principal SoC Performance Architect-Microbenchmarks

Advanced Micro Devices, Inc.
$200,000.00/Yr.-$300,000.00/Yr.
United States, Texas, Austin
7171 Southwest Parkway
Jun 23, 2026


WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE:

AMD is looking for an outstanding technical contributor to drive performance analysis, characterization, and optimization of next-generation Data Center GPU (DCGPU) platforms.

This role focuses on extracting maximum performance across the full system stack (hardware, firmware, drivers, runtime, libraries, and workloads) through deep architectural understanding and data-driven methodologies. The engineer will develop and maintain microbenchmarks and system-level workloads spanning pre-silicon and post-silicon environments to enable performance validation, debug, and optimization.

THE PERSON:

As a passionate and technically strong Principal SoC Performance Engineer, you will work on highly parallel SoC architectures, leveraging a deep understanding of GPU compute, memory hierarchy, and interconnects to analyze and optimize performance across AI and HPC workloads.

You will be responsible for building microbenchmark suites and workload-driven analysis frameworks that expose performance characteristics of GPU subsystems (compute, memory, IO, interconnect, collectives) and ensure continuity across pre-silicon models, emulation, and post-silicon systems.

The ideal candidate combines strong hardware/software co-design expertise with hands-on experience in performance analysis, profiling, and system-level debugging. You are expected to identify bottlenecks across the entire stack, from kernels to runtime to hardware, and translate insights into actionable improvements for both current and future architectures.

You thrive in a fast-paced environment, are highly data-driven, and have a deep curiosity for understanding "why" performance behaves the way it does.

KEY RESPONSIBILITIES:

  • Performance Analysis & Optimization
    • Analyze and optimize performance of DCGPU systems across AI training, inference, and HPC workloads
    • Identify bottlenecks across hardware, firmware, drivers, runtime, libraries, and applications
    • Perform deep kernel-level and system-level profiling to understand performance behavior
    • Provide actionable insights to architecture, software, and design teams to improve performance
  • Microbenchmark & Workload Development
    • Design and develop targeted microbenchmarks to characterize GPU subsystems (compute, memory, interconnect, collectives)
    • Build representative system-level workloads reflecting real-world AI/HPC use cases
    • Ensure microbenchmarks correlate to application-level performance and architectural intent
    • Maintain and evolve benchmark suites across multiple GPU generations
  • Pre-Silicon & Post-Silicon Continuity
    • Enable performance validation in pre-silicon environments (simulation/emulation/models)
    • Correlate performance data across pre-silicon models and post-silicon measurements
    • Develop methodologies to reuse workloads and microbenchmarks across the full lifecycle
    • Support bring-up and early silicon performance characterization
  • Full-Stack Performance Engineering
    • Work across the entire software stack: compiler, runtime, libraries, drivers, and firmware
    • Collaborate with ROCm / AI frameworks / kernel teams to improve performance
    • Analyze interactions between workload characteristics and hardware execution
    • Optimize key kernels (e.g., GEMMs, collectives, attention) and system-level behavior
  • Tooling & Infrastructure
    • Develop and enhance performance measurement, profiling, and analysis tools
    • Enable scalable, repeatable workflows for benchmarking and analysis
    • Build automation for performance regression tracking and reporting
    • Contribute to unified infrastructure spanning pre-silicon and post-silicon environments
  • Cross-Functional Collaboration
    • Partner with SoC architecture, GPU IP, software, and system teams
    • Influence design decisions using data-driven performance insights
    • Collaborate with competitive analysis teams to understand gaps vs. industry platforms
  • Performance Modeling & Insights
    • Develop strong intuition and/or models for performance scaling and limits
    • Translate performance data into architectural feedback for future GPU designs
    • Support competitive benchmarking and performance projections

PREFERRED EXPERIENCE:

  • 10-15+ years of experience in performance engineering for GPUs, HPC systems, or highly parallel SoCs
  • Strong understanding of GPU architecture, parallel computing, and memory hierarchies
  • Experience with microbenchmark development and system-level workload analysis
  • Hands-on experience with performance profiling tools (rocprof, Nsight, perf, etc.)
  • Experience analyzing AI/HPC workloads (LLMs, training, inference, communication libraries like RCCL/NCCL)
  • Strong background in hardware/software co-design and performance optimization
  • Familiarity with pre-silicon (simulation/emulation/models) and post-silicon performance workflows
  • Programming expertise in C/C++ and Python; experience with GPU programming models (HIP, CUDA, OpenCL)
  • Strong analytical and debugging skills with a data-driven mindset
  • Experience working across the full software stack (compiler, runtime, kernels, system)
  • Exposure to performance modeling, scaling analysis, or competitive benchmarking is a plus

POSITION REQUIREMENTS:

  • Proven experience working on highly parallel compute systems or SoCs (GPUs preferred)
  • Experience developing and maintaining microbenchmarks tied to architectural features
  • Strong exposure to performance analysis across pre-silicon and post-silicon environments
  • Solid understanding of GPU compute, memory systems, and interconnect architectures
  • Experience with profiling, tracing, and performance counter analysis
  • Ability to debug complex system-level performance issues across multiple layers
  • MS/PhD in Computer Engineering, Computer Science, or related field
  • Excellent communication skills and ability to present complex performance insights clearly

ACADEMIC CREDENTIALS:

  • Bachelor's or Master's degree in a related discipline preferred

This role is not eligible for visa sponsorship.

Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's "Responsible AI Policy" is available here.

This posting is for an existing vacancy.
