Klint Qinami

Software Engineer

About

I’m a software engineer interested in performance optimization for large, real-world systems. I’m currently a member of technical staff at Anthropic.

Before Anthropic, I was a machine learning systems engineer at Meta, building compilers, frameworks, and kernels for MTIA training and inference within the PyTorch organization. Before that, I built compiler toolchains at the startup Reservoir Labs and continued that work at Qualcomm following the acquisition, focusing on machine-learning compilers for wide-vector VLIW DSP accelerators. Earlier, I was a Ph.D. student at Princeton studying bias mitigation in machine learning, and as an undergraduate at Columbia I worked on computer graphics, physics-based simulation, and geometry processing.

Projects

  • SESE Regions. Python implementation of the Johnson-Pearson-Pingali algorithm for canonical single-entry/single-exit regions and program structure trees, with Graphviz exporters for CFG and region visualization.
  • Offline PlantID. SwiftUI iOS app for offline plant identification using a TensorFlow Lite model trained on iNaturalist data, with on-device inference.
  • IMDb Movie Toolkit. CLI tool that aggregates IMDb titles by year, with filters for votes, ratings, genres, title type, and runtime, and a choice of output formats (HTML sample).

Publications

Talks

  • Compiler-Driven Performance Optimization for Neural Networks. Klint Qinami. CDP Workshop 2025. Compiler optimization techniques developed for MTIA's next-generation architecture.
    Abstract

    We present compiler optimization techniques developed for MTIA's next-generation architecture, which delivers a 3x performance improvement over the previous generation. Performance evaluation on production ranking and recommendation models demonstrates significant improvements in memory utilization and overall system efficiency. The techniques contribute to MTIA's 6x model serving throughput improvement and 1.5x performance-per-watt gains over the previous generation, enabling Meta to efficiently serve models ranging from low-complexity to high-complexity recommendation workloads with 10x-100x differences in model size. We describe a multi-stage compilation pipeline that leverages PyTorch's Inductor backend while introducing novel graph-level optimizations tailored for AI accelerators. Our approach addresses three key challenges: (1) tensor view elimination, which converts explicit layout transformations into implicit tensor view manipulations; (2) memory-aware operator fusion strategies that consider both computational efficiency and memory hierarchy constraints; and (3) dynamic shape handling that maintains performance optimization paths despite runtime variability.

    The compiler uses memory placement strategies that automatically partition tensors between fast on-chip SRAM and external DRAM based on access patterns, lifetime analysis, and fallback strategies. When SRAM capacity is exceeded, our spilling mechanisms intelligently migrate data while minimizing performance impact. We also employ scheduling and tiling optimizations that decompose large tensor operations into smaller blocks that fit within memory constraints while maximizing data reuse. Additionally, graph-level transformations simplify and canonicalize graphs, eliminate redundant operations, and support both vertical and horizontal fusions to improve compute density.
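
    As a rough illustration of the SRAM/DRAM partitioning idea described above, here is a minimal sketch of a greedy, lifetime-aware placement pass. It is not the MTIA compiler's algorithm: the `Tensor` record, the `place` function, the accesses-per-byte priority, and the step-indexed lifetimes are all hypothetical names and simplifications chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    """Hypothetical tensor record used only for this sketch."""
    name: str
    size: int       # bytes
    first_use: int  # index of the first op that touches the tensor
    last_use: int   # index of the last op that touches the tensor
    accesses: int   # number of reads/writes over its lifetime


def place(tensors, sram_capacity):
    """Greedy placement: tensors with the most accesses per byte claim SRAM first.

    A tensor goes to SRAM only if, at every step of its lifetime, it fits
    alongside the tensors already resident there; otherwise it falls back
    to DRAM (a stand-in for the spilling described in the abstract).
    """
    placement = {}
    resident = []  # tensors already assigned to SRAM
    by_density = sorted(tensors, key=lambda t: t.accesses / t.size, reverse=True)
    for t in by_density:
        # Peak SRAM already in use at any step during t's lifetime.
        peak = max(
            sum(r.size for r in resident if r.first_use <= step <= r.last_use)
            for step in range(t.first_use, t.last_use + 1)
        )
        if peak + t.size <= sram_capacity:
            resident.append(t)
            placement[t.name] = "SRAM"
        else:
            placement[t.name] = "DRAM"
    return placement


# Assumed toy workload: three short-lived activations and one large weight.
example = [
    Tensor("weights", size=96, first_use=0, last_use=3, accesses=4),
    Tensor("act0", size=32, first_use=0, last_use=1, accesses=8),
    Tensor("act1", size=32, first_use=1, last_use=2, accesses=8),
    Tensor("act2", size=32, first_use=2, last_use=3, accesses=8),
]
result = place(example, sram_capacity=64)
print(result)
```

    In this toy case the activations share SRAM because their lifetimes only partially overlap, while the large, rarely accessed weight tensor falls back to DRAM. A production compiler would likely weigh many more factors, such as bandwidth, tiling, and fusion decisions.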

Undergraduate Work

Computer Science

Math

Physics

Elsewhere