Klint Qinami

Software Engineer

About

I’m a machine learning systems engineer at Meta, working on compilers, frameworks, and kernels for MTIA training/inference as part of the PyTorch organization. I previously worked on compiler toolchains at the startup Reservoir Labs and continued that work at Qualcomm after the acquisition, focusing on machine-learning compilers for wide-vector VLIW DSP accelerators. I was briefly a Ph.D. student at Princeton, working on bias mitigation in machine learning. As an undergrad at Columbia, I worked on computer graphics, physics-based simulation, and geometry processing. I’m especially interested in performance optimization for large, real-world systems.

Projects

  • SESE Regions. Python implementation of the Johnson-Pearson-Pingali algorithm for canonical single-entry/single-exit regions and program structure trees, with Graphviz exporters for CFG and region visualization (a minimal CFG-to-DOT sketch follows this list).
  • Offline PlantID. SwiftUI iOS app for offline plant identification using a TensorFlow Lite model trained on iNaturalist data, with on-device inference.
  • IMDb Movie Toolkit. CLI tool that aggregates IMDb titles by year, with filters for votes, ratings, genres, title type, and runtime, plus selectable output formats (HTML sample available).
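
A minimal sketch of the CFG-to-Graphviz export idea behind the SESE Regions project, assuming a toy CFG stored as an adjacency dict; the function name cfg_to_dot and the block labels are illustrative, not the project's actual API.

    def cfg_to_dot(cfg, name="cfg"):
        """Serialize a {block: [successor, ...]} CFG as Graphviz DOT text."""
        lines = [f"digraph {name} {{", "  node [shape=box];"]
        for block, successors in cfg.items():
            for succ in successors:
                lines.append(f'  "{block}" -> "{succ}";')
        lines.append("}")
        return "\n".join(lines)

    # A diamond-shaped CFG: entry branches and both paths rejoin at exit.
    cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
    print(cfg_to_dot(cfg))  # render the output with: dot -Tpng cfg.dot -o cfg.png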

Publications

Talks

  • Compiler-Driven Performance Optimization for Neural Networks. Klint Qinami. CDP Workshop 2025. Compiler optimization techniques developed for MTIA's next-generation architecture.
    Abstract

    We present compiler optimization techniques developed for MTIA's next-generation architecture, which delivers a 3x performance improvement over the previous generation. Evaluation on production ranking and recommendation models demonstrates significant improvements in memory utilization and overall system efficiency. The techniques contribute to MTIA's 6x model-serving throughput improvement and 1.5x performance-per-watt gains over the previous generation, enabling Meta to efficiently serve recommendation workloads ranging from low- to high-complexity models with 10x-100x differences in model size. We describe a multi-stage compilation pipeline that leverages PyTorch's Inductor backend while introducing novel graph-level optimizations tailored for AI accelerators. Our approach addresses several key challenges: (1) tensor view elimination, which converts explicit layout transformations into implicit tensor view manipulations; (2) memory-aware operator fusion strategies that consider both computational efficiency and memory hierarchy constraints; and (3) dynamic shape handling that maintains performance optimization paths despite runtime variability.
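
    A minimal sketch of the tensor view elimination idea, assuming a toy list-of-nodes graph IR; the Node class and the op names transpose_copy and as_strided_view are illustrative assumptions, not the MTIA compiler's or PyTorch Inductor's actual IR.

        from dataclasses import dataclass, field

        @dataclass
        class Node:
            op: str
            inputs: list
            attrs: dict = field(default_factory=dict)

        def eliminate_explicit_transposes(graph):
            """Rewrite materializing layout transforms into zero-copy strided views."""
            rewritten = []
            for node in graph:
                if node.op == "transpose_copy":
                    # Keep the logical permutation but drop the data movement:
                    # consumers read the original buffer through permuted strides.
                    rewritten.append(Node("as_strided_view", node.inputs,
                                          {"perm": node.attrs["perm"]}))
                else:
                    rewritten.append(node)
            return rewritten

        # Example: a matmul fed by an explicit transpose now reads through a view.
        g = [Node("transpose_copy", ["x"], {"perm": (1, 0)}),
             Node("matmul", ["t0", "w"])]
        print([n.op for n in eliminate_explicit_transposes(g)])  # ['as_strided_view', 'matmul']

    In a real pipeline such a rewrite must also verify that downstream kernels can consume strided inputs; here it is applied unconditionally for brevity.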

    The compiler uses memory placement strategies that automatically partition tensors between fast on-chip SRAM and external DRAM based on access patterns and lifetime analysis, with fallback strategies for oversubscription: when SRAM capacity is exceeded, spilling mechanisms migrate data to DRAM while minimizing performance impact. We also employ scheduling and tiling optimizations that decompose large tensor operations into smaller blocks that fit within memory constraints while maximizing data reuse. Additionally, graph-level transformations simplify and canonicalize graphs, eliminate redundant operations, and support both vertical and horizontal fusion to improve compute density.
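
    A minimal sketch of the placement idea only, assuming per-tensor sizes, live ranges, and access counts are already known; the greedy accesses-per-byte heuristic and the names Tensor and place_tensors are illustrative assumptions rather than the MTIA compiler's actual policy or interface.

        from dataclasses import dataclass

        @dataclass
        class Tensor:
            name: str
            size: int      # bytes
            start: int     # first op index where the tensor is live
            end: int       # last op index where the tensor is live
            accesses: int  # how often the tensor is read or written

        def place_tensors(tensors, sram_capacity):
            """Return {tensor name: "SRAM" | "DRAM"} under a simple greedy policy."""
            placement = {t.name: "DRAM" for t in tensors}
            sram_resident = []  # tensors already committed to SRAM
            # Prefer tensors that are touched often relative to their footprint.
            for t in sorted(tensors, key=lambda t: t.accesses / t.size, reverse=True):
                # SRAM usage at every point in t's lifetime must stay within capacity.
                peak = max(
                    t.size + sum(r.size for r in sram_resident
                                 if r.start <= i <= r.end)
                    for i in range(t.start, t.end + 1)
                )
                if peak <= sram_capacity:
                    placement[t.name] = "SRAM"
                    sram_resident.append(t)
            return placement

        tensors = [Tensor("act0", 64_000, 0, 3, 8),
                   Tensor("weights", 200_000, 0, 9, 2),
                   Tensor("act1", 64_000, 3, 6, 8)]
        print(place_tensors(tensors, sram_capacity=128_000))
        # {'act0': 'SRAM', 'weights': 'DRAM', 'act1': 'SRAM'}

    A production allocator would also weigh tiling, spill ordering, and data-movement cost, but the capacity check over overlapping live ranges is the core constraint this sketch illustrates.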

Undergraduate Work

Computer Science

Math

Physics

Elsewhere