Saif
I work on ML systems — training infrastructure, kernel optimization, performance engineering.
The focus is usually on where performance disappears: memory movement, kernel overhead, communication bottlenecks, the gap between a research idea and something that runs (and scales) well on real hardware.
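A quick way to see where performance disappears is a back-of-envelope roofline check: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's balance point. A minimal sketch, using illustrative A100-class peak numbers (the constants are assumptions, not measurements):

```python
# Back-of-envelope check for whether a kernel is memory-bound.
# The hardware constants are illustrative (roughly A100-class);
# substitute your own GPU's peak numbers.

PEAK_FLOPS = 312e12    # assumed peak BF16 tensor throughput, FLOP/s
PEAK_BW = 1.555e12     # assumed peak HBM bandwidth, bytes/s

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def is_memory_bound(flops: float, bytes_moved: float) -> bool:
    """A kernel is memory-bound when its arithmetic intensity falls
    below the hardware balance point (peak FLOPs / peak bandwidth)."""
    balance = PEAK_FLOPS / PEAK_BW  # ~200 FLOPs/byte for these numbers
    return arithmetic_intensity(flops, bytes_moved) < balance

# Elementwise add of two fp16 tensors with n elements:
# n FLOPs and 6n bytes (read a, read b, write c at 2 bytes each),
# so intensity is 1/6 FLOP/byte — far below the balance point.
n = 1 << 20
print(is_memory_bound(n, 6 * n))  # → True
```

Elementwise ops land deep in the memory-bound regime, which is why fusing them into adjacent kernels usually matters more than optimizing their arithmetic.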
Lately spending most of my time on CUDA and Triton kernels, instrumentation, and end-to-end training pipelines.
I am available to work with teams on performance engineering for modern ML workloads — profiling, identifying bottlenecks, and applying targeted fixes.
Research Interests
Training Performance and Systems
Interested in the systems layer of modern machine learning: model optimization, kernel efficiency, distributed training behavior, instrumentation and scaling dynamics.
Small Models and Efficient Experimentation
Exploring how far carefully designed systems and training methods can push smaller models under constrained compute budgets.
Projects
Training Systems Research
2025 – Present
Small-scale training systems built for throughput and kernel-level understanding: Triton kernels, RL post-training, constrained hardware.
Performance Engineering Notes
Ongoing
Technical notes on GPU architecture, memory systems, profiling, numerical stability, and optimization.
Writing
An excerpt on RoPE I wrote for teaching undergrads about LLMs.
A first post to test MDX rendering with LaTeX support.
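The RoPE excerpt above covers the idea in depth; as a standalone illustration (not the code from that post), here is a minimal NumPy sketch using the rotate-half convention — pair dimension i with i + dim/2 and rotate each pair by an angle that grows with position:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Uses the rotate-half convention: dimension i is paired with
    i + dim/2, and each pair is rotated by theta = pos * base**(-2i/dim).
    """
    seq_len, dim = x.shape
    half = dim // 2
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    inv_freq = base ** (-np.arange(half) / half)  # (half,) = base**(-2i/dim)
    theta = pos * inv_freq                        # (seq_len, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1[i], x2[i]) pair.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each position applies a pure rotation, norms are preserved, and the dot product between a rotated query at position m and a rotated key at position n depends only on the offset n − m — the property that makes RoPE a relative position encoding.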