M : τ

Saif

I work on ML systems — training infrastructure, kernel optimization, performance engineering.

The focus is usually on where performance disappears: memory movement, kernel overhead, communication bottlenecks, the gap between a research idea and something that runs (and scales) well on real hardware.
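A back-of-the-envelope sketch of the memory-movement point above: comparing arithmetic intensity (FLOPs per byte of DRAM traffic) for an elementwise add versus a matmul shows why the former is memory-bound and the latter is not. The helper and numbers here are illustrative, not from any specific profiler.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of memory traffic; low values mean memory-bound."""
    return flops / bytes_moved

# Elementwise add of two fp32 vectors of length n:
# n FLOPs, 12n bytes (read a, read b, write c, 4 bytes each).
n = 1 << 20
ai_add = arithmetic_intensity(n, 3 * n * 4)   # ~0.083 FLOP/byte

# Square fp32 matmul of size m x m:
# 2*m^3 FLOPs, at least 12*m^2 bytes (read A, read B, write C).
m = 4096
ai_mm = arithmetic_intensity(2 * m**3, 3 * m**2 * 4)  # ~683 FLOP/byte
```

The four-orders-of-magnitude gap is why fusing elementwise ops into surrounding kernels matters far more than shaving FLOPs from them.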

Lately spending most of my time on CUDA and Triton kernels, instrumentation, and end-to-end training pipelines.

I am available to work with teams on performance engineering for modern ML workloads: profiling, identifying bottlenecks, and applying targeted fixes.


Research Interests

Training Performance and Systems

Interested in the systems layer of modern machine learning: model optimization, kernel efficiency, distributed training behavior, instrumentation and scaling dynamics.

Small Models and Efficient Experimentation

Exploring how far carefully designed systems and training methods can push smaller models under constrained compute budgets.


Projects

Training Systems Research

2025 – Present

Small-scale training systems built for throughput and kernel-level understanding. Triton kernels, RL post-training, constrained hardware.

Performance Engineering Notes

Ongoing

Technical notes on GPU architecture, memory systems, profiling, numerical stability and optimization.


Writing

An excerpt on RoPE I wrote for teaching undergraduates about LLMs.
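The core idea from that excerpt fits in a few lines: rotate each pair of query/key dimensions by an angle proportional to the token position, so attention scores depend only on relative position. A minimal pure-Python sketch for illustration; the function names (`rope`, `dot`) are mine, not from the excerpt.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotary position embedding for one vector at position `pos`.

    Pairs (x[2i], x[2i+1]) are rotated by angle pos * base**(-2i/d).
    """
    d = len(x)
    assert d % 2 == 0, "dimension must be even"
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))
```

Because each pair rotation is orthogonal, `dot(rope(q, m), rope(k, n))` equals `dot(q, rope(k, n - m))`: the score sees only the offset `n - m`, which is the property that makes RoPE attractive for long-context extrapolation tricks.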

Hello, World Jan 1, 1999

A first post to test MDX rendering with LaTeX support.

Contact