Research
My research interests, rooted in the philosophy that the fundamental unit of intelligence is FLOPs,
span large-model efficiency and elasticity (e.g., sparsity, adaptive compute),
as well as representation learning, multimodal systems, and visual generative models.
ELT: Elastic Looped Transformers for Visual Generation
Sahil Goyal*, Swayam Agrawal*, Gautham Govind Anil, Prateek Jain, Sujoy Paul, Aditya Kusupati
preprint
2026
arxiv
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models built on a recurrent transformer architecture. We propose Intra-Loop Self-Distillation (ILSD), in which student configurations (intermediate loops) are distilled from the teacher configuration (the maximum number of training loops) to ensure consistency across the model's depth within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference with a dynamic trade-off between computational cost and generation quality at the same parameter count.
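The intra-loop distillation idea can be sketched as follows; this is a minimal illustration, not the ELT implementation, and the module names, shapes, and the MSE distillation objective are assumptions for the sake of the example.

```python
# Sketch of Intra-Loop Self Distillation (ILSD) over a looped transformer:
# one shared block is applied T times, the output at the maximum loop count
# acts as the teacher, and intermediate-loop outputs are the students.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoopedTransformer(nn.Module):
    def __init__(self, dim=64, max_loops=4):
        super().__init__()
        # A single shared block reused at every loop (parameter efficiency).
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.max_loops = max_loops

    def forward(self, x):
        # Collect the hidden state after every pass over the shared block.
        states = []
        for _ in range(self.max_loops):
            x = self.block(x)
            states.append(x)
        return states

def ilsd_loss(states, task_loss_fn, target):
    # Teacher: output at the maximum loop count (stop-gradient).
    teacher = states[-1].detach()
    loss = task_loss_fn(states[-1], target)
    # Students: every intermediate loop is pulled toward the teacher,
    # so any loop count yields a usable model at inference time.
    for student in states[:-1]:
        loss = loss + F.mse_loss(student, teacher)
    return loss
```

At inference, one can then stop after any number of loops, trading compute for quality with the same weights.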
SegMASt3R: Geometry Grounded Segment Matching
Rohit Jayanti*, Swayam Agrawal*, Vansh Garg*, Siddharth Tourani, Haris Khan, Sourav Garg, Madhava Krishna
NeurIPS
2025
(Spotlight 🌟)
arxiv /
code /
website
In this work, we establish image segment matching as a benchmark task and propose a novel model architecture that enables strong downstream performance on 3D instance mapping and object-relative navigation. Segment matching is an important intermediate task in computer vision that establishes correspondences between semantically or geometrically coherent regions across images. While 2D foundation models (e.g., DINOv2, SAM2) outperform a 3D foundation model (MASt3R) off-the-shelf on this task, fine-tuning both with a simple segment-matching head and SuperGlue-style matching inverts the trend: SegMASt3R achieves state-of-the-art performance, indicating that explicit geometric reasoning is essential.
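The core of segment matching can be illustrated with a toy baseline; this is not the SegMASt3R matcher (which uses a learned head with SuperGlue-style matching), just a hedged sketch using mutual nearest neighbors over per-segment descriptors.

```python
# Toy segment matcher: given one descriptor per segment in each image
# (e.g. backbone features mean-pooled inside each segment mask),
# keep only mutual-nearest-neighbor pairs as matches.
import numpy as np

def match_segments(desc_a, desc_b):
    # desc_a: (Na, D), desc_b: (Nb, D) L2-normalized segment descriptors.
    sim = desc_a @ desc_b.T          # cosine similarity matrix (Na, Nb)
    nn_ab = sim.argmax(axis=1)       # best match for each segment in A
    nn_ba = sim.argmax(axis=0)       # best match for each segment in B
    # A pair (i, j) is kept only if i and j pick each other.
    return [(i, int(j)) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

A learned matcher replaces this hard mutual-NN rule with a differentiable assignment, but the input/output contract is the same: segment descriptors in, correspondences out.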
O3D-SIM: Open-set 3D semantic instance maps for vision language navigation
Laksh Nanwani, Kumaraditya Gupta*, Aditya Mathur*, Swayam Agrawal, Abdul Hafez, Madhava Krishna
Advanced Robotics Journal
2024
arxiv /
code /
website
In this work, we extend instance-level semantic mapping to 3D. Using foundation models for object recognition, segmentation, and feature extraction, we build a 3D point cloud with instance-level embeddings that supports language-guided navigation and object queries. The method improves both quantitative task success rates and qualitative instance identification, outperforming closed-set approaches at recognizing unseen objects.
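Language-guided querying over such an instance map can be sketched as a simple embedding lookup; this is an illustrative assumption, not the O3D-SIM pipeline, and the function names and the use of cosine similarity are hypothetical.

```python
# Open-set query over a 3D instance map: each mapped instance carries an
# embedding (e.g. from a CLIP-like encoder); a text query is embedded in
# the same space and the best-matching instances are retrieved.
import numpy as np

def query_instances(instance_embeds, text_embed, top_k=1):
    # instance_embeds: (N, D) one embedding per mapped 3D instance.
    # text_embed: (D,) embedding of the language query.
    a = instance_embeds / np.linalg.norm(instance_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    scores = a @ t                     # cosine similarity per instance
    return np.argsort(-scores)[:top_k] # indices of the top-k instances
```

Because the embeddings come from an open-vocabulary encoder rather than a fixed label set, queries for unseen object categories can still retrieve the right instance.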
* denotes equal contribution