Researchers have introduced A-SelecT (arXiv:2603.25758), a method for automatically selecting which timesteps in a diffusion transformer produce the most useful representations for downstream discriminative tasks. The work removes a practical barrier to using diffusion transformers as general-purpose visual encoders.
Diffusion transformers process images through a sequence of noise levels. Different timesteps capture different visual information: early steps in the denoising trajectory, where noise is high, tend to encode global structure and semantics, while later low-noise steps capture fine-grained details and textures. For any given downstream task, some timesteps are therefore substantially more informative than others.
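To make the timestep/noise-level relationship concrete, here is a minimal sketch of the standard DDPM forward (noising) process — background illustration, not code from the A-SelecT paper. Note the indexing convention: a larger timestep index `t` means more noise, which corresponds to "early" in the denoising trajectory described above.

```python
import numpy as np

# Standard DDPM forward process: the timestep index controls how much of
# the clean image survives in the input the transformer sees.
rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)     # cumulative signal retention

def noised_input(x0, t):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((8, 8))         # toy stand-in for an image
for t in (50, 500, 950):
    snr = alpha_bars[t] / (1.0 - alpha_bars[t])
    print(f"t={t:4d}  signal-to-noise ratio = {snr:.4f}")
```

The monotonically shrinking signal-to-noise ratio is why features extracted at different timesteps carry qualitatively different information.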
Current approaches either use a fixed timestep selected through expensive manual search or aggregate features across all timesteps, diluting useful signals with information from uninformative noise levels. Manual selection does not generalize across tasks, while full aggregation wastes computation without improving results.
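The dilution problem with full aggregation can be seen in a toy contrast of the two strategies. This is illustrative only: `toy_features` is a hypothetical stand-in for a diffusion transformer's per-timestep features, rigged so that only timesteps 2 and 3 carry task-relevant signal.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10  # small number of timesteps for the demo

def toy_features(x, t):
    """Hypothetical per-timestep features: informative only at t in {2, 3}."""
    if t in (2, 3):
        return x + 0.1 * rng.standard_normal(x.shape)
    return rng.standard_normal(x.shape)   # pure noise elsewhere

x = rng.standard_normal(64)               # the task-relevant signal

fixed = toy_features(x, 2)                # manually chosen timestep
aggregated = np.mean([toy_features(x, t) for t in range(T)], axis=0)

# Averaging over all timesteps mixes in the uninformative ones, so the
# aggregate drifts much further from the underlying signal.
print("fixed-timestep error:     ", np.linalg.norm(fixed - x))
print("all-timestep average error:", np.linalg.norm(aggregated - x))
```

In this rigged setup the fixed timestep wins, but only because the right timestep was known in advance — which is exactly the manual search the article says does not generalize across tasks.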
A-SelecT learns to identify the most task-relevant timesteps during training with minimal computational overhead. On standard benchmarks, the method matches or exceeds manually optimized timestep selections while being fully automatic. As diffusion models increasingly compete with conventional vision transformers on discriminative tasks such as classification, detection, and segmentation, automatic timestep selection becomes necessary for practical deployment, where per-task manual tuning is infeasible.
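The article does not detail A-SelecT's selection mechanism, so the sketch below shows one plausible realization of learned timestep selection — an assumption for illustration, not the paper's method: learnable per-timestep logits, softmax-normalized into aggregation weights and trained against the downstream objective, so uninformative timesteps are down-weighted automatically. The toy features and objective are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
T, D = 10, 16
features = rng.standard_normal((T, D))    # toy per-timestep features
target = features[3]                      # pretend timestep 3 is the useful one
logits = np.zeros(T)                      # learnable selection parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):
    w = softmax(logits)                   # selection weights over timesteps
    pooled = w @ features                 # weighted feature aggregation
    # Gradient of L = ||pooled - target||^2 w.r.t. the weights ...
    grad_w = 2.0 * features @ (pooled - target)
    # ... pulled back through the softmax Jacobian diag(w) - w w^T.
    grad_logits = (np.diag(w) - np.outer(w, w)) @ grad_w
    logits -= 0.1 * grad_logits

print("learned weights:  ", np.round(softmax(logits), 3))
print("selected timestep:", int(np.argmax(logits)))
```

Starting from a uniform average, the weights concentrate on the informative timestep as training proceeds, with only a length-`T` vector of extra parameters — consistent with the "minimal overhead" claim above.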
