Presented by Martine Hjelkrem-Tan, PhD Research Fellow in the Digital Signal Processing and Image Analysis group at the University of Oslo
In this talk, we discuss how sparsity and sampling can be used to improve modern vision models. First, we show that these principles can help us discover which regions of an image a model chooses to attend to when performing a given task. Second, we show that self-supervised foundation models exhibit unwanted positional noise in their patch tokens, and we propose a simple cleaning method. Finally, we discuss how these findings can help guide future frameworks for training foundation models for vision.
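The abstract does not specify how the patch-token cleaning works. As a hedged illustration only (not the speaker's method), one simple idea for removing an additive, position-dependent artifact is to estimate it as the per-position mean of patch tokens over many images and subtract it, since image-specific content averages out while the shared positional component remains:

```python
import numpy as np

# Illustrative sketch with synthetic data (an assumption, not the talk's method):
# patch tokens = image-specific content + a positional artifact shared
# across all images at each patch position.
rng = np.random.default_rng(0)

n_images, n_patches, dim = 512, 196, 64
content = rng.normal(size=(n_images, n_patches, dim))           # varies per image
positional_noise = 2.0 * rng.normal(size=(1, n_patches, dim))   # shared across images

tokens = content + positional_noise

# Estimate the positional component as the per-position mean over images,
# then subtract it from every image's tokens.
pos_estimate = tokens.mean(axis=0, keepdims=True)
cleaned = tokens - pos_estimate

# With enough images, the estimate approaches the injected artifact,
# because the zero-mean content averages out.
err = np.abs(pos_estimate - positional_noise).mean()
```

This per-position mean subtraction is only one plausible baseline; the actual method proposed in the talk may differ.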