MASKED FREQUENCY MODELING FOR SELF-SUPERVISED VISUAL PRE-TRAINING

Jiahao Xie; Wei Li; Xiaohang Zhan; Ziwei Liu; Yew Soon Ong; Chen Change Loy

Research output: Contribution to conference › Paper › peer-review

15 Citations (Scopus)

Abstract

We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens to the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is more ideal to reveal underlying image patterns rather than predicting masked patches in the spatial domain, due to the heavy spatial redundancy. Our findings suggest that with the right configuration of mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as several robustness benchmarks show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we also comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach. Project page: https://www.mmlab-ntu.com/project/mfm/index.html.
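The mask-and-predict step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the circular filter, the `radius` and `keep` parameters, the function names, and the mean-absolute-difference loss are all assumptions made for illustration, and the "prediction" here is just an array standing in for a model output.

```python
import numpy as np

def frequency_mask(image, radius, keep="low"):
    """Mask part of a 2D image's spectrum with a centered circular filter.

    `radius` and `keep` are hypothetical parameters for this sketch.
    Returns the corrupted image, the full (shifted) spectrum, and the
    boolean mask of frequencies that were kept.
    """
    h, w = image.shape
    # Move the zero-frequency component to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    # keep="low" retains frequencies inside the circle, otherwise outside it.
    kept = dist <= radius if keep == "low" else dist > radius
    # Zero out the masked frequencies and return to the spatial domain.
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * kept)).real
    return corrupted, spectrum, kept

def masked_frequency_loss(predicted_image, target_spectrum, kept):
    """Mean absolute spectral difference over the masked frequencies only
    (an assumed loss, chosen for simplicity)."""
    pred_spectrum = np.fft.fftshift(np.fft.fft2(predicted_image))
    masked = ~kept
    return float(np.abs(pred_spectrum - target_spectrum)[masked].mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32))
    corrupted, spectrum, kept = frequency_mask(img, radius=8, keep="low")
    # A model would map `corrupted` to a prediction; the corrupted input
    # itself is used here only to show that the loss is non-zero.
    print(masked_frequency_loss(corrupted, spectrum, kept))
```

In this sketch the low-pass and high-pass variants use complementary masks, so their two corrupted images sum back to the original input.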

Original language	English
Publication status	Published - 2023
Externally published	Yes
Event	11th International Conference on Learning Representations, ICLR 2023 - Kigali, Rwanda
Duration	May 1 2023 → May 5 2023

Conference

Conference	11th International Conference on Learning Representations, ICLR 2023
Country/Territory	Rwanda
City	Kigali
Period	5/1/23 → 5/5/23

Bibliographical note

Publisher Copyright:
© 2023 11th International Conference on Learning Representations, ICLR 2023. All rights reserved.

ASJC Scopus Subject Areas

Language and Linguistics
Computer Science Applications
Education
Linguistics and Language

Cite this

@conference{880069d1f69c425d96618769bba9eeab,

title = "MASKED FREQUENCY MODELING FOR SELF-SUPERVISED VISUAL PRE-TRAINING",

abstract = "We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens to the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is more ideal to reveal underlying image patterns rather than predicting masked patches in the spatial domain, due to the heavy spatial redundancy. Our findings suggest that with the right configuration of mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as several robustness benchmarks show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we also comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach. Project page: https://www.mmlab-ntu.com/project/mfm/index.html.",

author = "Jiahao Xie and Wei Li and Xiaohang Zhan and Ziwei Liu and Ong, {Yew Soon} and Loy, {Chen Change}",

note = "Publisher Copyright: {\textcopyright} 2023 11th International Conference on Learning Representations, ICLR 2023. All rights reserved.; 11th International Conference on Learning Representations, ICLR 2023 ; Conference date: 01-05-2023 Through 05-05-2023",

year = "2023",

language = "English",

}

TY - CONF

T1 - MASKED FREQUENCY MODELING FOR SELF-SUPERVISED VISUAL PRE-TRAINING

AU - Xie, Jiahao

AU - Li, Wei

AU - Zhan, Xiaohang

AU - Liu, Ziwei

AU - Ong, Yew Soon

AU - Loy, Chen Change

PY - 2023

Y1 - 2023

N2 - We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens to the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is more ideal to reveal underlying image patterns rather than predicting masked patches in the spatial domain, due to the heavy spatial redundancy. Our findings suggest that with the right configuration of mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as several robustness benchmarks show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we also comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach. Project page: https://www.mmlab-ntu.com/project/mfm/index.html.

UR - http://www.scopus.com/inward/record.url?scp=85164814550&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85164814550&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85164814550

T2 - 11th International Conference on Learning Representations, ICLR 2023

Y2 - 1 May 2023 through 5 May 2023

ER -
