ICME 2025 Grand Challenge on Video Super-Resolution for Video Conferencing

This ICME grand challenge focuses on video super-resolution for video conferencing, where the low-resolution video is encoded with the H.265 codec using fixed QPs. The goal is to upscale the input LR videos by a specified factor and produce HR videos with perceptually enhanced quality, including compression artifact removal. The entire challenge follows the low-delay scenario, in which no future frames may be used to enhance the current frame. Additionally, there are three tracks, each targeting a specific type of video content:

  • Track 1: General purpose real-world video content, x4 upscaling
  • Track 2: Talking Head videos, x4 upscaling
  • Track 3: Screen sharing videos, x3 upscaling

Separate training, validation, and test sets will be provided for each track by the organizers. The training set for Track 1 is based on the REDS dataset [20]; for Track 2 it is an extension of the VCD dataset [21] and includes real-world recordings; and for Track 3 it is based on publicly available datasets such as [22]–[24]. The validation set for each track includes 5 source video clips of 300 frames each, encoded with H.265 using 4 fixed QPs. The test set will be blind and include 20 source video clips per track, prepared in the same way as the validation set.
See below for further information about each track. The training and validation sets will be published on the challenge's GitHub page on the start date.

Evaluation criteria

The video clips in the test set are 10 seconds long and have a frame rate of 30 frames per second (FPS). The validation and test set videos for Track 1 and Track 2 are real-world videos. Participants are free to use any additional training data. The VSR method does not need to use additional frames, i.e., it can be a single-image SR model. Participating teams should process the provided test set using their models and submit the resulting video clips. Each team may participate in one or more tracks, using the same or different models for each track.

Models must follow the low-delay setting, i.e., they must not use any future frames or information when enhancing the current frame. Submissions will be ranked based on subjective quality, measured with the crowdsourcing implementation of ITU-T Rec. P.910 [25]. We will calculate the Mean Opinion Score (MOS) for each processed clip. Submissions in each track will be ranked based on the average MOS and the 95% confidence interval of the MOS scores over all their processed clips. For Track 3, an OCR-based Character Error Rate term will also be included in the ranking score (see the Track 3 description below). Additionally, we will provide objective metrics in the final report.
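
For reference, the per-track ranking reduces to the mean MOS over all processed clips together with its 95% confidence interval. Below is a minimal sketch of that aggregation step only; the crowdsourcing pipeline that produces the per-clip MOS values follows ITU-T Rec. P.910 [25] and is run by the organizers, and the normal-approximation confidence interval used here is an assumption for illustration.

```python
# Minimal sketch: aggregate per-clip MOS values into a track-level mean and a
# 95% confidence interval. The official P.910 crowdsourcing pipeline is run by
# the organizers; this only illustrates the aggregation.
import math
from statistics import mean, stdev

def mos_with_ci(clip_scores):
    """clip_scores: list of per-clip MOS values for one submission."""
    n = len(clip_scores)
    m = mean(clip_scores)
    if n < 2:
        return m, 0.0
    s = stdev(clip_scores)
    # Normal approximation for the 95% CI half-width (a t-quantile could be
    # used instead for small n).
    half_width = 1.96 * s / math.sqrt(n)
    return m, half_width

# Example: 20 test clips, one MOS value per clip (illustrative numbers).
scores = [3.8, 4.1, 3.5, 3.9, 4.0] * 4
m, ci = mos_with_ci(scores)
print(f"MOS = {m:.2f} ± {ci:.2f} (95% CI)")
```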

Submission guidelines

Participants must process video sequences in the blind test set using their model and submit the processed video sequences before the competition’s end date. Only the latest submission before the deadline will be considered per team and track. To be included in the competition, a complete submission is required, which means a processed video sequence that is upscaled by the specified factor for each input video clip. The number of frames must match between the input and processed clips. No external tools are allowed for enhancing the quality of processed clips.
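
Since an incomplete submission (wrong frame count or output resolution) cannot be ranked, it may help to sanity-check each processed clip before submitting. The sketch below is not an official validation script; it assumes MP4 outputs readable by OpenCV, and the file names and scale factor are illustrative.

```python
# Minimal sketch: check that a processed clip matches the input frame count and
# the expected upscaling factor. Paths and scale factor are illustrative.
import cv2

def video_props(path):
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"Cannot open {path}")
    props = (
        int(cap.get(cv2.CAP_PROP_FRAME_COUNT)),
        int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)),
    )
    cap.release()
    return props

def check_submission(lr_path, sr_path, scale):
    lr_frames, lr_w, lr_h = video_props(lr_path)
    sr_frames, sr_w, sr_h = video_props(sr_path)
    assert sr_frames == lr_frames, "Frame count must match the input clip"
    assert (sr_w, sr_h) == (lr_w * scale, lr_h * scale), "Wrong output resolution"

# Example usage with illustrative paths (x4 tracks; use scale=3 for Track 3).
check_submission("input/clip_001.mp4", "output/clip_001.mp4", scale=4)
```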

Although this challenge does not target model efficiency, we require participating teams to report the runtime, number of parameters, and FLOPs of their models. Utility scripts will be provided by the organizers.
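
The organizers' utility scripts are the reference for these numbers. As a rough sketch, and assuming a PyTorch model, parameter count and average per-frame runtime could be measured along the following lines; FLOPs counting is left to the provided scripts.

```python
# Rough sketch: report parameter count and average per-frame runtime for a
# PyTorch model. The organizers' utility scripts remain the reference; input
# resolution, warmup, and run counts here are illustrative.
import time
import torch

def report_stats(model, lr_height=180, lr_width=320, warmup=5, runs=20):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())

    x = torch.randn(1, 3, lr_height, lr_width, device=device)
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / runs

    print(f"Parameters: {n_params / 1e6:.2f} M")
    print(f"Runtime: {elapsed * 1000:.1f} ms per frame at {lr_width}x{lr_height}")
```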

Participants in the grand challenge must submit a short paper (up to 4 pages). The papers should describe their models, any additional data used in training, and the performance on the validation set.

If authors are participating in multiple tracks, the differences between models and/or training should be stated in the paper. The ranking in the final competition can be added to the camera-ready version if the paper is accepted. If no paper is submitted or it is shorter than 3 pages, the team will be removed from the competition. Authors of accepted papers will also have a chance to present their work during the ICME 2025 conference’s grand challenge session. Further guidelines will be provided on the challenge website.

Tracks

Track 1 – General purpose real-world video content

This track addresses Real-world Video Super Resolution without specifying a content type. The task involves x4 upscaling and removing compression artifacts from the input video. The model inputs are low-resolution (LR) videos encoded with H.265 using constant Quantization Parameters (QP). Models must not use future frames when upscaling and enhancing the current frame.
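
To make the low-delay constraint concrete, the sketch below shows a causal processing loop in which the enhancement of frame t can only see frames 0..t. The past-frame buffer and its length are illustrative assumptions; a single-image model would simply ignore the buffer.

```python
# Minimal sketch of low-delay (causal) processing: when frame t is enhanced,
# only frames 0..t have been seen and nothing from the future is buffered.
from collections import deque

def process_stream(frames, enhance_fn, num_past=3):
    """Apply enhance_fn(current_frame, past_frames) in display order.

    num_past is illustrative; enhance_fn stands in for the actual SR model.
    """
    past = deque(maxlen=num_past)
    outputs = []
    for frame in frames:                      # frames arrive in display order
        outputs.append(enhance_fn(frame, list(past)))
        past.append(frame)                    # buffer only already-seen frames
    return outputs
```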

We provide training, validation, and test sets. Participants may also use other data for training their models; however, they should provide details in their paper.
The training and validation sets are based on the REDS dataset [20]. Aside from 5 video clips reserved for validation, the rest of the original REDS validation set is included in the training set. The training and validation sets are available on the challenge's GitHub page. The blind test set will only be provided to registered teams one week before the challenge end date. We used the low-resolution videos from REDS (originally downscaled with bicubic interpolation) and encoded them with H.265. Figure 1 illustrates the process; a rough sketch of the preparation pipeline is given after the figure.

Figure 1 – Data flow diagram for Track 1
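
As a rough sketch of the overall LR preparation in Figure 1, the snippet below applies bicubic x4 downscaling followed by H.265 (libx265) encoding at a fixed QP using FFmpeg. The QP value, preset, and other encoder settings are assumptions for illustration, not the organizers' exact configuration.

```python
# Rough sketch of LR data preparation: bicubic x4 downscale followed by H.265
# (libx265) encoding at a fixed QP. QP and other encoder settings are
# illustrative, not the organizers' exact configuration. Requires ffmpeg with
# libx265 on the PATH.
import subprocess

def encode_lr(src_path, dst_path, scale=4, qp=32):
    cmd = [
        "ffmpeg", "-y", "-i", src_path,
        "-vf", f"scale=iw/{scale}:ih/{scale}:flags=bicubic",
        "-c:v", "libx265",
        "-x265-params", f"qp={qp}",
        "-an",            # drop audio
        dst_path,
    ]
    subprocess.run(cmd, check=True)

# Example usage with illustrative paths and QP.
encode_lr("hr/clip_001.mp4", "lr/clip_001_qp32.mp4", scale=4, qp=32)
```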

Track 2 – Talking Head videos

This track addresses Real-world Video Super Resolution for Talking Head content. The task involves x4 upscaling and removing compression artifacts from the input video. The videos are real-world recordings, so additional artifacts may be present in the original footage. The model inputs are low-resolution (LR) videos encoded with H.265 using constant Quantization Parameters (QP). Models must not use future frames when upscaling and enhancing the current frame.

We provide training, validation, and test sets. Participants may also use other data for training their models; however, they should provide details in their paper.
The training and validation sets are an extension of the VCD dataset [21] (https://github.com/microsoft/VCD) and may include landscape or portrait videos. The training and validation sets are available on the challenge's GitHub page. The blind test set will only be provided to registered teams one week before the challenge end date. Figure 2 illustrates the data preparation process, and Figure 3 presents thumbnail images from a portion of the training set.

Figure 2 – Data flow diagram for Track 2
Figure 3 – Thumbnail images of Track 2's training set

Track 3 – Screen sharing videos

This track addresses Video Super Resolution for screen sharing content. The task involves x3 upscaling and removing compression artifacts from the input video. The videos are screen recordings of different productivity applications in which a user performs typical tasks. The model inputs are low-resolution (LR) videos encoded with H.265 using constant Quantization Parameters (QP). Models must not use future frames when upscaling and enhancing the current frame.

We provide training, validation, and test sets. Participants may also use other data for training their models; however, they should provide details in their paper. The training and validation sets are available on the challenge's GitHub page. The blind test set will only be provided to registered teams one week before the challenge end date.
For this track, the challenge score is a combination of the subjective Mean Opinion Score (MOS) and the Character Error Rate (CER), determined by applying OCR to multiple sections of specific frames in the test set (see Equation 1). Sample code will be provided during the challenge for testing on the validation set.

Equation 1: Challenge score for Track 3 per clip, where CER is the average Character Error Rate over multiple frames.
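
The organizers' sample code defines the OCR engine, the frame and region selection, and how CER is combined with MOS in Equation 1. Purely as an illustration of the CER term itself, the sketch below computes the character-level edit distance between an OCR hypothesis and the reference text, normalized by the reference length.

```python
# Illustrative CER for one text region: Levenshtein edit distance between the
# OCR hypothesis and the reference text, normalized by the reference length.
# The OCR engine, frame/region selection, and the MOS+CER combination follow
# the organizers' sample code, not this sketch.
def character_error_rate(reference: str, hypothesis: str) -> float:
    n, m = len(reference), len(hypothesis)
    if n == 0:
        return float(m > 0)
    # Dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[m] / n

# Example with illustrative strings.
print(character_error_rate("Quarterly report Q3", "Quartely report 03"))
```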


Figure 4 illustrates the data preparation process, and Figure 5 presents thumbnail images from a portion of the training set.

Figure 4 – Data flow diagram for Track 3
Figure 5 – Thumbnail images from a portion of Track 3's training and validation sets

References

[20] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee, “NTIRE 2019 challenge on video deblurring and super-resolution: Dataset and study,” in CVPR Workshops, June 2019.
[21] Babak Naderi, Ross Cutler, Nabakumar Singh Khongbantabam, Yasaman Hosseinkashi, Henrik Turbell, Albert Sadovnikov, and Quan Zou, “VCD: A video conferencing dataset for video compression,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 3970–3974.
[22] Shan Cheng, Huanqiang Zeng, Jing Chen, Junhui Hou, Jianqing Zhu, and Kai-Kuang Ma, “Screen content video quality assessment: Subjective and objective study,” IEEE Transactions on Image Processing, vol. 29, pp. 8636–8651, 2020.
[23] Yingbin Wang, Xin Zhao, Xiaozhong Xu, Shan Liu, Zhijun Lei, Mariana Afonso, Andrey Norkin, and Thomas Daede, “An open video dataset for screen content coding,” in 2022 Picture Coding Symposium (PCS). IEEE, 2022, pp. 301–305.
[24] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
[25] Babak Naderi and Ross Cutler, “A crowdsourcing approach to video quality assessment,” in ICASSP, 2024.