Introducing

MinD-Video

Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
We propose MinD-Video, which progressively learns spatiotemporal information from continuous fMRI data through masked brain modeling, multimodal contrastive learning, spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation.

This work has been accepted at NeurIPS 2023 for oral presentation.

This is an extension of our previous fMRI-Image reconstruction work, MinD-Vis (CVPR 2023).
Paper · GitHub

Ground truth Videos

Reconstructed Videos



Motivation & Research Gap


Brain decoding & video reconstruction

Reconstructing human vision from brain activity is an appealing task that helps us understand our cognitive processes. Although recent research has seen great success in reconstructing static images from non-invasive brain recordings, work on recovering continuous visual experiences in the form of videos remains limited.

We identified three gaps between video reconstruction and our previous image reconstruction work:

  • The hemodynamic response introduces a time delay between a stimulus and the measured fMRI signal. This lag makes it challenging to accurately track real-time brain responses to dynamic stimuli.
  • Our previous work, MinD-Vis, lacks both pixel-level and semantic-level guidance, which limits its effectiveness in generating accurate reconstructions.
  • Generation consistency needs to be improved while preserving the scene dynamics within each fMRI frame. This balance is key to accurate and stable reconstruction over one fMRI time frame.

MinD-Video Design

Figure: MinD-Video pipeline flowchart.

In this work, we present MinD-Video, a two-module pipeline designed to bridge the gap between image and video brain decoding. The two modules are trained separately, then fine-tuned together.

In the first module, our model progressively learns from brain signals, gaining a deeper understanding of the semantic space through multiple stages.

  • Initially, we leverage large-scale unsupervised learning with masked brain modeling to learn general visual fMRI features. A spatiotemporal attention mechanism is designed to process multiple fMRI frames within a sliding window (see the sketch after this list).
  • We then distill semantic-related features using the multimodality of the annotated dataset, training the fMRI encoder in the CLIP space with contrastive learning.
  • In the second module, the learned features are fine-tuned through co-training with an augmented Stable Diffusion model, which is specifically tailored for video generation under fMRI guidance.
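
To make the sliding-window idea concrete, here is a minimal PyTorch sketch of spatiotemporal attention over a window of fMRI frames. The module name, shapes, and hyperparameters are illustrative assumptions for this page, not the released MinD-Video code.

```python
# Minimal sketch: joint attention over all patch tokens of all fMRI frames in one window.
# Shapes and hyperparameters are illustrative only.
import torch
import torch.nn as nn


class WindowedSpatiotemporalAttention(nn.Module):
    """Attends jointly over every patch token of every fMRI frame in a sliding window."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, window, patches, dim) -- consecutive fMRI frames, each split into patch embeddings.
        b, w, p, d = x.shape
        tokens = x.reshape(b, w * p, d)            # flatten time and space into one token sequence
        q = self.norm(tokens)
        out, _ = self.attn(q, q, q)                # full attention across frames and patches
        return (tokens + out).reshape(b, w, p, d)  # residual connection, restore the windowed shape


# Example: batch of 2 windows, each with 3 fMRI frames of 128 patch tokens.
frames = torch.randn(2, 3, 128, 768)
print(WindowedSpatiotemporalAttention()(frames).shape)  # torch.Size([2, 3, 128, 768])
```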

Contribution

  • We introduced a flexible and adaptable brain decoding pipeline decoupled into two modules: an fMRI encoder and an augmented Stable Diffusion model, trained separately and fine-tuned together.
  • We designed a progressive learning scheme in which the encoder learns brain features through multiple stages, including multimodal contrastive learning with spatiotemporal attention for windowed fMRI.
  • We recovered high-quality videos with accurate semantics, e.g., motion and scene dynamics. Results are evaluated with semantic and pixel metrics at the video and frame levels: 85% accuracy on semantic metrics and 0.19 SSIM, outperforming the previous state-of-the-art approaches by 45% (a minimal SSIM sketch follows this list).
  • The attention analysis revealed mapping to the visual cortex and higher cognitive networks, suggesting our model is biologically plausible and interpretable.
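
As a rough illustration of the frame-level pixel metric, the snippet below averages SSIM over paired ground-truth and reconstructed frames using scikit-image. The exact resolution, preprocessing, and averaging protocol follow the paper; this only shows the basic computation.

```python
# Hedged sketch of frame-level SSIM between ground-truth and reconstructed videos.
import numpy as np
from skimage.metrics import structural_similarity as ssim


def mean_frame_ssim(gt_video: np.ndarray, rec_video: np.ndarray) -> float:
    """Average SSIM over paired RGB frames of shape (T, H, W, 3) with values in [0, 1]."""
    scores = [
        ssim(gt, rec, channel_axis=-1, data_range=1.0)
        for gt, rec in zip(gt_video, rec_video)
    ]
    return float(np.mean(scores))


# Toy example with random frames; real use would load decoded and ground-truth videos.
gt = np.random.rand(8, 256, 256, 3)
rec = np.random.rand(8, 256, 256, 3)
print(mean_frame_ssim(gt, rec))
```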

Results - Compare with Benchmarks

Figure: comparison with benchmarks.

We compare our results with the samples provided in multiple previous works on the fMRI-Video reconstruction task, as well as with our fMRI-Image pipeline. Our method generates samples that are more semantically meaningful and better match the ground truth.

Results - Ablation study

Figure: ablation study results.

Results - Learn from Brain

Figure: attention interpretation.

Our attention analysis of the transformers decoding fMRI data yielded three significant insights (a toy sketch of the ROI-level aggregation follows this list):

  • Dominance of the Visual Cortex: Our analysis underscores the critical role of the visual cortex in processing visual spatiotemporal information. However, higher cognitive networks, such as the dorsal attention network and the default mode network, also contribute to the visual perception process.
  • Layer-Dependent Hierarchy: The layers of our fMRI encoder operate in a hierarchical fashion. Initial layers focus on structural information, while deeper layers shift toward learning more abstract visual features, indicating a gradient of complexity in feature extraction.
  • Progressive Semantic Learning: Our fMRI encoder evolves through each learning stage, showing increased attention to higher cognitive networks and decreased focus on the visual cortex over time. This progression suggests the encoder improves its ability to assimilate more nuanced, semantic information throughout its training stages.
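
For readers curious how such region-level attention scores might be tabulated, below is a toy sketch that averages per-voxel attention weights within each brain network. The ROI labels, the attention values, and the function name are hypothetical and do not reproduce the paper's actual analysis.

```python
# Hypothetical sketch: aggregate per-voxel attention weights into ROI-level scores.
import numpy as np


def roi_attention(attn_per_voxel: np.ndarray, roi_labels: np.ndarray) -> dict:
    """Average attention weight per ROI; both arrays share the voxel dimension (V,)."""
    return {
        roi: float(attn_per_voxel[roi_labels == roi].mean())
        for roi in np.unique(roi_labels)
    }


# Toy example: 1000 voxels with attention weights from one encoder layer and dummy ROI labels.
attn = np.random.rand(1000)
labels = np.random.choice(["visual", "dorsal_attention", "default_mode"], size=1000)
print(roi_attention(attn, labels))
```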

More Samples

Videos: groundtruth / generated pairs.

Fail Cases

In short, the failure cases can be attributed to two factors:

  • Lack of pixel-level controllability. Due to the probabilistic nature of the diffusion model and the current conditioning method, the generation process lacks strong control from the fMRI latent to produce strictly matching low-level features, such as shape, color, and geometric information. We believe this is an important direction for future research on this task.
  • Uncontrollable factors during the scan. Mind wandering and imagination by the subject are usually inevitable during the scan. It has been shown that imagination is involved and can be decoded to some extent from the visual cortex, which can lead to mismatches between the ground truth and the generated results.
Videos: groundtruth / generated pairs (failure cases).

Media Coverage

Mind-X

Mind-X is a research interest group that aims to explore multimodal brain decoding with large models. It was first initiated by Zijiao Chen (NUS CSC), Jiaxin Qing (CUHK IE), Tiange Xiang (Stanford AI Lab) and Prof. Juan Helen Zhou (NUS CSC) in 2022. We aim to exploit the power of recent advances in large models and AGI to advance the field of brain decoding. Our ultimate goal is to develop general-purpose brain decoding models that empower various applications in brain-computer interfaces, neuroimaging, and neuroscience.

Acknowledgments

Big shoutout to our friend Tiange Xiang for all the stimulating chats and feedback on this work, and to Jonathan Xu for crafting the website for our project. And thanks to all members of the Multimodal Neuroimaging in Neuropsychiatric Disorders Laboratory for all the support and help. 🙌

Huge thanks to the Human Connectome Project (HCP) for the large-scale fMRI data, and to Prof. Zhongming Liu and Dr. Haiguang Wen for the awesome fMRI-Video dataset.

Can't forget the Stable Diffusion team for sharing their super impressive large model with everyone - you guys rock! And kudos to the Tune-A-Video team, you inspired us with your text-to-video pipeline. 🚀👏