Multimodal Learning

Visual Instruction Tuning with Polite Flamingo

For visual instruction tuning of multi-modal LLMs, we introduced a multi-modal response rewriter called “Polite Flamingo” to address the degradation of response politeness, a typical instance of the “multi-modal alignment tax”.

Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang. “Visual Instruction Tuning with Polite Flamingo”. In AAAI (2024).

Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

We introduced RemoteCLIP, the first general-purpose vision-language foundation model for remote sensing. RemoteCLIP outperforms the previous image-text retrieval SoTA by 9.14% mean recall on the RSITMD dataset and by 8.92% on the RSICD dataset. For zero-shot classification, RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets.

Fan Liu, Delong Chen (joint first author), Zhangqingyun Guan, et al. “RemoteCLIP: A Vision Language Foundation Model for Remote Sensing”. arXiv preprint (2023).

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Jun Zhou

Taming Diffusion Models for Music-driven Conducting Motion Generation

This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation, which integrates the diffusion model into a two-stage learning framework.

Zhuoran Zhao, Jinbin Bai, Delong Chen, Debang Wang, Yubo Pan

MEP-3M: A Large-scale Multi-modal E-Commerce Products Dataset

We construct MEP-3M, a large-scale Multi-modal E-commerce Products classification dataset consisting of over 3 million products and 599 fine-grained product categories. Previously, the conference version of this paper won the IJCAI 2021 LTDL Best Dataset Paper award.

Fan Liu, Delong Chen (joint first author), Xiaoyu Du, et al. “MEP-3M: A Large-scale Multi-modal E-Commerce Products Dataset”. In Pattern Recognition (2023).

Fan Liu, Delong Chen, Xiaoyu Du, Ruizhuo Gao, Feng Xu

A Review of Driver Fatigue Detection and Its Advances on the Use of RGB-D Camera and Deep Learning

In this review, we summarize the latest research findings and analyze the developmental trends of driver fatigue detection. We also present work on integrating RGB-D cameras with deep learning, where Generative Adversarial Networks and multi-channel schemes are used to improve performance.

Fan Liu, Delong Chen, Jun Zhou, Feng Xu

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

This paper proposed ProtoCLIP for improved representation grouping and enhanced robustness against the modality gap in large-scale vision-language pretraining. ProtoCLIP improved linear probing and zero-shot accuracy by 5.8% and 2.0%, respectively, and matched the performance of CLIP with 3× fewer epochs.

Delong Chen, Zhao Wu, Fan Liu, et al. “ProtoCLIP: Prototypical Contrastive Language Image Pretraining”. In IEEE Transactions on Neural Networks and Learning Systems, TNNLS (2023).

Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Shaoqiu Zheng, Ying Tan, Erjin Zhou

Self-Supervised Music Motion Synchronization Learning for Music-Driven Conducting Motion Generation

This paper proposed the first deep-learning-based method for music-driven conducting motion generation, and presented ConductorMotion100, a large-scale music-motion dataset with an unprecedented length of 100 hours. The associated demo paper won the Best Demo Award at IEEE ICME 2021. My graduation thesis at HHU on this project was awarded “First Class of Outstanding Graduation Thesis of Jiangsu Province” (江苏省优秀本科毕业论文一等奖).

Fan Liu, Delong Chen (corresponding author), et al. “Self-Supervised Music Motion Synchronization Learning for Music-Driven Conducting Motion Generation”. In Journal of Computer Science and Technology, JCST (2022).

Fan Liu, Delong Chen, Ruizhi Zhou, Sai Yang, Feng Xu

MEP-3M: A Large-scale Multi-modal E-Commerce Products Dataset

In this paper, we construct MEP-3M, a large-scale Multi-modal E-commerce Products classification dataset consisting of over 3 million products and 599 fine-grained product categories. This paper won the IJCAI 2021 LTDL Best Dataset Paper award.

Delong Chen, Fan Liu, Xiaoyu Du, Ruizhuo Gao, Feng Xu

VirtualConductor: Music-driven Conducting Video Generation System

In this demo, we present VirtualConductor, a system that generates a conducting video from a given piece of music and a single image of the user. This demo won the IEEE ICME 2021 Best Demo award.

Delong Chen, Fan Liu, Zewen Li, Feng Xu. “VirtualConductor: Music-driven Conducting Video Generation System”. In ICME 2021 (Demo Track).

Delong Chen, Fan Liu, Zewen Li, Feng Xu
