Dr Guangzhi (Brian) Sun
Junior Research Fellow

In Brief…

  • My research focuses on large audio and video language models, particularly their controllability, reliability, and safety. Topics of interest include audio-visual input understanding, hallucination, knowledge integration, editing and unlearning, and AI safety.
  • I work in the Machine Intelligence Laboratory, Department of Engineering, where I co-supervise MEng projects and help develop the MPhil Advanced Speech Recognition lecture course (MLMI14).
  • I supervise Part IA Engineering Maths for Trinity undergraduate students.

Profile

I have been a Junior Research Fellow at Trinity College, University of Cambridge, since October 2024. Before that, I worked with Prof. Phil Woodland as a research associate in the Machine Intelligence Laboratory at the University of Cambridge. I also collaborate closely with Prof. Mark Gales at the University of Cambridge and Prof. Chao Zhang at Tsinghua University. My research interest is controllable and reliable audio-visual conversational AI built on large language models; specifically, this includes audio-visual contextual knowledge integration and editing, hallucination, and multimodal contextualised AI safety. My research interests and experience also include speaker representation, speech synthesis, and source separation.

I completed my PhD at the University of Cambridge in June 2023, supervised by Prof. Phil Woodland (advisor: Prof. Mark Gales), working on contextual knowledge integration in end-to-end neural conversational AI systems. I held research internships at Google Brain with Dr Yu Zhang in 2019 and at ByteDance with Dr Wei Li in 2023. I was also grateful to collaborate with Poly AI Ltd, working with Dr Ivan Vulić and Dr Paweł Budzianowski in 2023. I received my BA and MEng degrees from Trinity College, University of Cambridge, in 2019.

Teaching

1. MPhil lectures: MLMI14 Advanced Speech Recognition, lecturer (2023-2024) and course developer (2022-2023)
2. Project supervision: co-supervisor of four 2024-2025 MEng projects on speaker diarisation, speech separation, and academic slides-to-script generation
3. Project supervision: supervisor of three 2023-2024 MLMI projects on visual large language models and optimisation
4. Undergraduate teaching: supervisions at Trinity College, Part IA Electrical (2019-2021) and Part IA Maths (2021-2025)
5. Lab demonstration: demonstrator for the MLMI2 Speech Recognition lab (2022-2023)

Research

My goal is to make multimodal large language models (LLMs) that understand audio and video inputs, such as OpenAI's GPT-4o and Google's Gemini series, more reliable and controllable. Specifically, my ongoing research investigates the following aspects of multimodal LLMs:

1. Chain-of-thought reasoning in LLMs for general video understanding tasks: to answer a question about a video clip, the model should not only give the answer but also explain, step by step, how that answer is derived. This explanation makes the model's behaviour more transparent and so helps us answer questions such as "Why did it do this?" or "What went wrong?" about the model (a minimal prompting sketch follows this list).

2. Knowledge integration and removal in multimodal LLMs: beyond the reasoning process, I also want to control what knowledge the model holds, either by explicitly integrating knowledge into the model or by removing specific knowledge, for example over privacy concerns or controversial content (see the second sketch below).

3. Other safety and reliability issues, including hallucination detection in LLMs with audio and video inputs and contextualised LLM safety evaluation (see the final sketch below).
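
To make the first direction concrete, here is a minimal sketch of chain-of-thought prompting for video question answering. The prompt template and the parsing helper are illustrative assumptions of my own rather than any specific system's API; any multimodal LLM that accepts video frames plus text could sit behind them.

```python
# A minimal sketch of chain-of-thought prompting for video question
# answering. COT_TEMPLATE and split_reasoning_and_answer are illustrative
# assumptions, not an existing API.

COT_TEMPLATE = (
    "Question about the video: {question}\n"
    "First, list the visual and audio evidence you observe, step by step.\n"
    "Then give your final answer on a new line starting with 'Answer:'."
)

def split_reasoning_and_answer(response: str) -> tuple[str, str]:
    """Separate the step-by-step rationale from the final answer line."""
    reasoning, answer = [], ""
    for line in response.splitlines():
        if line.strip().lower().startswith("answer:"):
            answer = line.split(":", 1)[1].strip()
        else:
            reasoning.append(line)
    return "\n".join(reasoning).strip(), answer

# A canned response stands in for a real model call here.
fake_response = (
    "Step 1: A person picks up a red mug from the table.\n"
    "Step 2: They pour coffee from a pot into the mug.\n"
    "Answer: They are making coffee."
)
rationale, answer = split_reasoning_and_answer(fake_response)
print(rationale)
print("Final answer:", answer)
```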
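
For the second direction, the sketch below shows only the interface idea at the prompt level: inject facts the model should use, withhold knowledge it must not reveal. Actual knowledge editing and unlearning operate on model parameters, so the fact dictionary and topic filter here are hypothetical stand-ins.

```python
# A toy, prompt-level stand-in for knowledge integration and removal.
# Real knowledge editing and unlearning modify model parameters; the
# fact dictionary and topic filter below are purely hypothetical.

INTEGRATED_FACTS = {"speaker name": "Alice Smith"}   # knowledge to add
FORBIDDEN_TOPICS = {"home address"}                  # knowledge to remove

def build_prompt(question: str) -> str:
    """Inject integrated facts; refuse questions on removed knowledge."""
    if any(topic in question.lower() for topic in FORBIDDEN_TOPICS):
        return "Refuse: this request concerns removed or private knowledge."
    facts = "\n".join(f"- {k}: {v}" for k, v in INTEGRATED_FACTS.items())
    return f"Known facts:\n{facts}\n\nQuestion: {question}"

print(build_prompt("Who is speaking in the clip?"))
print(build_prompt("What is the speaker's home address?"))
```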
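
For the third direction, the following toy example ranks model outputs by cross-model consistency, in the spirit of CrossCheckGPT (listed in the publications below): an output that disagrees with what other models report about the same input is scored as more likely to be hallucinated. The word-overlap similarity is a deliberate simplification of the learned measures used in practice.

```python
# A toy version of reference-free hallucination ranking by cross-model
# consistency: lower agreement with the other models' outputs suggests
# more hallucinated content.

def overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets (toy measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def consistency_scores(outputs: dict[str, str]) -> dict[str, float]:
    """Average each model's agreement with the other models' outputs."""
    return {
        name: sum(overlap(text, t) for n, t in outputs.items() if n != name)
        / (len(outputs) - 1)
        for name, text in outputs.items()
    }

outputs = {
    "model_a": "a man plays guitar on stage",
    "model_b": "a man performs a guitar solo on stage",
    "model_c": "a chef cooks pasta in a kitchen",  # the inconsistent one
}
# Rank from least to most consistent (most to least suspect).
for name, score in sorted(consistency_scores(outputs).items(), key=lambda kv: kv[1]):
    print(f"{name}: consistency = {score:.2f}")
```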

Selected Publications

Sun, G., Yu, W., Tang, C., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z., Wang, Y. and Zhang, C., 2024. video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models. ICML.

Zhang, Z., Wu, W., Sun, G. and Zhang, C., 2024. Bayesian WeakS-to-Strong from Text Classification to Generation. ICLR.

Sun, G., Manakul, P., Liusie, A., Pipatanakul, K., Zhang, C., Woodland, P. and Gales, M., 2024. CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models. NeurIPS Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models.

Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z. and Zhang, C., 2024. SALMONN: Towards Generic Hearing Abilities for Large Language Models. ICLR.

Luo, X., Rechardt, A. et al., 2024. Large Language Models Surpass Human Experts in Predicting Neuroscience Results. Nature Human Behaviour.

Sun, G., Zheng, X., Zhang, C. and Woodland, P.C., 2023. Can Contextual Biasing Remain Effective with Whisper and GPT-2? Interspeech.

Sun, G., Zhang, C. and Woodland, P.C., 2022. Tree-Constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition. Interspeech.

Sun, G., Zhang, Y., Weiss, R.J., Cao, Y., Zen, H. and Wu, Y., 2020. Fully-Hierarchical Fine-Grained Prosody Modeling for Interpretable Speech Synthesis. ICASSP.

Subject

Engineering
