Multi-visual-Modality Human Activity Understanding (MMHAU)

ACCV 2020 Workshop, Kyoto, Japan


December 3, 2020 (17:00-19:00 and 21:00-23:00, Beijing, UTC+8)


Session One: 17:00-19:00 (Beijing, China Standard Time, UTC+8)

Beijing 17:00~17:25

Vijay John, Ali Boyali, Simon Thompson, Annamalai Lakshmanan, and Seiichi Mita. Visible and Thermal Camera-based Jaywalking Estimation using a Hierarchical Deep Learning Framework


Beijing 17:25~17:50

Shihao Zhou, Mengxi Jiang, Qicong Wang, and Yunqi Lei. Towards Locality Similarity Preserving to 3D Human Pose Estimation


Beijing 17:50~18:15

Shigenori Nagae and Yamato Takeuchi. Iterative Self-distillation for Precise Facial Landmark Localization

Beijing 18:15~18:40

Ao Li, Jiajia Chen, Deyun Chen, and Guanglu Sun. Multiview Similarity Learning for Robust Visual Clustering


Beijing 18:40~19:00

Yuanzhong Liu, Zhigang Tu, Liyu Lin, Xing Xie, and Qianqing Qin. Real-time Spatio-temporal Action Localization via Learning Motion Representation


Session Two: 21:00-23:00 (Beijing, China Standard Time, UTC+8)

#1 Beijing 21:00~21:40: Invited Talk (Dr. Jiaying Liu)

Talk: Intelligent Action Analytics with Multi-Modality Data


Talk abstract: Intelligent action analytics is an important task in computer vision. As technology develops, multi-modal data have become more accessible, including RGB videos, depth maps, IR images, and human skeletons, which suit different application scenarios. However, processing RGB videos can be very time-consuming and requires large storage space, while human skeletons have drawn much attention as a lightweight, high-level representation of human behavior. Furthermore, how to utilize multi-modal data to help action understanding remains underexplored.

In this talk, I will discuss three aspects of action analytics. First, I will focus on skeleton-based action analytics. With the help of different methods, skeleton data can achieve remarkable action recognition performance efficiently. I will also explore multi-task self-supervised learning and different training strategies for learning feature representations from skeleton data with less annotation effort. Because some meaningful information may be lost in skeletons, in the second part of the talk I will introduce multi-modal action analytics, which integrates data from different modalities from multiple perspectives. Furthermore, we collect a multi-modal dataset, PKU-MMD, to investigate multi-modal action understanding.


Bio: Jiaying Liu is currently an Associate Professor with the Wangxuan Institute of Computer Technology, Peking University. She received the Ph.D. degree (Hons.) in computer science from Peking University, Beijing, China, in 2010. She has authored over 100 technical articles in refereed journals and proceedings, and holds 43 granted patents. Her current research interests include multimedia signal processing, compression, and computer vision.

Dr. Liu is a Senior Member of IEEE, CSIG, and CCF. She was a Visiting Scholar with the University of Southern California, Los Angeles, from 2007 to 2008. She was a Visiting Researcher with Microsoft Research Asia in 2015, supported by the Star Track Young Faculties Award. She has served as a member of the Membership Services Committee in the IEEE Signal Processing Society, a member of the Multimedia Systems & Applications Technical Committee (MSA TC) and the Visual Signal Processing and Communications Technical Committee (VSPC TC) in the IEEE Circuits and Systems Society, and a member of the Image, Video, and Multimedia (IVM) Technical Committee in APSIPA. She received the IEEE ICME-2020 Best Paper Award and the IEEE MMSP-2015 Top10% Paper Award. She has also served as an Associate Editor of IEEE Trans. on Image Processing and Elsevier JVCI, the Technical Program Chair of IEEE ICME-2021/ACM ICMR-2021, the Publicity Chair of IEEE ICME-2020/ICIP-2019, and an Area Chair of CVPR-2021/ECCV-2020/ICCV-2019. She was an APSIPA Distinguished Lecturer (2016-2017).


#2 Beijing 21:40~22:20: Invited Talk (Dr. Jun Liu)

Talk: Context modeling for human behavior understanding


Talk abstract: Human action understanding is an important and active research problem due to its wide applications in security surveillance, self-driving vehicles, robotics, and human-machine interaction. Spatio-temporal context modeling and learning are crucial for this task. In this seminar, several deep learning architectures for human action recognition will be introduced, including networks for modeling the spatio-temporal context dependencies in action video sequences, frameworks for selecting the most important and appropriate context for action analysis, and mechanisms for improving the robustness of deep networks by taking advantage of spatio-temporal context information.


Bio: Jun Liu has been an assistant professor with Singapore University of Technology and Design since 2019. He received the PhD degree from Nanyang Technological University in 2019, the MSc degree from Fudan University in 2014, and the BEng degree from Central South University in 2011. He was with Tencent from 2014 to 2015. His research interests include computer vision and artificial intelligence. His work has been published in premier computer vision journals and conferences, including TPAMI, CVPR, ICCV, and ECCV, and some of it is highly cited. He received several best paper awards from the Pattern Recognition and Machine Intelligence Association of Singapore for his work on video analytics and human activity understanding. He received the EEE Best Thesis Award from Nanyang Technological University in 2020.


#3 Beijing 22:20~23:00: Invited Talk (Dr. Juergen Gall)

Talk: Understanding and Anticipating Activities


Talk abstract: In this talk, I will discuss three aspects of video understanding. In the first part, I will discuss holistic video understanding, where the video is categorized not only by the activity but also by other tags that describe the scene, the objects present, or attributes. I will discuss how the additional tags help to recognize activities or caption videos. While holistic video understanding has so far considered only short video clips, in the second part of the talk I will introduce a multi-stage temporal convolutional network that temporally segments activities in long videos, as required in many applications such as surveillance or robotics. For close human-robot collaboration, however, just analyzing the past is not enough, and robots need to anticipate the future activities of the human. In the last part of the talk, I will therefore discuss how future activities can be anticipated.


Bio: Prof. Dr. Juergen Gall has been professor and head of the Computer Vision Group at the University of Bonn since 2013. After his Ph.D. in computer science from Saarland University and the Max Planck Institute for Informatics, he was a postdoctoral researcher at the Computer Vision Laboratory, ETH Zurich, from 2009 until 2012 and a senior research scientist at the Max Planck Institute for Intelligent Systems in Tübingen from 2012 until 2013. He received a grant for an independent Emmy Noether research group from the German Research Foundation (DFG) in 2013, the German Pattern Recognition Award of the German Association for Pattern Recognition (DAGM) in 2014, and an ERC Starting Grant in 2016. He is also spokesperson of the DFG-funded research unit "Anticipating Human Behavior" and PI of the Cluster of Excellence "PhenoRob – Robotics and Phenotyping for Sustainable Crop Production".