We introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video-, segment-, and frame-level tasks. We also introduce MedGRPO, a novel RL framework for balanced multi-dataset training with cross-dataset reward normalization and a medical LLM judge that evaluates caption quality on five clinical dimensions.
@article{medgrpo2026,title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},year={2026},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
Consistent Instance Field for Dynamic Scene Understanding
We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution.
@article{consistent2026,title={Consistent Instance Field for Dynamic Scene Understanding},author={Wu, Junyi and Nguyen, Van Nguyen and Planche, Benjamin and Tao, Jiachen and Sun, Changchang and Gao, Zhongpai and Zhao, Zhenghao and Choudhuri, Anwesa and Zhang, Gengyu and Zheng, Meng and Wang, Feiran and Chen, Terrence and Yan, Yan and Wu, Ziyan},year={2026},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
We introduce Universal Beta Splatting (UBS), a unified framework that generalizes 3D Gaussian Splatting to N-dimensional anisotropic Beta kernels for explicit radiance field rendering. Unlike fixed Gaussian primitives, Beta kernels enable controllable dependency modeling across spatial, angular, and temporal dimensions within a single representation.
@article{universal2026,title={Universal Beta Splatting},author={Liu, Rong and Gao, Zhongpai and Planche, Benjamin and Chen, Meida and Nguyen, Van Nguyen and Zheng, Meng and Choudhuri, Anwesa and Chen, Terrence and Wang, Yue and Feng, Andrew and Wu, Ziyan},year={2026},journal={International Conference on Learning Representations (ICLR)},}
2025
highlight
CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
We propose a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input.
@article{chrome2025,title={CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image},author={Dutta, Arindam and Zheng, Meng and Gao, Zhongpai and Planche, Benjamin and Choudhuri, Anwesa and Roy-Chowdhury, Amit K. and Chen, Terrence and Wu, Ziyan},year={2025},journal={IEEE/CVF International Conference on Computer Vision (ICCV)},note={highlight},}
We present 7D Gaussian Splatting (7DGS), a unified framework representing scene elements as seven-dimensional Gaussians spanning position (3D), time (1D), and viewing direction (3D). Experiments demonstrate that 7DGS outperforms prior methods by up to 7.36 dB in PSNR while achieving real-time rendering on challenging dynamic scenes.
@article{7dgs2025,title={7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting},author={Gao, Zhongpai and Planche, Benjamin and Zheng, Meng and Choudhuri, Anwesa and Chen, Terrence and Wu, Ziyan},year={2025},journal={IEEE/CVF International Conference on Computer Vision (ICCV)},}
PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis
We introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification and unsupervised tracking in colonoscopic videos.
@article{polypsegtrack2025,title={PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis},author={Choudhuri, Anwesa and Gao, Zhongpai and Zheng, Meng and Planche, Benjamin and Chen, Terrence and Wu, Ziyan},year={2025},journal={Medical Image Computing and Computer Assisted Intervention (MICCAI)},}
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
We propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos.
@article{seq2time2025,title={Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding},author={Deng, Andong and Gao, Zhongpai and Choudhuri, Anwesa and Planche, Benjamin and Zheng, Meng and Wang, Bin and Chen, Chen and Chen, Terrence and Wu, Ziyan},year={2025},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
6DGS: Enhanced Direction-Aware Gaussian Splatting for Volumetric Rendering
We introduce 6D Gaussian Splatting (6DGS), which enhances color and opacity representations and leverages the additional directional information in the 6D space for optimized Gaussian control. Our approach significantly improves real-time radiance field rendering by better modeling view-dependent effects and fine details.
@article{6dgs2025,title={6DGS: Enhanced Direction-Aware Gaussian Splatting for Volumetric Rendering},author={Gao, Zhongpai and Planche, Benjamin and Zheng, Meng and Choudhuri, Anwesa and Chen, Terrence and Wu, Ziyan},year={2025},journal={International Conference on Learning Representations (ICLR)},}
We propose a 3D vision-language Gaussian splatting model for scene understanding, emphasizing representation learning for the language modality. We introduce a novel cross-modal rasterizer that uses modality fusion along with a smoothed semantic indicator to enhance semantic rasterization.
@article{3dvlgs2025,title={3D Vision-Language Gaussian Splatting},author={Peng, Qucheng and Planche, Benjamin and Gao, Zhongpai and Zheng, Meng and Choudhuri, Anwesa and Chen, Terrence and Chen, Chen and Wu, Ziyan},year={2025},journal={International Conference on Learning Representations (ICLR)},}
We introduce a novel order-aware attention mechanism, where order maps seamlessly guide user interactions to attend to the image features. Our approach allows both dense and sparse integration of user clicks, enhancing both accuracy and efficiency compared to prior work.
@article{ois2025,title={Order-aware Interactive Segmentation},author={Wang, Bin and Choudhuri, Anwesa and Zheng, Meng and Gao, Zhongpai and Planche, Benjamin and Deng, Andong and Liu, Qin and Chen, Terrence and Bagci, Ulas and Wu, Ziyan},year={2025},journal={International Conference on Learning Representations (ICLR)},}
Automated Patient Positioning with Learned 3D Hand Gestures
We propose an automated patient positioning system that utilizes a camera to detect specific hand gestures from technicians, allowing users to indicate the target patient region to the system and initiate automated positioning. Our approach relies on a novel multi-stage pipeline to recognize and interpret the technicians’ gestures, translating them into precise motions of medical devices.
@article{patient2025,title={Automated Patient Positioning with Learned 3D Hand Gestures},author={Gao, Zhongpai and Sharma, Abhishek and Zheng, Meng and Planche, Benjamin and Chen, Terrence and Wu, Ziyan},year={2025},journal={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},}
2024
DDGS-CT: Direction-Disentangled Gaussian Splatting for Realistic Volume Rendering
We present a novel approach that marries realistic physics-inspired X-ray simulation with efficient, differentiable DRR generation using 3D Gaussian splatting (3DGS). Our direction-disentangled 3DGS (DDGS) method separates the radiosity contribution into isotropic and direction-dependent components, approximating complex anisotropic interactions without intricate runtime simulations.
@article{ddgsct2024,title={DDGS-CT: Direction-Disentangled Gaussian Splatting for Realistic Volume Rendering},author={Gao, Zhongpai and Planche, Benjamin and Zheng, Meng and Chen, Xiao and Chen, Terrence and Wu, Ziyan},year={2024},journal={Annual Conference on Neural Information Processing Systems (NeurIPS)},}
Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images
We introduce a novel bottom-up approach for human body mesh reconstruction, specifically designed to address the challenges posed by partial visibility and occlusion in input images.
@article{divide2024,title={Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images},author={Luan, Tianyu and Gao, Zhongpai and Xie, Luyuan and Sharma, Abhishek and Ding, Hao and Planche, Benjamin and Zheng, Meng and Lou, Ange and Chen, Terrence and Yuan, Junsong and Wu, Ziyan},year={2024},journal={European Conference on Computer Vision (ECCV)},}
early accept
Few-Shot 3D Volumetric Segmentation with Multi-Surrogate Fusion
We present MSFSeg, a novel few-shot 3D segmentation framework with a lightweight multi-surrogate fusion (MSF). MSFSeg is able to automatically segment unseen 3D objects/organs (during training) provided with one or a few annotated 2D slices or 3D sequence segments.
@article{msfseg2024,title={Few-Shot 3D Volumetric Segmentation with Multi-Surrogate Fusion},author={Zheng, Meng and Planche, Benjamin and Gao, Zhongpai and Chen, Terrence and Radke, Richard J. and Wu, Ziyan},year={2024},journal={Medical Image Computing and Computer Assisted Intervention (MICCAI)},note={early accept},}
Cross-Class Domain Adaptive Semantic Segmentation with Visual Language Models
This work addresses the issue of cross-class domain adaptation (CCDA) in semantic segmentation, where the target domain contains both shared and novel classes that are either unlabeled or unseen in the source domain.
@article{ccda2024,title={Cross-Class Domain Adaptive Semantic Segmentation with Visual Language Models},author={Ren, Wenqi and Xia, Ruihao and Zheng, Meng and Wu, Ziyan and Tang, Yang and Sebe, Nicu},year={2024},journal={ACM Multimedia Conference (MM)},}
DaReNeRF: Direction-aware Representation for Dynamic Scenes
We present a novel direction-aware representation (DaRe) approach that captures scene dynamics from six different directions. This learned representation undergoes an inverse dual-tree complex wavelet transformation (DTCWT) to recover plane-based information.
@article{darenerf2024,title={DaReNeRF: Direction-aware Representation for Dynamic Scenes},author={Lou, Ange and Planche, Benjamin and Gao, Zhongpai and Li, Yamin and Luan, Tianyu and Ding, Hao and Chen, Terrence and Noble, Jack and Wu, Ziyan},year={2024},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
The 2nd AAAI Workshop on Artificial Intelligence with Biased or Scarce Data (AIBSD)
The official proceedings of the Second Workshop on Artificial Intelligence with Biased or Scarce Data in conjunction with AAAI Conference on Artificial Intelligence 2024.
@article{aibsdworkshop2024,title={The 2nd AAAI Workshop on Artificial Intelligence with Biased or Scarce Data (AIBSD)},author={Peng, Kuan-Chuan and Aich, Abhishek and Wu, Ziyan},year={2024},journal={MDPI Comput. Sci. Math. Forum},volume={9},number={1},}
PBADet: A One-Stage Anchor-Free Approach for Part-Body Association
We present PBADet, a novel one-stage, anchor-free approach for part-body association detection. Building upon the anchor-free object representation across multi-scale feature maps, we introduce a singular part-to-body center offset that effectively encapsulates the relationship between parts and their parent bodies.
@article{pbadet2024,title={PBADet: A One-Stage Anchor-Free Approach for Part-Body Association},author={Gao, Zhongpai and Zhou, Huayi and Sharma, Abhishek and Zheng, Meng and Planche, Benjamin and Chen, Terrence and Wu, Ziyan},year={2024},journal={International Conference on Learning Representations (ICLR)},}
Implicit Modeling of Non-rigid Objects with Cross-Category Signals
In this work, we propose MODIF, a multi-object deep implicit function that jointly learns the deformation fields and instance-specific latent codes for multiple objects at once. Our emphasis is on non-rigid, non-interpenetrating entities such as organs.
@article{implicit2024,title={Implicit Modeling of Non-rigid Objects with Cross-Category Signals},author={Liu, Yuchun and Planche, Benjamin and Zheng, Meng and Gao, Zhongpai and Sibut-Bourde, Pierre and Yang, Fan and Chen, Terrence and Wu, Ziyan},year={2024},journal={AAAI Conference on Artificial Intelligence (AAAI)},}
Disguise without Disruption: Utility-Preserving Face De-Identification
In this paper, we introduce Disguise, a novel algorithm that seamlessly de-identifies facial images while ensuring the usability of the modified data. Our solution is firmly grounded in the domains of differential privacy and ensemble-learning research.
@article{disguise2024,title={Disguise without Disruption: Utility-Preserving Face De-Identification},author={Cai, Zikui and Gao, Zhongpai and Planche, Benjamin and Zheng, Meng and Chen, Terrence and Asif, M. Salman and Wu, Ziyan},year={2024},journal={AAAI Conference on Artificial Intelligence (AAAI)},}
Federated Learning via Input-Output Collaborative Distillation
We propose a federated learning framework eliminating any requirement of recursive local parameter exchange or auxiliary task-relevant data to transfer knowledge, thereby giving direct privacy control to local users.
@article{federated2024,title={Federated Learning via Input-Output Collaborative Distillation},author={Gong, Xuan and Li, Shanglin and Bao, Yuxiang and Yao, Barry and Huang, Yawen and Wu, Ziyan and Zhang, Baochang and Zheng, Yefeng and Doermann, David},year={2024},journal={AAAI Conference on Artificial Intelligence (AAAI)},}
2023
CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation
Event cameras, as a new form of vision sensors, are complementary to conventional cameras with their high dynamic range. We propose a novel unsupervised Cross-Modality Domain Adaptation (CMDA) framework to leverage multi-modality (Images and Events) information for nighttime semantic segmentation.
@article{cmda2023,title={CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation},author={Xia, Ruihao and Zhao, Chaoqiang and Zheng, Meng and Wu, Ziyan and Sun, Qiyu and Tang, Yang},year={2023},journal={IEEE/CVF International Conference on Computer Vision (ICCV)},}
oral
Progressive Multi-view Human Mesh Recovery with Self-Supervision
We propose a novel simulation-based training pipeline for multi-view human mesh recovery, which (a) relies on intermediate 2D representations which are more robust to synthetic-to-real domain gap; (b) leverages learnable calibration and triangulation to adapt to more diversified camera setups; and (c) progressively aggregates multi-view information in a canonical 3D space.
@article{progressive2023,title={Progressive Multi-view Human Mesh Recovery with Self-Supervision},author={Gong, Xuan and Song, Liangchen and Zheng, Meng and Planche, Benjamin and Chen, Terrence and Yuan, Junsong and Doermann, David and Wu, Ziyan},year={2023},journal={AAAI Conference on Artificial Intelligence (AAAI)},note={oral},}
2022
Federated Learning with Privacy-Preserving Ensemble Attention Distillation
We propose a privacy-preserving FL framework leveraging unlabeled public data for one-way offline knowledge distillation. The central model is learned from local knowledge via ensemble attention distillation.
@article{flppead2022,title={Federated Learning with Privacy-Preserving Ensemble Attention Distillation},author={Gong, Xuan and Song, Liangchen and Vedula, Rishi and Sharma, Abhishek and Zheng, Meng and Planche, Benjamin and Innanje, Arun and Chen, Terrence and Yuan, Junsong and Doermann, David and Wu, Ziyan},year={2022},journal={IEEE Transactions on Medical Imaging (TMI)},}
We introduce a novel framework, the Scene History Excavating Network (SHENet), which leverages scene history in a simple yet effective way to forecast a person's future trajectory.
@article{trajectory2022,title={Forecasting Human Trajectory from Scene History},author={Meng, Mancheng and Wu, Ziyan and Chen, Terrence and Cai, Xiran and Zhou, Xiang Sean and Yang, Fan and Shen, Dinggang},year={2022},journal={Annual Conference on Neural Information Processing Systems (NeurIPS)},note={spotlight},}
We leverage a neural motion field for estimating the motion of all points in a multiview setting. We propose to regularize the estimated motion to be predictable.
@article{pref2022,title={PREF: Predictability Regularized Neural Motion Fields},author={Song, Liangchen and Gong, Xuan and Planche, Benjamin and Zheng, Meng and Doermann, David and Yuan, Junsong and Chen, Terrence and Wu, Ziyan},year={2022},journal={European Conference on Computer Vision (ECCV)},note={oral},}
Self-supervised Human Mesh Recovery with Cross-Representation Alignment
We propose cross-representation alignment utilizing the complementary information from the robust but sparse representation (2D keypoints). Specifically, the alignment errors between the initial mesh estimate and both 2D representations are forwarded into the regressor and dynamically corrected.
@article{cra2022,title={Self-supervised Human Mesh Recovery with Cross-Representation Alignment},author={Gong, Xuan and Zheng, Meng and Planche, Benjamin and Karanam, Srikrishna and Chen, Terrence and Doermann, David and Wu, Ziyan},year={2022},journal={European Conference on Computer Vision (ECCV)},}
PseudoClick: Interactive Image Segmentation with Click Imitation
We propose PseudoClick, a generic framework that enables existing segmentation networks to propose candidate next clicks as an imitation of human clicks to refine the segmentation mask.
@article{pseudoclick2022,title={PseudoClick: Interactive Image Segmentation with Click Imitation},author={Liu, Qin and Zheng, Meng and Planche, Benjamin and Karanam, Srikrishna and Chen, Terrence and Niethammer, Marc and Wu, Ziyan},year={2022},journal={European Conference on Computer Vision (ECCV)},}
early accept
Self-supervised 3D Patient Modeling with Multi-modal Attentive Fusion
We propose a generic modularized 3D patient modeling method consisting of (a) a multi-modal keypoint detection module with attentive fusion; and (b) a self-supervised 3D mesh regression module.
@article{patientmodeling2022,title={Self-supervised 3D Patient Modeling with Multi-modal Attentive Fusion},author={Zheng, Meng and Planche, Benjamin and Gong, Xuan and Yang, Fan and Chen, Terrence and Wu, Ziyan},year={2022},journal={Medical Image Computing and Computer Assisted Intervention (MICCAI)},note={early accept},}
We propose the first method to generate generic visual similarity explanations with gradient-based attention.
@article{similarityattention2022,title={Visual Similarity Attention},author={Zheng, Meng and Karanam, Srikrishna and Chen, Terrence and Radke, Richard J. and Wu, Ziyan},year={2022},journal={International Joint Conference on Artificial Intelligence (IJCAI)},}
We present the first learning-based approach to estimate the patient’s internal organ deformation for arbitrary human poses.
@article{smpla2022,title={SMPL-A: Modeling Person-Specific Deformable Anatomy},author={Guo, Hengtao and Planche, Benjamin and Zheng, Meng and Karanam, Srikrishna and Chen, Terrence and Wu, Ziyan},year={2022},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
AAAI Workshop on Artificial Intelligence with Biased or Scarce Data (AIBSD)
The official proceedings of the Workshop on Artificial Intelligence with Biased or Scarce Data in conjunction with AAAI Conference on Artificial Intelligence 2022.
@article{aibsdworkshop2022,title={AAAI Workshop on Artificial Intelligence with Biased or Scarce Data (AIBSD)},author={Peng, Kuan-Chuan and Wu, Ziyan},year={2022},journal={MDPI Comput. Sci. Math. Forum},}
Preserving Privacy in Federated Learning with Ensemble Cross-Domain Knowledge Distillation
We propose a quantized and noisy ensemble of local predictions from completely trained local models for stronger privacy guarantees without sacrificing accuracy. Based on extensive experiments on classification and segmentation tasks, we show that our method outperforms baseline FL algorithms with superior performance in both accuracy and data privacy preservation.
@article{preserving2022162,title={Preserving Privacy in Federated Learning with Ensemble Cross-Domain Knowledge Distillation},author={Gong, Xuan and Sharma, Abhishek and Karanam, Srikrishna and Wu, Ziyan and Chen, Terrence and Doermann, David and Innanje, Arun},year={2022},journal={AAAI Conference on Artificial Intelligence (AAAI)},}
Multi-motion and Appearance Self-Supervised Moving Object Detection
We propose a Multi-motion and Appearance Self-supervised Network (MASNet) that introduces multi-scale motion information and scene appearance information for moving object detection (MOD). Multi-scale motion helps aggregate partially detected regions into a more complete detection, while appearance information serves as an additional cue when motion independence is unreliable and helps remove false detections in the background caused by locally independent background motion.
@article{multimotion2022157,title={Multi-motion and Appearance Self-Supervised Moving Object Detection},author={Yang, Fan and Karanam, Srikrishna and Zheng, Meng and Chen, Terrence and Ling, Haibin and Wu, Ziyan},year={2022},journal={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},}
Zero-shot Deep Domain Adaptation with Common Representation Learning
We propose zero-shot deep domain adaptation (ZDDA). ZDDA-C/ML learns to generate common representations for source- and target-domain data. These representations can then be used either to train a system that works on both domains, or to eliminate the need for one domain in sensor-fusion settings. In this paper, two variants of ZDDA are developed, for classification and metric-learning tasks respectively.
@article{zeroshot2022114,title={Zero-shot Deep Domain Adaptation with Common Representation Learning},author={Kutbi, Mohammed and Peng, Kuan-Chuan and Wu, Ziyan},year={2022},journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 44, No. 7, pp. 3909-3924},}
2021
Everybody Is Unique: Towards Unbiased Human Mesh Recovery
We present a generalized human mesh optimization algorithm that substantially improves the performance of existing methods on both obese person images as well as community-standard benchmark datasets. The proposed method utilizes only 2D annotations without relying on supervision from expensive-to-create mesh parameters.
@article{everybody2021140,title={Everybody Is Unique: Towards Unbiased Human Mesh Recovery},author={Li, Ren and Karanam, Srikrishna and Zheng, Meng and Chen, Terrence and Wu, Ziyan},year={2021},journal={British Machine Vision Conference (BMVC)},note={oral},}
Learning Local Recurrent Models for Human Mesh Recovery
We present a new method for video mesh recovery that divides the human mesh into several local parts following the standard skeletal model. We then model the dynamics of each local part with separate recurrent models, with each model conditioned appropriately based on the known kinematic structure of the human body.
@article{learning2021149,title={Learning Local Recurrent Models for Human Mesh Recovery},author={Li, Runze and Karanam, Srikrishna and Li, Ren and Chen, Terrence and Bhanu, Bir and Wu, Ziyan},year={2021},journal={International Conference on 3D Vision (3DV)},}
Ensemble Attention Distillation for Privacy-Preserving Federated Learning
We propose a new distillation-based FL framework that can preserve privacy by design, while also consuming substantially less network communication resources when compared to the current methods. Our framework engages in inter-node communication using only publicly available and approved datasets, thereby giving explicit privacy control to the user. To distill knowledge among the various local models, our framework involves a novel ensemble distillation algorithm that uses both final prediction as well as model attention.
@article{ensemble2021187,title={Ensemble Attention Distillation for Privacy-Preserving Federated Learning},author={Gong, Xuan and Sharma, Abhishek and Karanam, Srikrishna and Wu, Ziyan and Chen, Terrence and Doermann, David and Innanje, Arun},year={2021},journal={IEEE/CVF International Conference on Computer Vision (ICCV)},}
Spatio-Temporal Representation Factorization for Video-based Person Re-Identification
We propose Spatio-Temporal Representation Factorization (STRF), a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID. The key innovations of STRF over prior work include explicit pathways for learning discriminative temporal and spatial features, with each component further factorized to capture complementary person-specific appearance and motion information. Specifically, temporal factorization comprises two branches, one each for static features (e.g., the color of clothes) that do not change much over time, and dynamic features (e.g., walking patterns) that change over time.
@article{spatiotemporal2021147,title={Spatio-Temporal Representation Factorization for Video-based Person Re-Identification},author={Aich, Abhishek and Zheng, Meng and Karanam, Srikrishna and Chen, Terrence and Roy-Chowdhury, Amit K. and Wu, Ziyan},year={2021},journal={IEEE/CVF International Conference on Computer Vision (ICCV)},}
A Peek Into the Reasoning of Neural Networks: Interpreting with Structural Visual Concepts
We propose a visual reasoning explanation framework (VRX) to interpret neural networks, which extracts relevant class-specific visual concepts and organizes them using structural concept graphs based on pairwise concept relationships. By means of knowledge distillation, we show VRX can take a step towards mimicking the reasoning process of NNs and provide logical, concept-level explanations for final model decisions. With extensive experiments, we empirically show VRX can meaningfully answer "why" and "why not" questions about the prediction, providing easy-to-understand insights about the reasoning process. We also show that these insights can potentially provide guidance on improving NN performance.
@article{a2021136,title={A Peek Into the Reasoning of Neural Networks: Interpreting with Structural Visual Concepts},author={Ge, Yunhao and Xiao, Yao and Xu, Zhi and Zheng, Meng and Karanam, Srikrishna and Chen, Terrence and Itti, Laurent and Wu, Ziyan},year={2021},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
Learning Hierarchical Attention for Weakly-supervised Chest X-Ray Abnormality Localization and Diagnosis
We propose a new attention-driven weakly supervised algorithm comprising a hierarchical attention mining framework that unifies activation- and gradient-based visual attention in a holistic manner. Our key algorithmic innovations include the design of explicit ordinal attention constraints, enabling principled model training in a weakly-supervised fashion, while also facilitating the generation of visual-attention-driven model explanations by means of localization cues.
@article{learning2021132,title={Learning Hierarchical Attention for Weakly-supervised Chest X-Ray Abnormality Localization and Diagnosis},author={Ouyang, Xi and Karanam, Srikrishna and Wu, Ziyan and Chen, Terrence and Huo, Jiayu and Zhou, Xiang Sean and Wang, Qian and Cheng, Jie-Zhi},year={2021},journal={IEEE Transactions on Medical Imaging (TMI), Vol. 40, No. 10, pp. 2698-2710},}
This paper considers the problem of 3D patient body modeling. Such a 3D model provides valuable information for improving patient care, streamlining clinical workflow, automated parameter optimization for medical devices etc. We present a novel robust dynamic fusion technique that facilitates flexible multi-modal inference, resulting in accurate 3D body modeling even when the input sensor modality is only a subset of the training modalities.
@article{robust2020,title={Robust Multi-modal 3D Patient Body Modeling},author={Yang, Fan and Li, Ren and Georgakis, Georgios and Karanam, Srikrishna and Chen, Terrence and Ling, Haibin and Wu, Ziyan},year={2020},journal={Medical Image Computing and Computer Assisted Intervention (MICCAI)},}
The COVID-19 pandemic, caused by the highly contagious SARS-CoV-2 virus, has overwhelmed healthcare systems worldwide, putting medical professionals at a high risk of getting infected themselves due to a global shortage of personal protective equipment. To help alleviate this problem, we design and develop a contactless patient positioning system that can enable scanning patients in a completely remote and contactless fashion. Our key design objective is to reduce the physical contact time with a patient as much as possible, which we achieve with our contactless workflow.
@article{towards2020,title={Towards Contactless Patient Positioning},author={Yang, Fan and Karanam, Srikrishna and Li, Ren and Hu, Wei and Chen, Terrence and Wu, Ziyan},year={2020},journal={IEEE Transactions on Medical Imaging (TMI), Vol. 39, No. 8, pp. 2701-2710},}
Review of Artificial Intelligence Techniques in Imaging Data Acquisition, Segmentation and Diagnosis for COVID-19
We cover the entire pipeline of medical imaging and analysis techniques involved with COVID-19, including image acquisition, segmentation, diagnosis, and follow-up. We particularly focus on the integration of AI with X-ray and CT, both of which are widely used in frontline hospitals, in order to depict the latest progress of medical imaging and radiology in fighting COVID-19.
@article{review2020,title={Review of Artificial Intelligence Techniques in Imaging Data Acquisition, Segmentation and Diagnosis for COVID-19},author={Shi, Feng and Wang, Jun and Shi, Jun and Wu, Ziyan and Wang, Qian and Tang, Zhenyu and He, Kelei and Shi, Yinghuan and Shen, Dinggang},year={2020},journal={IEEE Reviews in Biomedical Engineering (RBME), Vol. 14, pp. 4-15},}
oral
Towards Visually Explaining Variational Autoencoders
We propose the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, and how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement.
@article{towardsvae2020,title={Towards Visually Explaining Variational Autoencoders},author={Liu, Wenqian and Li, Runze and Zheng, Meng and Karanam, Srikrishna and Wu, Ziyan and Bhanu, Bir and Radke, Richard J. and Camps, Octavia},year={2020},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},note={oral},}
In this work, we address this gap by proposing a new technique for regressing a human parametric model that is explicitly informed by the known hierarchical structure, including the joint interdependencies of the model. This results in a strongly prior-informed regressor architecture and an associated hierarchical optimization that can flexibly be used in conjunction with current standard frameworks for 3D human mesh recovery. *Equal Contributions
@article{hierarchical2020100,title={Hierarchical Kinematic Human Mesh Recovery},author={Georgakis, Georgios and Li, Ren and Karanam, Srikrishna and Chen, Terrence and Kosecka, Jana and Wu, Ziyan},year={2020},journal={European Conference on Computer Vision (ECCV)},}
This is an extension of our CVPR 18 work with added support of bounding box labels seamlessly integrated with image level and pixel level labels for weakly supervised semantic segmentation.
@article{guided2020180,title={Guided Attention Inference Network},author={Li, Kunpeng and Wu, Ziyan and Peng, Kuan-Chuan and Ernst, Jan and Fu, Yun},year={2020},journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 42, No. 12, pp. 2996-3010},}
Knowledge distillation should not only focus on "what", but also "why". We proposed an online learning method to preserve the existing knowledge without storing any data.
@article{memorizing2019,title={Learning without Memorizing},author={Dhar, Prithviraj and Singh, Rajat Vikram and Peng, Kuan-Chuan and Wu, Ziyan and Chellappa, Rama},year={2019},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
Re-identification with Consistent Attentive Siamese Networks
We proposed the first learning architecture that integrates attention consistency modeling and Siamese representation learning in a joint learning framework for person re-id.
@article{casn2019,title={Re-identification with Consistent Attentive Siamese Networks},author={Zheng, Meng and Karanam, Srikrishna and Wu, Ziyan and Radke, Richard J.},year={2019},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
We present a technique to produce counterfactual visual explanations. Given a 'query' image I for which a vision system predicts class c, a counterfactual visual explanation identifies how I could change such that the system would output a different specified class c'.
@article{counterfactual2019,title={Counterfactual Visual Explanations},author={Goyal, Yash and Wu, Ziyan and Ernst, Jan and Batra, Dhruv and Parikh, Devi and Lee, Stefan},year={2019},journal={International Conference on Machine Learning (ICML)},}
A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets
We present an extensive review and performance evaluation of single and multi-shot re-id algorithms based on a new large-scale dataset.
@article{benchmark2019,title={A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets},author={Karanam, Srikrishna and Gou, Mengran and Wu, Ziyan and Rates-Borras, Angels and Camps, Octavia and Radke, Richard J.},year={2019},journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},}
We present a method to incrementally generate complete 2D or 3D scenes with global consistency at each step according to a learned scene prior. Real observations of a scene can be incorporated while preserving global consistency, and unobserved regions can be hallucinated locally in a manner consistent with previous observations, prior hallucinations, and global priors. Hallucinations are statistical in nature, i.e., different scenes can be generated from the same observations.
@article{incremental2019,title={Incremental Scene Synthesis},author={Planche, Benjamin and Rong, Xuejian and Wu, Ziyan and Karanam, Srikrishna and Kosch, Harald and Tian, YingLi and Ernst, Jan and Hutter, Andreas},year={2019},journal={Annual Conference on Neural Information Processing Systems (NeurIPS)},}
Sharpen Focus: Learning with Attention Separability and Consistency
We improve the generalizability of CNNs by means of a new framework that makes class-discriminative attention a principled part of the learning process. We propose new learning objectives for attention separability and cross-layer consistency, which result in improved attention discriminability and reduced visual confusion.
@article{sharpen2019,title={Sharpen Focus: Learning with Attention Separability and Consistency},author={Wang, Lezi and Wu, Ziyan and Karanam, Srikrishna and Peng, Kuan-Chuan and Singh, Rajat Vikram and Liu, Bo and Metaxas, Dimitris N.},year={2019},journal={IEEE International Conference on Computer Vision (ICCV)},}
Learning Local RGB-to-CAD Correspondences for Object Pose Estimation
We solve the key problem of existing 3D object pose estimation methods requiring expensive 3D pose annotations by proposing a new method that matches RGB images to CAD models for object pose estimation. Our method requires neither real-world textures for CAD models nor explicit 3D pose annotations for RGB images.
@article{learning2019,title={Learning Local RGB-to-CAD Correspondences for Object Pose Estimation},author={Georgakis, Georgios and Karanam, Srikrishna and Wu, Ziyan and Kosecka, Jana},year={2019},journal={IEEE International Conference on Computer Vision (ICCV)},}
Seeing Beyond Appearance - Mapping Real Images into Geometrical Domains for Unsupervised CAD-based Recognition
We introduce a pipeline to map unseen target samples into the synthetic domain used to train task-specific methods, denoising the data and retaining only the features these recognition algorithms are familiar with.
@article{seeing2019,title={Seeing Beyond Appearance - Mapping Real Images into Geometrical Domains for Unsupervised CAD-based Recognition},author={Planche, Benjamin and Zakharov, Sergey and Wu, Ziyan and Hutter, Andreas and Kosch, Harald and Ilic, Slobodan},year={2019},journal={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},}
We propose zero-shot deep domain adaptation (ZDDA) for domain adaptation and sensor fusion. ZDDA learns from the task-irrelevant dual-domain pairs when the task-relevant target-domain training data is unavailable.
@article{zdda2018,title={Zero Shot Deep Domain Adaptation},author={Peng, Kuan-Chuan and Wu, Ziyan and Ernst, Jan},year={2018},journal={European Conference on Computer Vision (ECCV)},}
spotlight
Tell Me Where To Look: Guided Attention Inference Network
We address three shortcomings of previous approaches in modeling attention maps: (1) making attention maps an explicit component of end-to-end training, (2) providing self-guidance directly on these maps, and (3) bridging the gap between weak and extra supervision.
@article{tellmewhere2018,title={Tell Me Where To Look: Guided Attention Inference Network},author={Li, Kunpeng and Wu, Ziyan and Peng, Kuan-Chuan and Ernst, Jan and Fu, Yun},year={2018},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},note={spotlight},}
spotlight
Learning Compositional Visual Concepts with Mutual Consistency
We proposed ConceptGAN, a novel concept learning framework where we seek to capture underlying semantic shifts between data domains instead of mappings restricted to training distributions.
@article{conceptgan2018,title={Learning Compositional Visual Concepts with Mutual Consistency},author={Gong, Yunye and Karanam, Srikrishna and Wu, Ziyan and Peng, Kuan-Chuan and Ernst, Jan and Doerschuk, Peter C.},year={2018},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},note={spotlight},}
End-to-End Learning of Keypoint Detector and Descriptor for Pose Invariant 3D Matching
We proposed an end-to-end learning framework for keypoint detection and its representation (descriptor) for 3D depth maps or 3D scans.
@article{e2ekeypoint2018,title={End-to-End Learning of Keypoint Detector and Descriptor for Pose Invariant 3D Matching},author={Georgakis, Georgios and Karanam, Srikrishna and Wu, Ziyan and Ernst, Jan and Kosecka, Jana},year={2018},journal={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},}
Learning Affine Hull Representations for Multi-Shot Person Re-Identification
We describe the image sequence data using affine hulls and incorporate affine hull data modeling into the traditional distance metric learning framework.
@article{affinehull2018,title={Learning Affine Hull Representations for Multi-Shot Person Re-Identification},author={Karanam, Srikrishna and Wu, Ziyan and Radke, Richard J.},year={2018},journal={IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)},volume={28},number={10},pages={2500--2512},}
oral
Keep it Unreal: Bridging the Realism Gap for 2.5D Recognition with Geometry Priors Only
We propose a novel approach leveraging only CAD models to bridge the realism gap. A GAN learns to effectively segment depth images and recover the clean synthetic-looking depth information.
@article{keepitunreal2018,title={Keep it Unreal: Bridging the Realism Gap for 2.5D Recognition with Geometry Priors Only},author={Zakharov, Sergey and Planche, Benjamin and Wu, Ziyan and Hutter, Andreas and Kosch, Harald and Ilic, Slobodan},year={2018},journal={International Conference on 3D Vision (3DV)},note={oral},}
We proposed a weakly supervised approach to summarize videos with only video-level annotation, introducing an effective method for computing spatio-temporal importance scores.
@article{videosum2017,title={Weakly Supervised Summarization of Web Videos},author={Panda, Rameswar and Das, Abir and Wu, Ziyan and Ernst, Jan and Roy-Chowdhury, Amit K.},year={2017},journal={IEEE International Conference on Computer Vision (ICCV)},}
oral
DepthSynth: Real-Time Realistic Synthetic Data Generation from CAD Models for 2.5D Recognition
We propose an end-to-end framework which simulates the whole mechanism of 3D sensors, generating realistic depth data from 3D models.
@article{depthsynth2017,title={DepthSynth: Real-Time Realistic Synthetic Data Generation from CAD Models for 2.5D Recognition},author={Planche, Benjamin and Wu, Ziyan and Ma, Kai and Sun, Shanhui and Kluckner, Stefan and Chen, Terrence and Hutter, Andreas and Kosch, Harald and Ernst, Jan},year={2017},journal={International Conference on 3D Vision (3DV)},note={oral},}
We present a method to track vessels in angiography. The vessel tree tracking problem is solved using an efficient dynamic programming algorithm.
@article{vesseltree2017,title={Vessel Tree Tracking in Angiographic Sequences},author={Zhang, Dong and Sun, Shanhui and Wu, Ziyan and Chen, Bor-Jeng and Chen, Terrence},year={2017},journal={Journal of Medical Imaging (JMI)},volume={4},number={2},pages={025001},}
From the Lab to the Real World: Re-Identification in an Airport Camera Network
We detail the challenges of the real-world airport environment and the computer vision algorithms underlying our human detection and re-identification system.
@article{labtorealistic2017,title={From the Lab to the Real World: Re-Identification in an Airport Camera Network},author={Camps, Octavia and Gou, Mengran and Hebble, Tom and Karanam, Srikrishna and Lehmann, Oliver and Li, Yang and Radke, Richard J. and Wu, Ziyan and Xiong, Fei},year={2017},journal={IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)},volume={27},number={3},pages={540--553},}
This book covers aspects of human re-identification problems related to computer vision and machine learning, bridging the gap between research and reality.
We model the wire-like structure as a sequence of small segments and formulate guidewire tracking as a graph-based optimization problem.
@article{guidewire2016,title={Guidewire Tracking Using a Novel Sequential Segment Optimization Method in Interventional X-Ray Videos},author={Chen, Bor-Jeng and Wu, Ziyan and Sun, Shanhui and Zhang, Dong and Chen, Terrence},year={2016},journal={IEEE International Symposium on Biomedical Imaging (ISBI)},}
2015
Viewpoint Invariant Human Re-Identification in Camera Networks Using Pose Priors and Subject-Discriminative Features
We build a model for human appearance as a function of pose, using training data gathered from a calibrated camera.
@article{poseprior2015,title={Viewpoint Invariant Human Re-Identification in Camera Networks Using Pose Priors and Subject-Discriminative Features},author={Wu, Ziyan and Li, Yang and Radke, Richard J.},year={2015},journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},volume={37},number={5},pages={1095--1108},}
Multi-Shot Human Re-Identification Using Adaptive Fisher Discriminant Analysis
We introduce an algorithm to hierarchically cluster image sequences and use the representative data samples to learn a feature subspace maximizing the Fisher criterion.
@article{adaptivefisher2015,title={Multi-Shot Human Re-Identification Using Adaptive Fisher Discriminant Analysis},author={Li, Yang and Wu, Ziyan and Karanam, Srikrishna and Radke, Richard J.},year={2015},journal={British Machine Vision Conference (BMVC)},}
Multi-Shot Re-identification with Random-Projection-based Random Forest
We perform dimensionality reduction on image feature vectors through random projection for multi-shot Re-ID.
@article{randomforest2015,title={Multi-Shot Re-identification with Random-Projection-based Random Forest},author={Li, Yang and Wu, Ziyan and Radke, Richard J.},year={2015},journal={IEEE Winter Conference on Applications of Computer Vision (WACV)},}
2014
Multi-Object Tracking and Association With a Camera Network
This thesis investigates several important and challenging computer vision problems related to system calibration, multi-object tracking, and target behavior analysis.
@phdthesis{phdthesis2014,title={Multi-Object Tracking and Association With a Camera Network},author={Wu, Ziyan},year={2014},school={Rensselaer Polytechnic Institute (RPI)},}
oral
Virtual Insertion: Robust Bundle Adjustment over Long Video Sequences
We propose a novel "virtual insertion" scheme for Structure from Motion (SfM), which constructs virtual points and virtual frames to handle outages in visual landmark links.
@article{virtualinsertion2014,title={Virtual Insertion: Robust Bundle Adjustment over Long Video Sequences},author={Wu, Ziyan and Chiu, Han-Pang and Zhu, Zhiwei},year={2014},journal={British Machine Vision Conference (BMVC)},note={oral},}
Improving Counterflow Detection in Dense Crowds with Scene Features
This paper addresses the problem of detecting counterflow motion in videos of highly dense crowds by identifying scene features.
@article{counterflow2014,title={Improving Counterflow Detection in Dense Crowds with Scene Features},author={Wu, Ziyan and Radke, Richard J.},year={2014},journal={Pattern Recognition Letters (PRL)},volume={44},pages={152--160},}
Real-World Re-Identification in an Airport Camera Network
We discuss the high-level system design of the video surveillance application, and the issues we encountered during our development and testing. We also describe the algorithm framework for our human re-identification software, and discuss considerations of speed and matching performance.
@article{realworld2014194,title={Real-World Re-Identification in an Airport Camera Network},author={Li, Yang and Wu, Ziyan and Karanam, Srikrishna and Radke, Richard J.},year={2014},journal={ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC)},}
We propose a complete model for a pan-tilt-zoom camera that explicitly reflects how focal length and lens distortion vary as a function of zoom scale.
@article{calibrated2013,title={Keeping a PTZ Camera Calibrated},author={Wu, Ziyan and Radke, Richard J.},year={2013},journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},volume={35},number={8},pages={1994--2007},}
2012
Using Scene Features to Improve Wide-Area Video Surveillance
We introduce two novel methods to improve the performance of wide area video surveillance applications by using scene features.
@article{scenefeatures2012,title={Using Scene Features to Improve Wide-Area Video Surveillance},author={Wu, Ziyan and Radke, Richard J.},year={2012},journal={Workshop on Camera Networks and Wide Area Scene Analysis (CVPRW)},}
2011
Real-Time Airport Security Checkpoint Surveillance Using a Camera Network
We introduce an airport security checkpoint surveillance system using a camera network that maintains the association between bags and passengers.
@article{airportsecurity2011,title={Real-Time Airport Security Checkpoint Surveillance Using a Camera Network},author={Wu, Ziyan and Radke, Richard J.},year={2011},journal={Workshop on Camera Networks and Wide Area Scene Analysis (CVPRW)},}
We present resources for fostering paper-based election technology, comprising a diverse collection of real and simulated ballot and survey images.
@article{electiontech2011,title={Towards Improved Paper-based Election Technology},author={Barney Smith, Elisa and Lopresti, Daniel and Nagy, George and Wu, Ziyan},year={2011},journal={International Conference on Document Analysis and Recognition (ICDAR)},}
Robust tools were developed for determining the underlying grid of the targets on ballots challenged in the 2008 Minnesota elections.
@article{ballots2011,title={Characterizing Challenged Minnesota Ballots},author={Nagy, George and Lopresti, Daniel and Barney Smith, Elisa and Wu, Ziyan},year={2011},journal={Document Recognition and Retrieval XVIII (DRR)},}