Human Activity Recognition in Videos Using Extended Machine Learning Methods

Type: Thesis

Degree level: Master's

Title: Human Activity Recognition in Videos Using Extended Machine Learning Methods

Presenter: Kobra Porlefteh

Supervisors: Dr. Hassan Khotanlou, Dr. Muharram Mansoorizadeh

Advisors:

Examiners: Dr. Abbas Ramezani, Dr. Razieh Torkamani

Time and date of presentation: 2025

Place of presentation: Faculty of Engineering Amphitheater

Abstract: Human activity recognition in videos is one of the fundamental problems in computer vision. Due to the complexity of human behaviors and the limitations of unimodal models, this task continues to face numerous challenges. In this research, a novel multimodal architecture is proposed to improve activity recognition, especially in situations where no samples of a specific activity are available in the training data (zero-shot conditions). The proposed method is built upon the X-CLIP model, an extension of the CLIP image-language model adapted for video understanding. In this architecture, an audio processing branch is added alongside the visual input branch. To integrate auditory and visual information effectively, a token concatenation mechanism is employed so that both modalities are modeled as a single unified token sequence. A lightweight temporal transformer (MIT) is then applied to better capture the relationships between sound and vision and to extract shared temporal patterns. This design allows the model to fuse audio-visual features efficiently and coherently at the token level without relying on complex or heavy fusion modules. For training and evaluation, the model was first fine-tuned on the Kinetics-400 dataset and then tested under zero-shot conditions on the UCF101 dataset. In addition, data augmentation techniques were applied to mitigate dataset imbalance during training. Experimental results indicate that combining the audio stream with video improves on the baseline performance, particularly in categories with a strong semantic dependence on sound (such as playing musical instruments or speaking). At the same time, the lightweight and simple design of the architecture maintains stable and reliable performance even on silent videos. Overall, the findings demonstrate that a lightweight, unified fusion of audio and visual information based on shared tokens can serve as an effective and scalable approach for enhancing human activity recognition systems in real-world conditions.
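
To make the token-level fusion described in the abstract concrete, the following is a minimal PyTorch sketch, not the thesis implementation: per-frame visual tokens and audio tokens are projected to a shared width, concatenated into one sequence, and passed through a small temporal transformer before being pooled into a clip-level embedding. All module names, dimensions, and hyperparameters here are illustrative assumptions and are not taken from the thesis.

# Illustrative sketch of token-concatenation audio-visual fusion (assumed values).
import torch
import torch.nn as nn


class TokenConcatFusion(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, embed_dim=512,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Project each modality into the shared token width.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # Learned type embeddings mark which tokens are visual and which are audio
        # after concatenation.
        self.type_embed = nn.Embedding(2, embed_dim)
        # Lightweight temporal transformer over the fused token sequence
        # (playing the role the abstract assigns to the MIT module).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=2 * embed_dim,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, T_v, visual_dim) per-frame features from the video encoder
        # audio_tokens:  (B, T_a, audio_dim)  per-segment features from the audio encoder
        v = self.visual_proj(visual_tokens)
        a = self.audio_proj(audio_tokens)
        v = v + self.type_embed(torch.zeros(v.shape[:2], dtype=torch.long, device=v.device))
        a = a + self.type_embed(torch.ones(a.shape[:2], dtype=torch.long, device=a.device))
        tokens = torch.cat([v, a], dim=1)   # unified audio-visual token sequence
        fused = self.temporal(tokens)       # joint temporal modelling of both modalities
        return fused.mean(dim=1)            # pooled clip-level embedding


if __name__ == "__main__":
    model = TokenConcatFusion()
    video = torch.randn(2, 8, 512)   # 8 visual frame tokens per clip (assumed)
    audio = torch.randn(2, 4, 128)   # 4 audio tokens per clip (assumed)
    print(model(video, audio).shape)  # torch.Size([2, 512])

In a CLIP-style zero-shot setting, the pooled embedding produced above would be compared against text embeddings of the class names, which is why the fusion keeps the output in a single shared embedding space rather than adding a classification head.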