Human Activity Recognition in Videos using Extended Machine Learning Methods
Type: Thesis
Level: Master's
Title: Human Activity Recognition in Videos using Extended Machine Learning Methods
Presenter: Kobra Porlefteh
Supervisors: Dr. Hassan Khotanlou, Dr. Muharram Mansoorizadeh
Advisors:
Examiners: Dr. Abbas Ramezani, Dr. Razieh Torkamani
Time and date of presentation: 2025
Place of presentation: Faculty of Engineering Amphitheater
Abstract: Human activity recognition in videos is one of the fundamental problems in computer vision. Owing to the complexity of human behaviors and the limitations of unimodal models, this task faces numerous challenges. In this research, a novel multimodal architecture is proposed to improve activity recognition, especially when no samples of a specific activity are available in the training data (zero-shot conditions). The proposed method is built upon the X-CLIP model, an extension of the CLIP image–language model adapted for video understanding. In this architecture, an audio processing branch is incorporated alongside the visual input branch. To integrate auditory and visual information effectively, a token concatenation mechanism is employed so that both modalities are modeled as a unified token sequence. A lightweight temporal transformer (MIT) is then used to better capture the relationships between sound and vision and to extract shared temporal patterns. This design allows the model to fuse audio-visual features efficiently and coherently at the token level without relying on complex or heavy fusion modules. For training and evaluation, the model was first fine-tuned on the Kinetics-400 dataset and then tested under zero-shot conditions on the UCF101 dataset. Data augmentation techniques were also applied during training to address dataset imbalance. Experimental results indicate that combining the audio stream with video improves baseline performance, particularly in categories with a strong semantic dependence on sound (such as playing musical instruments or speaking). At the same time, the lightweight, simple design of the architecture maintains stable and reliable performance even on silent videos. Overall, the findings demonstrate that a lightweight, unified fusion of audio and visual information based on shared tokens can serve as an effective and scalable approach for enhancing human activity recognition systems in real-world conditions.
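For illustration, the following is a minimal PyTorch sketch of the token-level fusion described in the abstract: per-frame visual tokens and per-segment audio tokens are concatenated into one sequence and passed through a lightweight temporal transformer before classification. Module names, dimensions, and pooling are illustrative assumptions, not the thesis implementation.

import torch
import torch.nn as nn

class AudioVisualTokenFusion(nn.Module):
    def __init__(self, dim=512, num_layers=2, num_heads=8, num_classes=400):
        super().__init__()
        # Lightweight temporal transformer over the joint token sequence
        # (standing in for the MIT-style module mentioned in the abstract).
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim,
            batch_first=True,
        )
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, T_v, dim), one token per video frame
        # audio_tokens:  (B, T_a, dim), one token per audio segment
        tokens = torch.cat([visual_tokens, audio_tokens], dim=1)  # unified token sequence
        tokens = self.temporal_transformer(tokens)                # shared temporal modeling
        pooled = tokens.mean(dim=1)                               # simple average pooling
        return self.classifier(pooled)

# Usage with random features standing in for encoder outputs
# (e.g., X-CLIP visual tokens and a hypothetical audio encoder's tokens).
model = AudioVisualTokenFusion()
logits = model(torch.randn(2, 8, 512), torch.randn(2, 4, 512))
print(logits.shape)  # torch.Size([2, 400])

Because fusion happens simply by concatenating tokens before a small shared transformer, no dedicated cross-modal fusion module is needed, which is what keeps the design lightweight.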