Improving image interpretation using deep neural networks

Type: Thesis
Level: Master's
Title: Improving image interpretation using deep neural networks
Presented by: Ehsan Akefi
Supervisors: Hassan Khotanlou, Muharram Mansoorizadeh
Advisors:
Examiners: Reza Mohammadi, Abbas Ramazani
Date of presentation: 2025
Place of presentation: Amphitheater
Abstract: Image captioning lies at the intersection of two key areas, computer vision and natural language processing, and aims to produce accurate, meaningful, and relevant descriptions of images. This research presents a novel approach for improving image captioning systems that combines two advanced models, Swin Transformer and ConvNeXt, to extract visual features. The Swin Transformer model, with its hierarchical structure, effectively identifies relationships among the components of an image, while the ConvNeXt model, by enhancing the architecture of convolutional neural networks, extracts and represents local image features with greater precision. This combination enables more comprehensive and accurate feature extraction, thereby increasing the model's accuracy in understanding the content of the image. For text generation, a transformer-based model is used that leverages the attention mechanism, reinforcement learning, and the similarity score obtained from the CLIP model to generate precise, distinctive, and relevant descriptions of the image. The CLIP model, which jointly represents images and texts in a shared semantic space, is employed to evaluate and guide the model toward producing meaningful sentences that correspond to the image. For evaluation, in addition to traditional metrics such as BLEU, CIDEr, and ROUGE, more advanced metrics such as CLIPScore and BERTScore are used to measure the semantic similarity of the generated descriptions. Experimental results on standard datasets such as MS COCO and a newer dataset called FineCapEval indicate that the proposed method significantly improves the quality and accuracy of the generated descriptions compared to previous models. By presenting a comprehensive and innovative approach to visual feature extraction and textual description generation, this research takes an effective step toward addressing the challenges in this field.
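
The abstract describes two components that can be illustrated in code: fusing Swin Transformer and ConvNeXt features before caption decoding, and using CLIP image-text similarity as a reward signal during reinforcement learning. The following PyTorch sketch is illustrative only and is not the thesis implementation; the feature dimensions, the Hugging Face CLIP checkpoint, and the function and class names are assumptions made for the example.

# Illustrative sketch only (not the thesis code). Assumes PyTorch and the
# Hugging Face "transformers" package; all dimensions and names below are
# assumptions made for this example.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor


class FusedVisualEncoder(nn.Module):
    """Projects Swin Transformer and ConvNeXt features into a shared width
    and concatenates them along the token axis for the caption decoder."""

    def __init__(self, swin_dim=768, convnext_dim=1024, d_model=512):
        super().__init__()
        self.proj_swin = nn.Linear(swin_dim, d_model)
        self.proj_conv = nn.Linear(convnext_dim, d_model)

    def forward(self, swin_tokens, convnext_tokens):
        # swin_tokens:     (B, N_s, swin_dim)     patch tokens from Swin
        # convnext_tokens: (B, N_c, convnext_dim) flattened ConvNeXt feature map
        fused = torch.cat(
            [self.proj_swin(swin_tokens), self.proj_conv(convnext_tokens)], dim=1
        )
        return fused  # (B, N_s + N_c, d_model), cross-attended by the decoder


@torch.no_grad()
def clip_similarity(clip_model, clip_processor, images, captions, device="cpu"):
    """Cosine similarity between CLIP image and text embeddings; one common
    way to obtain a per-caption reward or a CLIPScore-style evaluation signal."""
    inputs = clip_processor(
        text=captions, images=images, return_tensors="pt", padding=True
    ).to(device)
    img = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip_model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)  # (B,) similarities in [-1, 1]


# Example usage with an assumed public checkpoint:
# clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
# clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
#
# Self-critical-style policy gradient with the CLIP similarity as reward:
# advantage = clip_similarity(sampled) - clip_similarity(greedy baseline)
# loss = -(advantage.detach() * sampled_log_probs.sum(dim=-1)).mean()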
File: Download