Multimodal knowledge-intensive NLP: from representation to augmentation and evaluation

Ghonim, Karim Kamal Mohamed Ashraf

Modern Artificial Intelligence (AI) and Natural Language Processing (NLP) systems are increasingly deployed in knowledge-intensive scenarios that require broad and up-to-date world knowledge, ranging from question answering and information retrieval to digital assistants and automated fact-checking. This progress is largely driven by pretrained Large Language Models (LLMs) and multimodal models, which acquire substantial information through training on large-scale corpora. Yet, despite these advances, current models still face three recurring limitations that hinder reliable deployment. First, their knowledge of concepts and entities is uneven and often skewed toward the domains and distributions represented in widely used training resources and benchmarks. Second, they are typically deployed under a fixed-snapshot view of world knowledge, mirroring the static nature of the inventories that support many downstream pipelines, such as knowledge bases, which makes it difficult to handle newly emerging entities and rapidly evolving facts. Third, when models generate content in settings where new entities and information emerge continuously, current evaluation practices provide limited assurance of factual accuracy and little transparency into which statements are unsupported and why. These challenges become even more pronounced in multimodal settings, where systems must jointly leverage visual and textual evidence to access and validate knowledge. Motivated by these shortcomings, this thesis advances knowledge-intensive language and multimodal AI along three complementary directions that jointly support coverage, dynamic augmentation, and verification. We first develop large-scale resources and evaluation protocols that broaden semantic coverage beyond the narrow scope of widely used datasets, enabling systematic analysis of representation gaps and supporting the development and assessment of models that generalize across diverse concepts and entities.
However, even broad-coverage static resources cannot keep pace with knowledge that emerges after their construction. To address this, we introduce retrieval-augmented generation methods that leverage externally retrieved evidence to produce factually grounded information for newly emerging entities, enabling knowledge inventories to be updated over time without retraining the downstream models that rely on them. As automatically generated content becomes a pathway for continual knowledge integration, ensuring its reliability becomes essential. We therefore propose evidence-grounded evaluation frameworks that decompose model outputs into verifiable units and validate each against source documents, providing fine-grained, interpretable factuality assessment suitable for reference-scarce settings. Overall, this thesis advances AI systems that (i) cover a broader portion of real-world semantics, (ii) remain effective as knowledge evolves and new entities appear, and (iii) can be assessed with transparent, evidence-based factuality signals, enabling more reliable deployment in knowledge-intensive applications.

Multimodal knowledge-intensive NLP: from representation to augmentation and evaluation / Ghonim, K.K.M.A.. - (2026 May 18).