Unstructured data for large language models / Piktus, Aleksandra. - (2026 Jan 28).

Unstructured data for large language models

PIKTUS, ALEKSANDRA
28/01/2026

Abstract

In recent years, we have witnessed an impressive rise in the ubiquity of large language models (LLMs). Although their fundamental objective, predicting the most probable next word in a sequence, has remained unchanged, the models themselves have expanded dramatically in scale and capability, becoming the dominant paradigm in Natural Language Processing (NLP). Progress has been marked by the development of increasingly sophisticated evaluation benchmarks on one hand and by a growing demand for vast amounts of training data on the other. In this thesis, we examine how unstructured, primarily web-based data is utilized in LLM pre-training and fine-tuning. We investigate two principal roles that large textual corpora play within these models: first, as a source of world knowledge through retrieval augmentation, and second, as pre-training data. We begin by demonstrating how retrieval from large, unstructured web corpora can enhance performance on open-domain tasks, paving the way towards assistants capable of supporting humans in solving complex, knowledge-intensive problems. Next, we address the challenge of improving the robustness of pre-training data through the development of tools that enable qualitative analysis of massive text collections. Finally, we explore potential avenues for model scaling under data-constrained conditions, anticipating a future in which the entirety of publicly available web text may no longer suffice to meet the demands of ever-larger language models.
Files attached to this item

File: Tesi_dottorato_Piktus.pdf
Access: open access
Note: complete thesis
Type: Doctoral thesis
License: Creative Commons
Size: 3.53 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1761517