The ROOTS Search Tool: Data Transparency for LLMs / Piktus, Aleksandra; Akiki, Christopher; Villegas, Paulo; Laurençon, Hugo; Dupont, Gérard; Luccioni, Sasha; Jernite, Yacine; Rogers, Anna. - (2023), pp. 304-314. (Paper presented at the ACL conference held in Toronto, Canada) [10.18653/v1/2023.acl-demo.29].

The ROOTS Search Tool: Data Transparency for LLMs

Piktus, Aleksandra
2023

Abstract

ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces: https://huggingface.co/spaces/bigscience-data/roots-search. We describe our implementation and the possible use cases of our tool.
ACL
natural language processing; large language models; training data inspection tools
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
File: Piktus_ROOTS_2023.pdf
Access: open access
Note: DOI: 10.18653/v1/2023.acl-demo.29 - https://aclanthology.org/2023.acl-demo.29.pdf
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 384.1 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1717586
Citations
  • PMC ND
  • Scopus 11
  • Web of Science (ISI) 3