Data Augmentation for Data-Centric AI Through the Lens of Semantic Technologies: A Position Paper / Cabibbo, Luca; Bertillo, Daniele; Cima, Gianluca; Crescenzi, Valter; Console, Marco; Delfino, Roberto Maria; Iannucci, Stefano; Lembo, Domenico; Lenzerini, Maurizio; Marconi, Lorenzo; Merialdo, Paolo; Napoleone, Marco; Papi, Laura; Poggi, Antonella; Scafoglieri, Federico; Torlone, Riccardo. - 4182:(2026). (Symposium on Advanced Database Systems, Ischia, Italy).
Data Augmentation for Data-Centric AI Through the Lens of Semantic Technologies: A Position Paper
Gianluca Cima; Valter Crescenzi; Marco Console; Roberto Maria Delfino; Stefano Iannucci; Domenico Lembo; Maurizio Lenzerini; Lorenzo Marconi; Laura Papi; Antonella Poggi; Federico Scafoglieri
2026
Abstract
Data augmentation is a fundamental machine learning technique that improves model generalization by artificially expanding training datasets. However, conventional augmentation approaches often rely on heuristic transformations that may not fully capture domain-specific knowledge. This position paper advocates a data-centric AI perspective on data augmentation, emphasizing the integration of semantic technologies, particularly domain ontologies, to guide augmentation strategies. Only a few recent papers have applied techniques from Symbolic AI to data augmentation. Our goal is to explore this idea further, on the grounds that an explicit representation of the domain may help in two key tasks: optimizing the generation of new data and validating the generated data, both fundamental steps in any data augmentation strategy. We aim to develop novel approaches that combine ontologies and data augmentation techniques to address these two tasks, in particular by relying on automated reasoning. We argue that leveraging knowledge representation and symbolic reasoning enables more principled and context-aware data augmentation, leading to improved model robustness and fairness.


