Datasets for Fair Name-Based Gender Prediction in Scientific Communities / Guariglia Migliore, Maria; D'Agostino, Gregorio; Patriarca, Tatiana; De Nicola, Antonio. - (2025). [10.6084/m9.figshare.29909603.v1]

Datasets for Fair Name-Based Gender Prediction in Scientific Communities

Maria Guariglia Migliore; Gregorio D'Agostino; Tatiana Patriarca; Antonio De Nicola
2025

Abstract

The datasets support the evaluation of fair name-based gender prediction software across two scientific domains: energy transition and critical infrastructures. Each dataset contains public information on scientific authors together with their gender, determined through manual validation and compared against predictions from multiple automated tools. The gender labels represent the assessment of human annotators based solely on the information available (e.g., names) and do not necessarily reflect the self-identified gender or gender perception of the authors.

The energy transition dataset is derived from papers retrieved from Scopus using the query “energy transition” OR “energy transformation”. The initial set of 17,591 papers was refined to 10,130 using the Energy Systems Ontology (ESO) (De Nicola et al., 2024); the refined set was authored by 27,363 individuals. From this population, 1,000 authors were randomly selected for manual gender validation, resulting in 260 females, 575 males, and 165 of undetermined gender.

The critical infrastructures dataset is based on all 380 papers published between 2006 and 2022 in the proceedings of the International Conference on Critical Information Infrastructures Security (CRITIS), involving 929 authors. All authors were manually validated, yielding 153 females, 768 males, and 8 of undetermined gender.

The datasets are provided in JSON format, one file per domain:

- ET-report.json contains records for the 1,000 manually validated authors in the energy transition dataset. Each record includes the author’s full name, the Semantic Scholar ID, the manual validation gender label, and the predictions from multiple automated gender prediction tools (Prediction Manager, Gender API, ChatGPT, and NamSor).
- CRITIS-report.json contains records for all 929 manually validated authors in the critical infrastructures dataset, with the same structure and fields as the energy transition file, except without the Semantic Scholar ID.

These structured files enable reproducible analysis, cross-tool performance comparisons, and integration into further research workflows.
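As an illustration of the cross-tool comparison these files are meant to support, the following is a minimal Python sketch, not the authors' method. The record does not state the exact JSON keys or label values, so the field names (manual_gender, prediction_manager, gender_api, chatgpt, namsor), the label strings, and the assumption that each file holds a JSON array of records are all hypothetical and should be adapted to the real files. Excluding authors labelled as undetermined from the accuracy score is likewise a design choice made here for illustration.

```python
import json
from collections import Counter

# Hypothetical field names: the real keys in ET-report.json and
# CRITIS-report.json are not given in the record and may differ.
MANUAL_KEY = "manual_gender"
TOOL_KEYS = ["prediction_manager", "gender_api", "chatgpt", "namsor"]


def load_records(path):
    """Load a report file, assuming it holds a JSON array of author records."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def label_counts(records):
    """Count the manual-validation labels across all records."""
    return Counter(rec[MANUAL_KEY] for rec in records)


def tool_accuracy(records, tool_key, skip_label="undetermined"):
    """Share of records where a tool agrees with the manual label.

    Records whose manual label equals skip_label are left out of the
    score (an assumption of this sketch). Returns None if none remain.
    """
    scored = [rec for rec in records if rec[MANUAL_KEY] != skip_label]
    if not scored:
        return None
    hits = sum(1 for rec in scored if rec.get(tool_key) == rec[MANUAL_KEY])
    return hits / len(scored)


# Made-up records for illustration only; they are not taken from the datasets.
sample = [
    {"name": "Author A", "manual_gender": "female", "prediction_manager": "female",
     "gender_api": "female", "chatgpt": "female", "namsor": "male"},
    {"name": "Author B", "manual_gender": "male", "prediction_manager": "male",
     "gender_api": "male", "chatgpt": "male", "namsor": "male"},
    {"name": "Author C", "manual_gender": "male", "prediction_manager": "male",
     "gender_api": "male", "chatgpt": "female", "namsor": "male"},
    {"name": "Author D", "manual_gender": "female", "prediction_manager": "unknown",
     "gender_api": "female", "chatgpt": "male", "namsor": "female"},
    {"name": "Author E", "manual_gender": "undetermined", "prediction_manager": "male",
     "gender_api": "male", "chatgpt": "male", "namsor": "male"},
]

if __name__ == "__main__":
    print(label_counts(sample))
    for key in TOOL_KEYS:
        print(key, tool_accuracy(sample, key))
```

With the real files, sample would be replaced by load_records("ET-report.json") or load_records("CRITIS-report.json"), after adjusting the key names and label values to those actually used in the data.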
Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1753945