Author: Joan Giner Miguelez
Programme: Doctoral Programme in Network and Information Technologies
Language: English
Supervision: Dr Abel Gómez Llana & Dr Jordi Cabot Sagrera
Faculty / Institute: Doctoral School UOC
Subjects: Computer Science
Key words: data-sharing practices, machine learning, trustworthy AI, fairness, data documentation
Area of knowledge: Network and Information Technologies
Summary
Machine learning (ML) technology may discriminate against specific social groups. For example, recent research has revealed that ML applications in hospitals are more likely to fail to identify women than men. Studies have identified the data used to train these models as one of the causes of these issues. The research community has proposed guidelines to detect the dimensions of data that can generate these discriminatory behaviors. However, these proposals lack a fixed structure, which limits their automated processing and the creation of engineering approaches built upon them. This thesis presents a domain-specific language to document data for ML. This language has served as a basis for creating the responsible AI extension of Croissant, a standard adopted by major search engines, such as Google Dataset Search. Moreover, this thesis studies the use of large language models (LLMs) to automatically create data documentation, as well as the readiness of scientific data for use in ML.