OPEN DATA FROM PaN FACILITIES FOR MACHINE LEARNING

Exploiting open data for machine learning training: can the Photon and Neutron community do it?

Introduction

During the last decade, most European Photon and Neutron (PaN) facilities have adopted open data policies, making data available for the benefit of the entire scientific community. At the same time, machine learning (ML) is seen as an essential tool to address the exponential growth of data volumes from PaN facilities.

Exploitation of experimental training datasets is a key component of machine learning. The combination of ML algorithms and open data can therefore be seen as an ideal marriage that would ultimately help the entire community to tackle ‘big data’ challenges with more automation.

However, finding the right data to train machine learning algorithms is a challenge. Indeed, one of the motivations for making data FAIR (Findable, Accessible, Interoperable, Reusable) is exactly this: to provide scientists working on AI applications with quality training datasets.

But what does 'quality' mean to PaN science communities? What metadata fields are needed to find the data, to judge whether it is suitable for our research, and ultimately to use it to train our models? How can we provide sufficiently rich metadata? What would enable more machine learning applications? How can we improve the collaboration between data producers (domain scientists) and data consumers (ML experts)?
