Machine learning algorithms have become popular tools for the automated classification of text; however, their performance varies and depends on several factors. We examined how a subjective labeling process based on a human factors taxonomy influences human, as well as automated, classification of aviation safety incident reports. To evaluate these effects, we trained a machine learning classifier on a subset of 17,253 incident reports from the NASA Aviation Safety Reporting System using multi-label classification, and collected labels from six human annotators, each labeling a representative subset of 400 incident reports, for a total of 2,400 individual annotations. Results showed that, in general, the reliability of human annotation for the set of incident reports selected in this study was comparatively low. The performance of machine learning annotation followed patterns of human agreement on labels. Suggestions on how to improve the data collection and labeling process are provided.
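The reliability of human annotation mentioned above is typically quantified with a chance-corrected agreement statistic. As a minimal sketch (the specific metric and label names below are illustrative assumptions, not taken from the study), Cohen's kappa for two annotators assigning one category per report can be computed as:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    labels_a, labels_b: equal-length sequences of category labels
    assigned to the same items by annotator A and annotator B.
    """
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label marginals.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of five reports with human factors categories.
ann_a = ["fatigue", "fatigue", "communication", "communication", "other"]
ann_b = ["fatigue", "communication", "communication", "communication", "other"]
print(cohen_kappa(ann_a, ann_b))  # 0.6875
```

For six annotators and multi-label assignments, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual extension; the two-annotator case above just illustrates the chance-correction idea.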
Boesser, C. T., & Jentsch, F. (2021). Comparing Human and Machine Learning Classification of Human Factors in Incident Reports From Aviation. 27th International Symposium on Aviation Psychology, 340-345.