Named Entity Recognition in Italian Lung Cancer Clinical Reports using Transformers

Paolo, D.; Bria, A.; Greco, C.; Russano, M.; Ramella, S.; Soda, P.; Sicilia, R.

doi:10.1109/BIBM58861.2023.10385778

The widespread adoption of electronic health records (EHRs) offers a valuable opportunity to support clinical research by containing crucial patient information, including diagnoses, symptoms, medications, lab tests, and more. Despite the success of deep learning for biomedical Named Entity Recognition (NER), the literature in this field still presents a gap regarding applications focused on lung cancer for the Italian language. Hence, this paper presents a transformer-based approach to extract named entities from Italian clinical notes related to Non-Small Cell Lung Cancer (NSCLC). We introduce a novel set of 25 clinical entities related to NSCLC building a corpus annotated for NER. We apply a state-of the-art model pre-trained on Italian biomedical texts to the manually annotated clinical reports of a cohort of 257 patients suffering from NSCLC, successfully dealing with class-imbalance problems and obtaining promising performance (average F1-score of 84.3%). We also compared our method with two other pre-trained state-of-the-art models showing that the domain specific knowledge offered by the proposed approach is necessary to achieve higher performance. These findings also showcase the feasibility of using transformers to extract biomedical information in the Italian language.