Named Entity Recognition in Italian Lung Cancer Clinical Reports using Transformers

Bria A.
2023-01-01

Abstract

The widespread adoption of electronic health records (EHRs) offers a valuable opportunity to support clinical research, because EHRs contain crucial patient information, including diagnoses, symptoms, medications, and lab tests. Despite the success of deep learning for biomedical Named Entity Recognition (NER), the literature still lacks applications focused on lung cancer in the Italian language. Hence, this paper presents a transformer-based approach to extract named entities from Italian clinical notes related to Non-Small Cell Lung Cancer (NSCLC). We introduce a novel set of 25 clinical entities related to NSCLC and build a corpus annotated for NER. We apply a state-of-the-art model pre-trained on Italian biomedical texts to the manually annotated clinical reports of a cohort of 257 patients suffering from NSCLC, successfully dealing with class-imbalance problems and obtaining promising performance (average F1-score of 84.3%). We also compare our method with two other pre-trained state-of-the-art models, showing that the domain-specific knowledge offered by the proposed approach is necessary to achieve higher performance. These findings also showcase the feasibility of using transformers to extract biomedical information in the Italian language.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11580/107385