News
December 2015
First Release of the QUAERO French Medical Corpus


The QUAERO French Medical Corpus

Introduction

The QUAERO French Medical Corpus was initially developed as a resource for named entity recognition and normalization [1]. It was then improved to create a gold-standard set of normalized entities for French biomedical text, which was used in the CLEF eHealth evaluation lab [2].

A selection of MEDLINE titles and EMEA documents was manually annotated. The annotation process was guided by concepts in the Unified Medical Language System (UMLS):

1. Ten types of clinical entities, as defined by the following UMLS Semantic Groups (Bodenreider and McCray 2003), were annotated: Anatomy, Chemicals and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, Procedures.

2. The annotations were made comprehensively, so that nested entities were marked and entities could be mapped to more than one UMLS concept. In particular:
   (a) If a mention can refer to more than one Semantic Group, all the relevant Semantic Groups should be annotated. For instance, the mention “récidive” (recurrence) in the phrase “prévention des récidives” (recurrence prevention) should be annotated with the category “DISORDER” (CUI C2825055) and the category “PHENOMENON” (CUI C0034897).
   (b) If a mention can refer to more than one UMLS concept within the same Semantic Group, all the relevant concepts should be annotated. For instance, the mention “maniaques” (obsessive) in the phrase “patients maniaques” (obsessive patients) should be annotated with CUIs C0564408 and C0338831 (category “DISORDER”).
   (c) Entities whose span overlaps with that of another entity should still be annotated. For instance, in the phrase “infarctus du myocarde” (myocardial infarction), the mention “myocarde” (myocardium) should be annotated with category “ANATOMY” (CUI C0027061) and the mention “infarctus du myocarde” should be annotated with category “DISORDER” (CUI C0027051).
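As an illustration of rules (a) and (c), the following Python sketch represents annotations as plain tuples and shows how shared and nested spans can be detected. The helper names (`span_of`, `group_by_span`, `nested_pairs`) are hypothetical and not part of the corpus tooling, the offsets are computed relative to each example phrase rather than taken from a corpus file, and the choice of the plural form “récidives” as the annotated span is an assumption.

```python
from collections import defaultdict

def span_of(phrase, mention):
    """Return (start, end) character offsets of the first occurrence of mention."""
    start = phrase.index(mention)
    return start, start + len(mention)

def group_by_span(annotations):
    """Collect every (group, CUI) pair attached to the same span."""
    by_span = defaultdict(list)
    for start, end, group, cui in annotations:
        by_span[(start, end)].append((group, cui))
    return dict(by_span)

def nested_pairs(annotations):
    """Return (inner, outer) pairs where inner lies inside a different, larger span."""
    pairs = []
    for inner in annotations:
        for outer in annotations:
            same_span = inner[:2] == outer[:2]
            contained = outer[0] <= inner[0] and inner[1] <= outer[1]
            if contained and not same_span:
                pairs.append((inner, outer))
    return pairs

# Rule (a): one mention annotated with two Semantic Groups.
# Assumption: the annotated span is the plural form found in the phrase.
phrase_a = "prévention des récidives"
recurrence = [
    (*span_of(phrase_a, "récidives"), "DISORDER", "C2825055"),
    (*span_of(phrase_a, "récidives"), "PHENOMENON", "C0034897"),
]

# Rule (c): an ANATOMY mention nested inside a DISORDER mention.
phrase_c = "infarctus du myocarde"
infarct = [
    (*span_of(phrase_c, "infarctus du myocarde"), "DISORDER", "C0027051"),
    (*span_of(phrase_c, "myocarde"), "ANATOMY", "C0027061"),
]

print(group_by_span(recurrence))
print(nested_pairs(infarct))
```

Keeping one tuple per (span, group, CUI) combination means that rule (b), several concepts within the same Semantic Group, needs no special handling: it simply adds another tuple with the same span and group.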

Examples of the annotations found in the QUAERO French Medical Corpus are provided below.

License

The QUAERO French Medical Corpus is released under the GNU Free Documentation License (GFDL).

Any research using this corpus for running experiments should include the following citation:

Névéol A, Grouin C, Leixa J, Rosset S, Zweigenbaum P. The QUAERO French Medical Corpus: A Ressource for Medical Entity Recognition and Normalization. Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing - BioTxtM2014. 2014:24-30.

Here is the BibTeX entry:

@InProceedings{neveol14quaero, 
  author = {Névéol, Aurélie and Grouin, Cyril and Leixa, Jeremy 
            and Rosset, Sophie and Zweigenbaum, Pierre},
  title = {The {QUAERO} {French} Medical Corpus: A Ressource for
           Medical Entity Recognition and Normalization}, 
  OPTbooktitle = {Proceedings of the Fourth Workshop on Building 
                 and Evaluating Resources for Health and Biomedical 
                 Text Processing}, 
  booktitle = {Proc of BioTextMining Work}, 
  OPTseries = {BioTxtM 2014}, 
  year = {2014}, 
  pages = {24--30}, 
}
             

File Format

Annotations are available in the standoff format of the BRAT Rapid Annotation Tool (BRAT), described at http://brat.nlplab.org/standoff.html, and can be loaded into BRAT for visualization.

Sample annotations are shown below.

Sample MEDLINE title 1
La contraception par les dispositifs intra utérins
Sample MEDLINE title 1 annotations
T1 PROC 3 16 contraception
#1 AnnotatorNotes T1 C0700589
T2 DEVI 25 50 dispositifs intra utérins
#2 AnnotatorNotes T2 C0021900
T3 ANAT 43 50 utérins
#3 AnnotatorNotes T3 C0042149
Sample MEDLINE title 2
Méningites bactériennes de l' adulte en réanimation médicale .
Sample MEDLINE title 2 annotations
T1 DISO 0 23 Méningites bactériennes
#1 AnnotatorNotes T1 C0085437
T2 LIVB 29 36 adulte
#2 AnnotatorNotes T2 C0001765
T3 PROC 40 60 réanimation médicale
#3 AnnotatorNotes T3 C0085559
Sample EMEA document (excerpt)
(...)
Dans quel cas Tysabri est-il utilisé ?
Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques ( SEP ).
(...)
Sample EMEA document annotations (excerpt)
(...)
T9 CHEM 206 213 Tysabri
#9 AnnotatorNotes T9 C1529600
T10 CHEM 233 240 Tysabri
#10 AnnotatorNotes T10 C1529600
T11 PROC 261 271 traitement
#11 AnnotatorNotes T11 C0087111
T12 LIVB 276 283 adultes
#12 AnnotatorNotes T12 C0001675
T13 DISO 296 315 sclérose en plaques
#13 AnnotatorNotes T13 C0026769
T14 DISO 318 321 SEP
#14 AnnotatorNotes T14 C0026769
(...)
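
The samples above can be read programmatically. The following Python sketch parses text-bound annotations (`T` lines) and their `AnnotatorNotes` (`#` lines) and attaches the CUIs to each entity. It assumes the tab-separated layout described in the BRAT standoff documentation (the tabs are not visible in the samples as displayed here) and assumes that CUIs can be recognized as the letter C followed by seven digits, so that a note holding several CUIs is handled whatever separator is used. The names `Entity` and `parse_ann` are hypothetical and not part of the corpus distribution.

```python
import re
from dataclasses import dataclass, field

# Assumption: a CUI is "C" followed by seven digits, as in the samples.
CUI_PATTERN = re.compile(r"C\d{7}")

@dataclass
class Entity:
    ann_id: str          # e.g. "T1"
    group: str           # e.g. "PROC"
    spans: list          # list of (start, end) character offsets
    text: str            # the annotated mention
    cuis: list = field(default_factory=list)

def parse_ann(content):
    """Parse BRAT standoff content into a dict of Entity objects keyed by ID."""
    entities = {}
    notes = []
    for line in content.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and annotation types not handled here
        if line.startswith("T"):
            ann_id, type_and_offsets, text = parts
            group, offsets = type_and_offsets.split(" ", 1)
            # Discontinuous spans are written "start end;start end".
            spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
            entities[ann_id] = Entity(ann_id, group, spans, text)
        elif line.startswith("#"):
            target = parts[1].split()[-1]  # "AnnotatorNotes T1" -> "T1"
            notes.append((target, parts[2]))
    # Attach notes after all entities are known, in case a note precedes its target.
    for target, note in notes:
        if target in entities:
            entities[target].cuis.extend(CUI_PATTERN.findall(note))
    return entities

title = "La contraception par les dispositifs intra utérins"
sample = (
    "T1\tPROC 3 16\tcontraception\n"
    "#1\tAnnotatorNotes T1\tC0700589\n"
    "T2\tDEVI 25 50\tdispositifs intra utérins\n"
    "#2\tAnnotatorNotes T2\tC0021900\n"
    "T3\tANAT 43 50\tutérins\n"
    "#3\tAnnotatorNotes T3\tC0042149\n"
)
entities = parse_ann(sample)
for entity in entities.values():
    start, end = entity.spans[0]
    print(entity.ann_id, entity.group, title[start:end], entity.cuis)
```

Offsets are character positions in the corresponding .txt file, so slicing the text with a span recovers the mention, as the loop at the end does for the first sample title.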

Corpus Download

The version released in December 2015 is an archive of the CLEF eHealth 2015 Task 1b dataset. Download here with appropriate credentials (contact us to obtain credentials).

Training Folder

MEDLINE Corpus   Description                                        Number of Files
.txt             Corpus text files: article titles (in French)      833 files
.ann             Annotation files in BRAT standoff format           833 files
.conf            BRAT configuration files                           3 files

EMEA Corpus      Description                                        Number of Files
.txt             Corpus text files: EMEA drug inserts (in French)   3 documents, segmented into 11 files
.ann             Annotation files in BRAT standoff format           11 files
.conf            BRAT configuration files                           3 files

Test Folder

MEDLINE Corpus   Description                                        Number of Files
.txt             Corpus text files: article titles (in French)      832 files
.ann             Annotation files in BRAT standoff format           832 files
.conf            BRAT configuration files                           3 files

EMEA Corpus      Description                                        Number of Files
.txt             Corpus text files: EMEA drug inserts (in French)   3 documents, segmented into 12 files
.ann             Annotation files in BRAT standoff format           12 files
.conf            BRAT configuration files                           3 files

Evaluation Folder

Software         Description                                        Number of Files
.jar             brateval tool with specific functionalities        1 package
                 developed for CLEF eHealth 2015 Task 1b
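
Since each corpus folder holds as many .ann files as .txt files, a quick sanity check after unpacking the archive is to pair them up. The Python sketch below does this under the assumption that a text file and its annotation file share the same base name, as is usual for BRAT. The function name `paired_documents` and the file names in the demonstration are hypothetical; the demonstration runs on a temporary folder, not on the actual corpus layout.

```python
import tempfile
from pathlib import Path

def paired_documents(folder):
    """Return ([(txt_path, ann_path), ...], [txt paths lacking an .ann file])."""
    folder = Path(folder)
    pairs = []
    missing = []
    for txt_path in sorted(folder.glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if ann_path.exists():
            pairs.append((txt_path, ann_path))
        else:
            missing.append(txt_path)
    return pairs, missing

# Demonstration on a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "doc1.txt").write_text(
        "La contraception par les dispositifs intra utérins", encoding="utf-8"
    )
    (root / "doc1.ann").write_text("T1\tPROC 3 16\tcontraception\n", encoding="utf-8")
    (root / "doc2.txt").write_text("texte sans annotation", encoding="utf-8")

    pairs, missing = paired_documents(root)
    pair_names = [(txt.name, ann.name) for txt, ann in pairs]
    missing_names = [path.name for path in missing]

print(pair_names)
print(missing_names)
```

On the real corpus, the length of the returned pair list can be compared with the file counts listed above, for example 833 for the MEDLINE training folder.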

People Involved

  • Cyril Grouin
  • Jeremy Leixa
  • Aurélie Névéol
  • Sophie Rosset
  • Xavier Tannier
  • Pierre Zweigenbaum

Publications

  • [1] Névéol A, Grouin C, Leixa J, Rosset S, Zweigenbaum P. The QUAERO French Medical Corpus: A Ressource for Medical Entity Recognition and Normalization. Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing - BioTxtM2014. 2014:24-30.
  • [2] Névéol A, Grouin C, Tannier X, Hamon T, Kelly L, Goeuriot L, Zweigenbaum P. Task 1b of the CLEF eHealth Evaluation Lab 2015: Clinical Named Entity Recognition. CLEF 2015 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, September 2015.

Acknowledgements

This work was funded by OSEO under the Quaero program and by the ANR CABeRNeT project (ANR-13-JS02-009).