Introduction
The QUAERO French Medical Corpus was initially developed as a resource for named entity recognition and normalization [1]. It was then improved to create a gold-standard set of normalized entities for French biomedical text, which was used in the CLEF eHealth evaluation lab [2].
A selection of MEDLINE titles and EMEA documents was manually annotated. The annotation process was guided by concepts in the Unified Medical Language System (UMLS):
1. Ten types of clinical entities, as defined by the following UMLS Semantic Groups (Bodenreider and McCray 2003), were annotated: Anatomy, Chemical and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, Procedures.
2. The annotations were made in a comprehensive fashion, so that nested entities were marked, and entities could be mapped to more than one UMLS concept. In particular:
   (a) If a mention can refer to more than one Semantic Group, all the relevant Semantic Groups should be annotated. For instance, the mention “récidive” (recurrence) in the phrase “prévention des récidives” (recurrence prevention) should be annotated with the category “DISORDER” (CUI C2825055) and the category “PHENOMENON” (CUI C0034897).
   (b) If a mention can refer to more than one UMLS concept within the same Semantic Group, all the relevant concepts should be annotated. For instance, the mention “maniaques” (obsessive) in the phrase “patients maniaques” (obsessive patients) should be annotated with CUIs C0564408 and C0338831 (category “DISORDER”).
   (c) Entities whose span overlaps with that of another entity should still be annotated. For instance, in the phrase “infarctus du myocarde” (myocardial infarction), the mention “myocarde” (myocardium) should be annotated with category “ANATOMY” (CUI C0027061) and the mention “infarctus du myocarde” should be annotated with category “DISORDER” (CUI C0027051).
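The guidelines above imply that a single stretch of text can carry several annotations: nested mentions, and mentions linked to more than one CUI. The following Python sketch shows one possible in-memory representation and a check for nested spans. The `Entity` class, the `nested_pairs` helper, and the short group labels are illustrative assumptions, not part of the corpus distribution.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for one annotated mention."""
    start: int              # character offset of the first character
    end: int                # character offset just past the last character
    group: str              # UMLS Semantic Group label (assumed short form)
    cuis: Tuple[str, ...]   # one or more UMLS concept identifiers
    text: str


def nested_pairs(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return (outer, inner) pairs where inner lies strictly within outer."""
    pairs = []
    for outer in entities:
        for inner in entities:
            if outer is inner:
                continue
            same_span = (outer.start, outer.end) == (inner.start, inner.end)
            contained = outer.start <= inner.start and inner.end <= outer.end
            if contained and not same_span:
                pairs.append((outer, inner))
    return pairs


# Guideline (c): a nested mention inside a longer mention.
phrase = "infarctus du myocarde"
inner_start = phrase.index("myocarde")
entities = [
    Entity(0, len(phrase), "DISO", ("C0027051",), phrase),
    Entity(inner_start, len(phrase), "ANAT", ("C0027061",), "myocarde"),
]

# Guideline (b): one mention mapped to two concepts in the same group.
phrase_b = "patients maniaques"
start_b = phrase_b.index("maniaques")
maniaques = Entity(start_b, len(phrase_b), "DISO",
                   ("C0564408", "C0338831"), "maniaques")

for outer, inner in nested_pairs(entities):
    print(f"{inner.text!r} ({inner.group}) is nested in "
          f"{outer.text!r} ({outer.group})")
```

Two mentions with identical spans, as in guideline (a), are deliberately not reported as nested here; they would simply appear as two records sharing the same offsets.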
Examples of the annotations found in the QUAERO French Medical Corpus are provided in the File Format section below.
License
The QUAERO French Medical Corpus is released under the GNU Free Documentation License (GFDL).
Any research using this corpus for running experiments should include the following citation:
Névéol A, Grouin C, Leixa J, Rosset S, Zweigenbaum P. The QUAERO French Medical Corpus: A Ressource for Medical Entity Recognition and Normalization. Fourth Workshop on Building and Evaluating Ressources for Health and Biomedical Text Processing - BioTxtM2014. 2014:24-30
Here is the Bibtex entry:
@InProceedings{neveol14quaero,
  author = {Névéol, Aurélie and Grouin, Cyril and Leixa, Jeremy and Rosset, Sophie and Zweigenbaum, Pierre},
  title = {The {QUAERO} {French} Medical Corpus: A Ressource for Medical Entity Recognition and Normalization},
  OPTbooktitle = {Proceedings of the Fourth Workshop on Building and Evaluating Ressources for Health and Biomedical Text Processing},
  booktitle = {Proc of BioTextMining Work},
  OPTseries = {BioTxtM 2014},
  year = {2014},
  pages = {24--30},
}
File Format
Annotations are provided in the standoff format of the brat rapid annotation tool (BRAT), described at http://brat.nlplab.org/standoff.html, and can be loaded into BRAT for visualization.
Sample annotations are shown below.
Sample MEDLINE title 1:
La contraception par les dispositifs intra utérins

Sample MEDLINE title 1 annotations:
T1 PROC 3 16 contraception
#1 AnnotatorNotes T1 C0700589
T2 DEVI 25 50 dispositifs intra utérins
#2 AnnotatorNotes T2 C0021900
T3 ANAT 43 50 utérins
#3 AnnotatorNotes T3 C0042149

Sample MEDLINE title 2:
Méningites bactériennes de l' adulte en réanimation médicale .

Sample MEDLINE title 2 annotations:
T1 DISO 0 23 Méningites bactériennes
#1 AnnotatorNotes T1 C0085437
T2 LIVB 29 36 adulte
#2 AnnotatorNotes T2 C0001765
T3 PROC 40 60 réanimation médicale
#3 AnnotatorNotes T3 C0085559

Sample EMEA document (excerpt):
(...) Dans quel cas Tysabri est-il utilisé ? Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques ( SEP ). (...)

Sample EMEA document annotations (excerpt):
(...)
T9 CHEM 206 213 Tysabri
#9 AnnotatorNotes T9 C1529600
T10 CHEM 233 240 Tysabri
#10 AnnotatorNotes T10 C1529600
T11 PROC 261 271 traitement
#11 AnnotatorNotes T11 C0087111
T12 LIVB 276 283 adultes
#12 AnnotatorNotes T12 C0001675
T13 DISO 296 315 sclérose en plaques
#13 AnnotatorNotes T13 C0026769
T14 DISO 318 321 SEP
#14 AnnotatorNotes T14 C0026769
(...)
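The sample annotations above can be read with a short parser. The sketch below is a minimal Python reader for the two line types that appear in the samples: text-bound annotations (`T…`) and annotator notes (`#…`). It assumes the tab-separated layout of the BRAT standoff format, and it assumes that a note body holds one or more CUIs separated by whitespace or commas; that separator convention is a guess, not something the samples confirm. The `Mention` class and `parse_ann` function are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Mention:
    """Hypothetical container for one text-bound annotation."""
    ident: str
    group: str
    fragments: List[Tuple[int, ...]]   # (start, end) offsets, one per fragment
    text: str
    cuis: List[str] = field(default_factory=list)


def parse_ann(content: str) -> Dict[str, Mention]:
    """Parse T-lines and #-lines of a standoff file into Mention records."""
    mentions: Dict[str, Mention] = {}
    notes: List[Tuple[str, str]] = []
    for line in content.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or unsupported lines
        ident, head, body = fields[0], fields[1], fields[2]
        if ident.startswith("T"):
            group, offsets = head.split(" ", 1)
            # Discontinuous spans are written as "start end;start end".
            fragments = [tuple(int(n) for n in frag.split())
                         for frag in offsets.split(";")]
            mentions[ident] = Mention(ident, group, fragments, body)
        elif ident.startswith("#"):
            _note_type, target = head.split(" ", 1)
            notes.append((target, body))
    # Attach notes after all mentions are known, whatever the line order.
    for target, body in notes:
        if target in mentions:
            cuis = [c for c in re.split(r"[,\s]+", body.strip()) if c]
            mentions[target].cuis.extend(cuis)
    return mentions


sample_text = "La contraception par les dispositifs intra utérins"
sample_ann = "\n".join([
    "T1\tPROC 3 16\tcontraception",
    "#1\tAnnotatorNotes T1\tC0700589",
    "T2\tDEVI 25 50\tdispositifs intra utérins",
    "#2\tAnnotatorNotes T2\tC0021900",
    "T3\tANAT 43 50\tutérins",
    "#3\tAnnotatorNotes T3\tC0042149",
])

mentions = parse_ann(sample_ann)
for m in mentions.values():
    start, end = m.fragments[0]
    # Offsets index into the companion .txt file as character positions.
    print(m.ident, m.group, repr(sample_text[start:end]), m.cuis)
```

Keeping notes in a separate pass avoids relying on each `#` line following its `T` line, which the samples suggest but the format does not require.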
Corpus Download
Version released in December 2015, as an archive of the CLEF eHealth 2015 Task 1b Dataset: Download here with appropriate credentials (Contact us to obtain credentials).

Training Folder

Corpus | File type | Description | Number of Files |
---|---|---|---|
MEDLINE | .txt | Corpus text files: article titles (in French) | 833 files |
MEDLINE | .ann | Annotation files in BRAT standoff format | 833 files |
MEDLINE | .conf | BRAT configuration files | 3 files |
EMEA | .txt | Corpus text files: EMEA drug inserts (in French) | 3 documents, segmented into 11 files |
EMEA | .ann | Annotation files in BRAT standoff format | 11 files |
EMEA | .conf | BRAT configuration files | 3 files |

Test Folder

Corpus | File type | Description | Number of Files |
---|---|---|---|
MEDLINE | .txt | Corpus text files: article titles (in French) | 832 files |
MEDLINE | .ann | Annotation files in BRAT standoff format | 832 files |
MEDLINE | .conf | BRAT configuration files | 3 files |
EMEA | .txt | Corpus text files: EMEA drug inserts (in French) | 3 documents, segmented into 12 files |
EMEA | .ann | Annotation files in BRAT standoff format | 12 files |
EMEA | .conf | BRAT configuration files | 3 files |

Evaluation Folder

Software | Description | Number of Files |
---|---|---|
.jar | brateval tool with specific functionalities developed for CLEF eHealth 2015 Task 1b | 1 package |
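Since each corpus folder holds one .ann file per .txt file, a loader can pair them by file stem. The sketch below assumes that a text file and its annotation file share the same base name in the same directory; the archive layout is not described here, so the `paired_documents` helper and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def paired_documents(corpus_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (.txt, .ann) path pairs that share a stem, sorted by name."""
    pairs = []
    for txt_path in sorted(corpus_dir.glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if ann_path.exists():
            pairs.append((txt_path, ann_path))
    return pairs


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.txt", "a.ann", "b.txt", "annotation.conf"):
        (root / name).write_text("", encoding="utf-8")
    demo_pairs = [(t.name, a.name) for t, a in paired_documents(root)]

print(demo_pairs)
```

Comparing the number of pairs with the file counts in the tables above is a quick sanity check after unpacking the archive.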
People Involved
- Cyril Grouin
- Jeremy Leixa
- Aurélie Névéol
- Sophie Rosset
- Xavier Tannier
- Pierre Zweigenbaum
Publications
- [1] Névéol A, Grouin C, Leixa J, Rosset S, Zweigenbaum P. The QUAERO French Medical Corpus: A Ressource for Medical Entity Recognition and Normalization. Fourth Workshop on Building and Evaluating Ressources for Health and Biomedical Text Processing - BioTxtM2014. 2014:24-30.
- [2] Névéol A, Grouin C, Tannier X, Hamon T, Kelly L, Goeuriot L, Zweigenbaum P. (2015) Task 1b of the CLEF eHealth Evaluation Lab 2015: Clinical Named Entity Recognition. CLEF 2015 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS, September, 2015.
Acknowledgements
This work was funded by OSEO under the Quaero program and by the ANR CABeRNeT project (ANR-13-JS02-009).