Extracting Medication Information from Discharge Summaries
Scott Halgrim, Fei Xia, Imre Solti, Eithon Cadag
University of Washington
University of Albany, SUNY
Seattle, WA 98195, USA
Albany, NY 12222, USA
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents, pages 61–67, Los Angeles, California, June 2010. ©2010 Association for Computational Linguistics
Abstract
Extracting medication information from clinical records has many potential applications and was the focus of the i2b2 challenge in 2009. We present a hybrid system, composed of machine learning and rule-based modules, for medication information extraction. The system uses only a handful of template-filling rules; its core is a cascade of statistical classifiers for field detection. It achieved performance comparable to the top systems in the i2b2 challenge, demonstrating that a heavily statistical approach can perform as well as or better than systems with many sophisticated rules. The system can easily incorporate additional resources such as medication name lists to further improve performance.
1 Introduction
Narrative clinical records store patient medical information, and extracting this information from these narratives supports data management and enables many applications (Levin et al., 2007). Informatics for Integrating the Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System, and it has organized annual NLP shared tasks and challenges since 2006 (https://www.i2b2.org/). The Third i2b2 Workshop on NLP Challenges for Clinical Records in 2009 studied the extraction of medication information from discharge summaries (https://www.i2b2.org/NLP/Medication/), a task we refer to as the i2b2 challenge in this paper.
In the past decade, there has been extensive research on information extraction in both the general and biomedical domains (Wellner et al., 2004; Grenager et al., 2005; Poon and Domingos, 2007; Meystre et al., 2008; Rozenfeld and Feldman, 2008). Interestingly, despite the recent prevalence of statistical approaches in most NLP tasks (including information extraction), most of the systems developed for the i2b2 challenge were rule-based. In this paper we present our hybrid system, whose core is a cascade of statistical classifiers that identify medication fields such as medication names and dosages. The fields are then assembled to form medication entries. While our system did not participate in the i2b2 challenge (as we were part of the organizing team), it achieved good results that matched the top i2b2 systems.
2 The i2b2 Challenge
This section provides a brief introduction to the i2b2 challenge.
2.1 The task
The i2b2 challenge studied the automatic extraction of information corresponding to the following fields from hospital discharge summaries (Uzuner et al., 2010a): names of medications (m) taken by the patient, dosages (do), modes (mo), frequencies (f), durations (du), and reasons (r) for taking these medications. We refer to the medication field as the name field and the other five fields as the non-name fields. All non-name fields correspond to some name field mention; if they were specified within a two-line window of that name mention, the i2b2 challenge required such fields to be linked to the name field to form an entry.
For each entry, a system must determine whether the entry appeared in a list of medications or in narrative text. Table 1 shows an excerpt from a discharge summary and the corresponding entries in the gold standard. The first entry appears in narrative text, and the second in a list of medication information.
Excerpt of Discharge Summary
55 the patient noted that he had a recurrence of this
56 vague chest discomfort as he was sitting and
57 talking to friends. He took a sublingual
58 Nitroglycerin without relief.
65 Flomax ( Tamsulosin ) 0.4 mg, po, qd,.
Gold Standard
m="Nitroglycerin" 58:0 58:0||do="nm"||mo="sublingual" 57:6 57:6||f="nm"||du="nm"||r="vague chest discomfort" 56:0 56:2||ln="narrative"
m="flomax ( tamsulosin )" 65:0 65:3||do="0.4 mg" 65:4 65:5||mo="po" 65:6 65:6||f="qd" 65:7 65:7||du="nm"||r="nm"||ln="list"
Table 1: A sample discharge summary excerpt and the corresponding entries in the gold standard. The fields inside an entry are separated by "||". Each field is represented by the string and its position (i.e., "line number: token number" offsets). "nm" means the field value is not mentioned for this medication.
2.2 Data Sets
The i2b2 challenge used a total of 1243 discharge summaries:
• 696 of these summaries were released to participants for system development, and the i2b2 organizing team provided the gold standard annotation for 17 of them.
• Participating teams could choose to annotate more files themselves. The University of Sydney team annotated 145 out of the 696 summaries (including re-annotating 14 of the 17 files annotated by the i2b2 organizing team) and generously shared their annotations with i2b2 after the challenge for future research. We obtained and used 110 of their annotations as our training set and the remaining 35 summaries as our development set.
• The participating teams produced system outputs for 547 discharge summaries set aside for testing. After the challenge, 251 of these summaries were annotated by the challenge participants, and these 251 summaries formed the final test set (Uzuner et al., 2010b).
The sizes of the data sets used in our experiments are shown in Table 2. The training and development sets were created by the University of Sydney, and the test data is the i2b2 official challenge test set. The average numbers of entries and fields vary among the three sets because the summaries in the test set were chosen randomly from the 547 summaries, whereas the University of Sydney team annotated the longest summaries.
Table 2: The data sets used in our experiments. The numbers in parentheses are the average numbers of entries or fields per discharge summary.
2.3 Additional resources
Besides the training data, the participating teams were allowed to use any additional tools and resources that they had access to, including resources not available to the public. All challenge participants used additional resources such as UMLS (www.nlm.nih.gov/research/umls/), but the exact resources used varied from team to team. Therefore, the challenge was similar to the so-called open-track challenge in the general NLP field, as opposed to a closed-track challenge that could require that all the participants use only the list of resources specified by the challenge organizers.
2.4 Evaluation metrics
The i2b2 challenge used two sets of evaluation metrics: horizontal and vertical metrics. Horizontal metrics measured system performance at the entry level, whereas vertical metrics measured system performance at the field level. Both sets of metrics compared the system output and the gold standard at the span level for exact match and at the token level for inexact match, using precision, recall, and F-score (Uzuner et al., 2010a). The primary metric for the challenge is exact horizontal F-score, which is the metric we use to evaluate our system.
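To make the exact-match evaluation of Section 2.4 concrete, the following is a minimal sketch of span-level precision, recall, and F-score. It is a simplification that scores individual fields (closer to the vertical metrics); the official horizontal metric operates on whole entries and is defined in Uzuner et al. (2010a). The names `Span` and `exact_prf`, and the "system" output below, are assumptions made for illustration only.

```python
from typing import NamedTuple, Set, Tuple


class Span(NamedTuple):
    """One detected field: its type plus 'line:token' start and end offsets."""
    field: str               # "m", "do", "mo", "f", "du", or "r"
    start: Tuple[int, int]   # (line number, token number)
    end: Tuple[int, int]


def exact_prf(gold: Set[Span], system: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall, and F-score where only identical spans count as correct."""
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Gold fields of the list entry in Table 1 (line 65).
gold = {
    Span("m", (65, 0), (65, 3)),
    Span("do", (65, 4), (65, 5)),
    Span("mo", (65, 6), (65, 6)),
    Span("f", (65, 7), (65, 7)),
}
# Hypothetical system output: the name span is too short, so it does not match.
system = (gold - {Span("m", (65, 0), (65, 3))}) | {Span("m", (65, 0), (65, 0))}

print(exact_prf(gold, system))
```

In this example three of the four system spans are identical to gold spans, so precision, recall, and F-score are all 0.75; the truncated name span earns no partial credit under exact matching.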
2.5 Participating systems
Twenty teams participated in the challenge. Fifteen teams used rule-based approaches, and the rest used statistical or hybrid approaches. The performance of the top five systems is shown in Table 3. Among them, only the top system, developed by the University of Sydney, used a hybrid approach, whereas the rest were rule-based.
Table 3: The performance (exact horizontal precision/recall/F-score) of the top five i2b2 systems on the test set.
3 System description
We developed a hybrid system with three processing steps: (1) a preprocessing step that finds section boundaries, (2) a field detection step that identifies the six fields, and (3) a field linking step that links fields together to form entries. The second step is a statistical system, whereas the other two steps are rule-based. The second step was the main focus of this study.
3.1 Preprocessing
In addition to common processing steps such as part-of-speech (POS) tagging, our preprocessing step includes a section segmenter that breaks discharge summaries into sections. Discharge summaries tend to consist of sections such as 'ADMIT DIAGNOSIS', 'PAST MEDICAL HISTORY', and 'DISCHARGE MEDICATIONS'. Knowing section boundaries is important for the i2b2 challenge because, according to the annotation guidelines for creating the gold standard, medications occurring under certain sections (e.g., family history and allergic reaction) should be excluded from the system output. Furthermore, knowing the types of sections could be useful for field detection and field linking; for example, entries in the 'DISCHARGE MEDICATIONS' section are more likely to appear in a list of medications than in narrative text.
The set of sections and the exact spelling of section headings vary across discharge summaries. The section segmenter uses regular expressions (e.g., '^\s*([A-Z\s]+):', which matches a line starting with a sequence of capitalized words followed by a colon) to collect potential section headings from the training data, and the headings whose frequencies are higher than a threshold are used to identify section boundaries in the discharge summaries.
3.2 Field detection
This step consists of three modules: the first module, find_name, finds medication names in a discharge summary; the second module, context_type, processes each medication name identified by find_name and determines whether the medication appears in narrative text or in a list of medications; the third module, find_others, detects the five non-name field types. For find_name and find_others, we follow the common practice of treating named-entity (NE) detection as a sequence labeling task with the BIO
tagging scheme; that is, each token in the input is tagged with B-x (beginning an NE of type x), I-x (inside an NE of type x), or O (outside any NE).
3.2.1 The find_name module
As this module identifies medication names only, the tagset under the BIO scheme has three tags: B-m for beginning of a name, I-m for inside a name, and O for outside. Various features are used for this module, which we group into four types:
• (F1) includes word n-gram features (n=1,2,3). For instance, the bigram w_{i-1} w_i looks at the current word and the previous word.
• (F2) contains features that check properties of the current word and its neighbors (e.g., the POS tag, the affixes and the length of a word, the type of section that a word appears in, whether a word is capitalized, whether a word is a number, etc.).
• (F3) checks the BIO tags of previous words.
• (F4) contains features that check whether n-grams formed by neighboring words appear as part of medication names in given medication name lists. The name lists can come from labeled training data or additional resources such as UMLS.
3.2.2 The context_type module
This module is a binary classifier which determines whether a medication name occurs in a list or narrative context. Features used by this module include the section name as identified by the preprocessing step, the number of commas and words on the current line, the position of the medication name on the current line, and the current and nearby words.
3.2.3 The find_others module
This module complements the find_name module and uses eleven BIO tags to identify five non-name fields. The feature set used in this module is very similar to the one used in find_name except that some features in (F2) and (F4) are modified to suit the needs of the non-name fields. For instance, a feature will check whether a word fits a common pattern for dosage. In addition, some features in (F2) look at the output of previous modules, e.g., the location of nearby medication names, as this information can be provided by the find_name module at test time.
3.3 Field linking
Once medication names and other fields have been found, the final step is to form entries by associating each medication name with its related fields. Our current implementation uses simple heuristics. First, we go over each non-name field and link it with the closest preceding medication name unless the distance between the non-name field and its following medication name is much shorter. Second, we assemble the (name, non-name) pairs to form medication entries with a few rules.
More information about the modules discussed in this section and features used by the modules is available in (Halgrim, 2009).
4 Experimental results
In this section, we report the performance of our system on the development set (Sections 4.1-4.3) and the test set (Section 4.4). The data sets are described in Table 2. For all the experiments in this section, unless specified otherwise, we report exact horizontal precision/recall/F-score, the primary metrics for the i2b2 challenge.
For the three modules in the field detection step, we use the Maximum Entropy (MaxEnt) learner in the Mallet package (McCallum, 2002) because, in general, MaxEnt produces good results without much parameter tuning and the training time for MaxEnt is much shorter than that of more sophisticated algorithms such as CRF (Lafferty et al., 2001).
To determine whether the difference between two systems' performances is statistically significant, we use approximate randomization tests (Noreen, 1989) as follows. Given two systems that we would like to compare, we first calculate the difference between their exact horizontal F-scores. Then two pseudo-system outputs are generated by randomly swapping (at 0.5 probability) the two system outputs for each discharge summary. If the difference between the F-scores of these pseudo-outputs is no less than the original F-score difference, a counter, cnt, is increased by one. This process is repeated n=10,000 times, and the p-value is equal to (cnt+1)/(n+1).
If the p-value is smaller than a predefined threshold (e.g., 0.05), we conclude that the difference between the two systems is statistically significant.
4.1 Performance of the whole system
4.1.1 Effect of feature sets
To test the effect of feature sets on system performance, we trained find_name with different feature sets and tested the whole system on the development set. For (F4), we used two medication name lists. The first list consists of medication names gathered from the training data. The second list includes drug names from the FDA database (www.accessdata.fda.gov/scripts/cder/ndc/). We use the second list to test whether adding features that check the information in an additional resource could improve the system performance. The results are in Table 4. For the last two rows, F1-F4a uses the first medication name list, and F1-F4b uses both lists. The F-score difference between each pair of adjacent rows is statistically significant at p≤0.01, except for the pair F1-F3 vs. F1-F4a. It is not surprising that using the first medication name list on top of F1-F3 does not improve the performance, as the same kind of information has already been captured by F1 features. The improvement of F1-F4b over F1-F4a shows that the system can easily incorporate additional resources and achieve a statistically significant (at p≤0.01) gain.
Table 4: System performance on the development set with different feature sets
4.1.2 Effect of training data size
Figure 1 shows the system performance on the development set when different portions of the training set are used for training. The curve with "+" signs represents the results for F1-F4b, and the curve with circles represents the results for F1-F4a. The figure illustrates that, as the training data size increases, the F-score with both feature sets improves. In addition, the additional resource is most helpful when the training data size is small, as indicated by the decreasing gap between the two sets of F-scores when the size of training data increases.
Figure 1: System performance on the development set with different training data sizes (Legend: ○ represents F-scores with features in F1-F4a; + represents F-scores with features in F1-F4b)
4.1.3 Pipeline vs. find_all
The current field detection step is a pipeline approach with three modules: find_name, context_type, and find_others. Having three separate modules allows each module to choose the features that are most appropriate for it. In addition, later modules can use features that check the output of the previous modules. A potential downside of the pipeline system is that the errors in the early modules would propagate to later modules. An alternative is to use a single module, find_all, to detect all six field types. Figure 2 shows the result of find_all in comparison to the result for the three-module pipeline. Both use the F1-F4b feature sets, except that the pipeline uses some features that check the output of previous modules, which are not available to find_all.
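To make the contrast between the pipeline modules and a single find_all module concrete, the sketch below builds the BIO tag inventory each configuration would need and encodes one line of text. The counts of three tags for find_name and eleven for find_others come from Section 3.2; the count of thirteen for find_all is derived here from the BIO scheme (two tags per field type plus O) and is an assumption rather than a figure from the experiments. The helper names `bio_tagset` and `encode` are hypothetical.

```python
from typing import List, Sequence, Tuple

NAME_FIELD = ["m"]
NON_NAME_FIELDS = ["do", "mo", "f", "du", "r"]


def bio_tagset(field_types: Sequence[str]) -> List[str]:
    """O plus one B- and one I- tag for every field type."""
    tags = ["O"]
    for field in field_types:
        tags += [f"B-{field}", f"I-{field}"]
    return tags


def encode(n_tokens: int, spans: Sequence[Tuple[str, int, int]]) -> List[str]:
    """Turn (field, first token, last token) spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for field, first, last in spans:
        tags[first] = f"B-{field}"
        for i in range(first + 1, last + 1):
            tags[i] = f"I-{field}"
    return tags


find_name_tags = bio_tagset(NAME_FIELD)                    # 3 tags
find_others_tags = bio_tagset(NON_NAME_FIELDS)             # 11 tags
find_all_tags = bio_tagset(NAME_FIELD + NON_NAME_FIELDS)   # 13 tags (derived)

# Line 65 of Table 1, tagged the way a single find_all module would see it.
tokens = ["Flomax", "(", "Tamsulosin", ")", "0.4", "mg,", "po,", "qd,."]
spans = [("m", 0, 3), ("do", 4, 5), ("mo", 6, 6), ("f", 7, 7)]
for token, tag in zip(tokens, encode(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

In the pipeline, the same line would be tagged twice: once by find_name using only the B-m, I-m, and O tags, and once by find_others using the eleven non-name tags, with the name positions available as features in the second pass.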
Figure 2: Pipeline vs. find_all for field detection (Legend: ○ represents F-scores with find_all; + represents F-scores with the three-module pipeline)
Interestingly, when 10% of the training set is used for training, find_all has a higher F-score than the pipeline approach, although the difference is not statistically significant at p≤0.05. As more data is used for training, the pipeline outperforms find_all, and when at least 50% of the training data is used, the difference between the two is statistically significant at p≤0.05. One possible explanation for this phenomenon is that as more training data becomes available, the early modules in the pipeline make fewer errors; as a result, the disadvantage of the pipeline approach caused by error propagation is outweighed by the advantage that the later modules in the pipeline can use features that check the output of the earlier modules.
4.2 Performance of the field detection step
Table 5 shows the exact precision/recall/F-score for identifying the six field types, using all the training data, F1-F4b features, and the pipeline approach for field detection. A span in the system output exactly matches a span in the gold standard if the two spans are identical and have the same field type. Among the six fields, the results for duration and reason are the lowest. That is because duration and reason are longer phrases than the other four field types and there are fewer strong, reliable cues to signal their presence.
Table 5: The performance (exact precision/recall/F-score) of field detection on the development set.
When making the narrative/list distinction, the accuracy of our context_type module is 95.4%. In contrast, the baseline, which treats each medication name as occurring in a list context, is less accurate.
4.3 Performance of the field linking step
In order to evaluate the field linking step, we generated a list of (name, non-name) pairs from the gold standard, where the name and non-name fields appear in the same entry in the gold standard. We then compared these pairs with the ones produced by the field linking step and calculated precision/recall/F-score. Table 6 shows the results of two experiments: in the cheating experiment, the input to the field linking step is the fields from the gold standard; in the non-cheating experiment, the input is the output of the field detection step. These experiments show that, while the heuristic rules used in this step work reasonably well when the input is accurate, the performance deteriorates considerably when the input is noisy, an issue we plan to address in future work.
Table 6: The performance of the field linking step on the development set (cheating: assuming perfect field input; non-cheating: using the output of the field detection step)
4.4 Results on the test data
Table 7 shows the system performance on the i2b2 official test data. The system was trained on the union of the training and development data. Compared with the top five i2b2 systems (see Table 3), our system was second only to the best i2b2 system, which used more resources and more sophisticated rules for field linking (Patrick and Li, 2009).
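Since field linking is the step that separates our system from the best i2b2 system, it may help to see how little the Section 3.3 heuristic does. The sketch below is one possible reading of it, not the implementation used in the experiments. It assumes that every field carries a single flattened token position, that the alternative to the closest preceding name is the closest following name, and that "much shorter" means less than a fixed fraction (`ratio`, a hypothetical parameter) of the preceding distance. Linking a field to the following name when no name precedes it is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Field:
    kind: str   # "m" for a medication name, else "do", "mo", "f", "du", or "r"
    text: str
    pos: int    # flattened token position in the summary (an assumption)


def link_fields(
    names: List[Field], others: List[Field], ratio: float = 0.5
) -> List[Tuple[Field, Field]]:
    """Pair every non-name field with a medication name."""
    pairs = []
    for field in others:
        preceding = max((n for n in names if n.pos < field.pos),
                        key=lambda n: n.pos, default=None)
        following = min((n for n in names if n.pos > field.pos),
                        key=lambda n: n.pos, default=None)
        if preceding is None and following is None:
            continue  # no medication name to attach to
        if preceding is None:
            chosen = following
        elif following is None:
            chosen = preceding
        else:
            d_prev = field.pos - preceding.pos
            d_next = following.pos - field.pos
            # Prefer the preceding name unless the following one is much closer.
            chosen = following if d_next < ratio * d_prev else preceding
        pairs.append((chosen, field))
    return pairs


# Hypothetical positions: a dosage right after one name, a mode right before another.
names = [Field("m", "Flomax", 0), Field("m", "Nitroglycerin", 11)]
others = [Field("do", "0.4 mg", 1), Field("mo", "sublingual", 10)]
for name, field in link_fields(names, others):
    print(f"{name.text} <- {field.kind}: {field.text}")
```

Here the dosage attaches to the preceding name, while the mode "sublingual" attaches to the name that follows it, as in the narrative example of Table 1. A distance-only rule like this has no way to recover from a missed or spurious name, which is consistent with the drop observed in the non-cheating experiment.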
Table 7: System performance on the test set when trained on the union of the training and the development sets with F1-F4b features.
5 Conclusion
We present a hybrid system for medication extraction. The system is built around a pipeline of cascading statistical classifiers for field detection. It achieves good performance that is comparable to the top systems in the i2b2 challenge, and incorporating additional resources as features further improves the performance. In the future, we plan to replace the current rule-based field linking module with a statistical module to improve accuracy.
Acknowledgments
This work was supported in part by US DOD grant N00244-091-0081 and Grants ... T15LM007442-06. We would also like to thank three anonymous reviewers for helpful comments.
References
Trond Grenager, Dan Klein, and Christopher Manning. 2005. Unsupervised learning of field segmentation models for information extraction. In Proc. of ACL-2005.
Scott Halgrim. 2009. A Pipeline Machine Learning Approach to Biomedical Information Extraction. Master's Thesis. University of Washington.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML-2001).
Matthew A. Levin, Marina Krol, Ankur M. Doshi, and David L. Reich. 2007. Extraction and mapping of drug names from free text to a standardized nomenclature. AMIA Symposium Proceedings.
S. Meystre, G. Savova, K. Kipper-Schuler, and J. Hurdle. 2008. Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. IMIA Yearbook of Medical Informatics, Methods Inf Med 2008; 47 Suppl 1:128-
Andrew McCallum. 2002. Mallet: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu
Eric W. Noreen. 1989. Computer intensive methods for testing hypotheses: an introduction. John Wiley & Sons.
Jon Patrick and Min Li. 2009. A Cascade Approach to Extract Medication Event (i2b2 challenge 2009). Presentation at the Third i2b2 Workshop, November 2009, San Francisco, CA.
Hoifung Poon and Pedro Domingos. 2007. Joint inference in information extraction. In Proc. of the National Conference on Artificial Intelligence (AAAI).
Benjamin Rozenfeld and Ronen Feldman. 2008. Self-supervised relation extraction from the web. Knowledge and Information Systems, 17(1):17-33.
Özlem Uzuner, Imre Solti, and Eithon Cadag. 2010a. Extracting Medication Information from Clinical Text. Submitted to JAMIA.
Özlem Uzuner, Imre Solti, Fei Xia, and Eithon Cadag. 2010b. Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Submitted to JAMIA.
Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. 2004. An integrated, conditional model of information extraction and coreference with application to citation matching. In Proc. of the 20th Conference on Uncertainty in AI (UAI-2004).