Extracting Medication Information from Discharge Summaries

Scott Halgrim, Fei Xia, Imre Solti, Eithon Cadag
University of Washington, Seattle, WA 98195, USA

Özlem Uzuner
University of Albany, SUNY, Albany, NY 12222, USA

Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents, pages 61–67, Los Angeles, California, June 2010. © 2010 Association for Computational Linguistics

Abstract

Extracting medication information from clinical records has many potential applications and was the focus of the i2b2 challenge in 2009. We present a hybrid system, consisting of machine learning and rule-based modules, for medication information extraction. With only a handful of template-filling rules, the system's core is a cascade of statistical classifiers for field detection. It achieved good performance that was comparable to the top systems in the i2b2 challenge, demonstrating that a heavily statistical approach can perform as well as or better than systems with many sophisticated rules. The system can easily incorporate additional resources such as medication name lists to further improve performance.

1 Introduction

Narrative clinical records store patient medical information, and extracting this information from these narratives supports data management and enables many applications (Levin et al., 2007). Informatics for Integrating the Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing based at Partners HealthCare System, and it has organized annual NLP shared tasks and challenges since 2006. The Third i2b2 Workshop on NLP Challenges for Clinical Records in 2009 studied the extraction of medication information from discharge summaries, a task we refer to as the i2b2 challenge in this paper.

In the past decade, there has been extensive research on information extraction in both the general and biomedical domains (Wellner et al., 2004; Grenager et al., 2005; Poon and Domingos, 2007; Meystre et al., 2008; Rozenfeld and Feldman, 2008). Interestingly, despite the recent prevalence of statistical approaches in most NLP tasks (including information extraction), most of the systems developed for the i2b2 challenge were rule-based. In this paper we present our hybrid system, whose core is a cascade of statistical classifiers that identify medication fields such as medication names and dosages. The fields are then assembled to form medication entries. While our system did not participate in the i2b2 challenge (as we were part of the organizing team), it achieved good results that matched the top i2b2 systems.

2 The i2b2 Challenge

This section provides a brief introduction to the i2b2 challenge.

2.1 The task

The i2b2 challenge studied the automatic extraction of information corresponding to the following fields from hospital discharge summaries (Uzuner et al., 2010a): names of medications (m) taken by the patient, dosages (do), modes (mo), frequencies (f), durations (du), and reasons (r) for taking these medications. We refer to the medication field as the name field and the other five fields as the non-name fields. All non-name fields correspond to some name field mention; if they were specified within a two-line window of that name mention, the i2b2 challenge required such fields to be linked to the name field to form an entry. For each entry, a system must determine whether the entry appeared in a list of medications or in narrative text.

Table 1 shows an excerpt from a discharge summary and the corresponding entries in the gold standard. The first entry appears in narrative text, and the second in a list of medication information.

Excerpt of Discharge Summary:
55 the patient noted that he had a recurrence of this
56 vague chest discomfort as he was sitting and
57 talking to friends. He took a sublingual
58 Nitroglycerin without relief.

65 Flomax ( Tamsulosin ) 0.4 mg, po, qd,.

Gold standard:
m="Nitroglycerin" 58:0 58:0 do="nm" mo="sublingual" 57:6 57:6 f="nm" du="nm" r="vague chest discomfort" 56:0 56:2
m="flomax ( tamsulosin )" 65:0 65:3 do="0.4 mg" 65:4 65:5 mo="po" 65:6 65:6 f="qd" 65:7 65:7 du="nm" r="nm" ln="list"

Table 1: A sample discharge summary excerpt and the corresponding entries in the gold standard. The fields inside an entry are separated by " ". Each field is represented by the string and its position (i.e., "line number: token number" offsets). "nm" means the field value is not mentioned for this medication.

2.2 Data Sets

The i2b2 challenge used a total of 1243 discharge summaries:

• 696 of these summaries were released to participants for system development, and the i2b2 organizing team provided the gold standard annotation for 17 of them.

• Participating teams could choose to annotate more files themselves. The University of Sydney team annotated 145 out of the 696 summaries (including re-annotating 14 of the 17 files annotated by the i2b2 organizing team) and generously shared their annotations with i2b2 after the challenge for future research. We obtained and used 110 of their annotations as our training set and the remaining 35 summaries as our development set.

• The participating teams produced system outputs for 547 discharge summaries set aside for testing. After the challenge, 251 of these summaries were annotated by the challenge participants, and these 251 summaries formed the final test set (Uzuner et al., 2010b).

The sizes of the data sets used in our experiments are shown in Table 2. The training and development sets were created by the University of Sydney, and the test data is the i2b2 official challenge test set. The average numbers of entries and fields vary among the three sets because the summaries in the test set were chosen randomly from the 547 summaries, whereas the University of Sydney team annotated the longest summaries.

Table 2: The data sets used in our experiments. The numbers in parentheses are the average numbers of entries or fields per discharge summary.

2.3 Additional resources

Besides the training data, the participating teams were allowed to use any additional tools and resources that they had access to, including resources not available to the public. All challenge participants used additional resources such as UMLS, but the exact resources used varied from team to team. Therefore, the challenge was similar to the so-called open-track challenge in the general NLP field, as opposed to a closed-track challenge that could require that all the participants use only the list of resources specified by the challenge organizers.

2.4 Evaluation metrics

The i2b2 challenge used two sets of evaluation metrics: horizontal and vertical metrics. Horizontal metrics measured system performance at the entry level, whereas vertical metrics measured system performance at the field level. Both sets of metrics compared the system output and the gold standard at the span level for exact match and at the token level for inexact match, using precision, recall, and F-score (Uzuner et al., 2010a). The primary metric for the challenge is exact horizontal F-score, which is the metric we use to evaluate our system.
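As a rough illustration of exact-match scoring, the sketch below computes precision, recall, and a balanced F-score over sets of spans, where a system span counts as correct only if its field type and offsets are identical to those of a gold span. The `Span` tuple, the function name `exact_prf`, and the choice of the balanced (harmonic-mean) F-score are assumptions made for this example. The official i2b2 scorer, including entry-level (horizontal) matching and token-level inexact matching, is described in Uzuner et al. (2010a) and is not reproduced here. The gold spans come from the first entry of Table 1; the system spans are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical span representation: (field type, start offset, end offset),
# with offsets written as "line:token" strings as in Table 1.
Span = Tuple[str, str, str]


def exact_prf(system: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and balanced F-score over two span sets."""
    sys_set, gold_set = set(system), set(gold)
    matched = len(sys_set & gold_set)  # identical field type and offsets
    precision = matched / len(sys_set) if sys_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Gold spans for the first entry in Table 1.
gold_spans = [("m", "58:0", "58:0"), ("mo", "57:6", "57:6"), ("r", "56:0", "56:2")]
# Hypothetical system output: the reason span starts one token too late.
system_spans = [("m", "58:0", "58:0"), ("mo", "57:6", "57:6"), ("r", "56:1", "56:2")]

p, r, f = exact_prf(system_spans, gold_spans)
```

Here two of the three system spans match exactly, so the truncated reason span costs both precision and recall even though it overlaps the gold span; an inexact, token-level metric would give partial credit for it.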
2.5 Participating systems

Twenty teams participated in the challenge. Fifteen teams used rule-based approaches, and the rest used statistical or hybrid approaches. The performances of the top five systems are shown in Table 3. Among them, only the top system, developed by the University of Sydney, used a hybrid approach, whereas the rest were rule-based.

Table 3: The performance (exact horizontal precision/recall/F-score) of the top five i2b2 systems on the test set.

3 System description

We developed a hybrid system with three processing steps: (1) a preprocessing step that finds section boundaries, (2) a field detection step that identifies the six fields, and (3) a field linking step that links fields together to form entries. The second step is a statistical system, whereas the other two steps are rule-based. The second step was the main focus of this study.

3.1 Preprocessing

In addition to common processing steps such as part-of-speech (POS) tagging, our preprocessing step includes a section segmenter that breaks discharge summaries into sections. Discharge summaries tend to consist of sections such as 'ADMIT DIAGNOSIS', 'PAST MEDICAL HISTORY', and 'DISCHARGE MEDICATIONS'. Knowing section boundaries is important for the i2b2 challenge because, according to the annotation guidelines for creating the gold standard, medications occurring under certain sections (e.g., family history and allergic reaction) should be excluded from the system output. Furthermore, knowing the types of sections could be useful for field detection and field linking; for example, entries in the 'DISCHARGE MEDICATIONS' section are more likely to appear in a list of medications than in narrative text.

The set of sections and the exact spelling of section headings vary across discharge summaries. The section segmenter uses regular expressions (e.g., '^\s*([A-Z\s]+):' -- a line starting with a sequence of capitalized words followed by a colon) to collect potential section headings from the training data, and the headings whose frequencies are higher than a threshold are used to identify section boundaries in the discharge summaries.

3.2 Field detection

This step consists of three modules: the first module, find_name, finds medication names in a discharge summary; the second module, context_type, processes each medication name identified by find_name and determines whether the medication appears in narrative text or in a list of medications; the third module, find_others, detects the five non-name field types.

For find_name and find_others, we follow the common practice of treating named-entity (NE) detection as a sequence labeling task with the BIO tagging scheme; that is, each token in the input is tagged with B-x (beginning an NE of type x), I-x (inside an NE of type x), or O (outside any NE).
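To make the BIO scheme concrete, the following minimal sketch converts token-index spans into BIO tags for line 65 of the excerpt in Table 1. The helper name `spans_to_bio` and the tuple layout are assumptions made for this illustration and are not taken from the system itself. The two calls mirror the division of labor between the modules: one tag sequence carries only the name field (m), and the other carries only non-name fields.

```python
from typing import List, Tuple

# Hypothetical span representation: (field type, first token index, last token index),
# with both indices inclusive, as in the "line:token" offsets of Table 1.
Span = Tuple[str, int, int]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Return one BIO tag per token; tokens outside every span get 'O'."""
    tags = ["O"] * num_tokens
    for field, start, end in spans:
        tags[start] = f"B-{field}"          # first token of the field
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{field}"          # remaining tokens of the field
    return tags


# Line 65 of the excerpt, split on whitespace.
tokens = ["Flomax", "(", "Tamsulosin", ")", "0.4", "mg,", "po,", "qd,."]

# Tags for the name field only (gold span 65:0 to 65:3).
name_tags = spans_to_bio(len(tokens), [("m", 0, 3)])

# Tags for the non-name fields only (dosage, mode, frequency).
other_tags = spans_to_bio(len(tokens), [("do", 4, 5), ("mo", 6, 6), ("f", 7, 7)])

for token, name_tag, other_tag in zip(tokens, name_tags, other_tags):
    print(f"{token:12s} {name_tag:5s} {other_tag}")
```

A sequence labeler then predicts one such tag per token, and contiguous B-x/I-x runs are read back as field spans.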
3.2.1 The find_name module

As this module identifies medication names only, the tagset under the BIO scheme has three tags: B-m for beginning of a name, I-m for inside a name, and O for outside. Various features are used for this module, which we group into four types:

• (F1) includes word n-gram features (n=1,2,3). For instance, the bigram w_{i-1} w_i looks at the current word and the previous word.

• (F2) contains features that check properties of the current word and its neighbors (e.g., the POS tag, the affixes and the length of a word, the type of section that a word appears in, whether a word is capitalized, whether a word is a number, etc.).

• (F3) checks the BIO tags of previous words.

• (F4) contains features that check whether n-grams formed by neighboring words appear as part of medication names in given medication name lists. The name lists can come from labeled training data or additional resources such as UMLS.

3.2.2 The context_type module

This module is a binary classifier which determines whether a medication name occurs in a list or narrative context. Features used by this module include the section name as identified by the preprocessing step, the number of commas and words on the current line, the position of the medication name on the current line, and the current and nearby words.

3.2.3 The find_others module

This module complements the find_name module and uses eleven BIO tags to identify five non-name fields. The feature set used in this module is very similar to the one used in find_name except that some features in (F2) and (F4) are modified to suit the needs of the non-name fields. For instance, a feature will check whether a word fits a common pattern for dosage. In addition, some features in (F2) look at the output of previous modules: e.g., the location of nearby medication names, as this information can be provided by the find_name module at test time.

3.3 Field linking

Once medication names and other fields have been found, the final step is to form entries by associating each medication name with its related fields. Our current implementation uses simple heuristics. First, we go over each non-name field and link it with the closest preceding medication name unless the distance between the non-name field and its closest following medication name is much shorter. Second, we assemble the (name, non-name) pairs to form medication entries with a few rules.

More information about the modules discussed in this section and features used by the modules is available in (Halgrim, 2009).

4 Experimental results

In this section, we report the performance of our system on the development set (Sections 4.1-4.3) and the test set (Section 4.4). The data sets are described in Table 2. For all the experiments in this section, unless specified otherwise, we report exact horizontal precision/recall/F-score, the primary metrics for the i2b2 challenge.

For the three modules in the field detection step, we use the Maximum Entropy (MaxEnt) learner in the Mallet package (McCallum, 2002) because, in general, MaxEnt produces good results without much parameter tuning and training MaxEnt is much faster than training more sophisticated algorithms such as CRF (Lafferty et al., 2001).

To determine whether the difference between two systems' performances is statistically significant, we use approximate randomization tests (Noreen, 1989) as follows. Given two systems that we would like to compare, we first calculate the difference between exact horizontal F-scores. Then two pseudo-system outputs are generated by randomly swapping (at 0.5 probability) the two system outputs for each discharge summary. If the difference between F-scores of these pseudo-outputs is no less than the original F-score difference, a counter, cnt, is increased by one. This process was repeated n=10,000 times, and the p-value of the significance is equal to (cnt+1)/(n+1). If the p-value is smaller than a predefined threshold (e.g., 0.05), we conclude that the difference between the two systems is statistically significant.
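The approximate randomization procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: each system's output is assumed to be summarized per discharge summary as a tuple of counts (matched, system total, gold total), the F-score is computed from counts pooled over all summaries, and the comparison uses the absolute F-score difference. The names `Counts`, `f_score`, and `approx_randomization`, as well as the fixed random seed, are hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical per-summary summary of one system's output:
# (matched entries, entries in system output, entries in gold standard).
Counts = Tuple[int, int, int]


def f_score(docs: List[Counts]) -> float:
    """Balanced F-score from counts pooled over all discharge summaries."""
    matched = sum(d[0] for d in docs)
    if matched == 0:
        return 0.0
    precision = matched / sum(d[1] for d in docs)
    recall = matched / sum(d[2] for d in docs)
    return 2 * precision * recall / (precision + recall)


def approx_randomization(a: List[Counts], b: List[Counts],
                         n: int = 10_000, seed: int = 0) -> float:
    """Return the p-value (cnt + 1) / (n + 1) for the F-score difference of a and b."""
    assert len(a) == len(b), "both systems must cover the same summaries"
    rng = random.Random(seed)
    observed = abs(f_score(a) - f_score(b))
    cnt = 0
    for _ in range(n):
        pseudo_a, pseudo_b = [], []
        for doc_a, doc_b in zip(a, b):
            # Swap the two systems' outputs for this summary with probability 0.5.
            if rng.random() < 0.5:
                doc_a, doc_b = doc_b, doc_a
            pseudo_a.append(doc_a)
            pseudo_b.append(doc_b)
        if abs(f_score(pseudo_a) - f_score(pseudo_b)) >= observed:
            cnt += 1
    return (cnt + 1) / (n + 1)


# Toy data: system A is perfect on 20 summaries, system B finds 1 of 10 entries in each.
system_a = [(10, 10, 10)] * 20
system_b = [(1, 10, 10)] * 20
p_value = approx_randomization(system_a, system_b, n=1000)
```

Swapping whole summaries rather than individual entries keeps each summary's outputs together, which matches the per-summary swapping described in the text. Adding one to both the counter and the number of trials keeps the estimated p-value strictly positive.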
4.1 Performance of the whole system

4.1.1 Effect of feature sets

To test the effect of feature sets on system performance, we trained find_name and find_others with different feature sets and tested the whole system on the development set. For (F4), we used two medication name lists. The first list consists of medication names gathered from the training data. The second list includes drug names from the FDA database. We use the second list to test whether adding features that check the information in an additional resource could improve the system performance.

The results are in Table 4. For the last two rows, F1-F4a uses the first medication name list, and F1-F4b uses both lists. The F-score difference between all adjacent rows is statistically significant at p≤0.01, except for the pair F1-F3 vs. F1-F4a. It is not surprising that using the first medication name list on top of F1-F3 does not improve the performance, as the same kind of information has already been captured by F1 features. The improvement of F1-F4b over F1-F4a shows that the system can easily incorporate additional resources and achieve a statistically significant (at p≤0.01) gain.

Table 4: System performance on the development set with different feature sets.

4.1.2 Effect of training data size

Figure 1 shows the system performance on the development set when different portions of the training set are used for training. The curve with "+" signs represents the results for F1-F4b, and the curve with circles represents the results for F1-F4a. The figure illustrates that, as the training data size increases, the F-score with both feature sets improves. In addition, the additional resource is most helpful when the training data size is small, as indicated by the decreasing gap between the two sets of F-scores when the size of training data increases.

Figure 1: System performance on the development set with different training data sizes (Legend: ○ represents F-scores with features in F1-F4a; + represents F-scores with features in F1-F4b)

4.1.3 Pipeline vs. find_all

The current field detection step is a pipeline approach with three modules: find_name, context_type, and find_others. Having three separate modules allows each module to choose the features that are most appropriate for it. In addition, later modules can use features that check the output of the previous modules. A potential downside of the pipeline system is that the errors in the early modules would propagate to later modules. An alternative is to use a single module, find_all, to detect all six field types.

Figure 2 shows the result of find_all in comparison to the result for the three-module pipeline. Both use the F1-F4b feature sets, except that find_others uses some features that check the output of previous modules, which are not available to find_all.

Figure 2: Pipeline vs. find_all for field detection (Legend: ○ represents F-scores with find_all; + represents F-scores with the three-module pipeline)

Interestingly, when 10% of the training set is used for training, find_all has a higher F-score than the pipeline approach, although the difference is not statistically significant at p≤0.05. As more data is used for training, the pipeline outperforms find_all, and when at least 50% of the training data is used, the difference between the two is statistically significant at p≤0.05. One possible explanation for this phenomenon is that as more training data becomes available, the early modules in the pipeline make fewer errors; as a result, the disadvantage of the pipeline approach caused by error propagation is outweighed by the advantage that the later modules in the pipeline can use features that check the output of the earlier modules.

4.2 Performance of the field detection step

Table 5 shows the exact precision/recall/F-score on identifying the six field types, using all the training data, F1-F4b features, and the pipeline approach for field detection. A span in the system output exactly matches a span in the gold standard if the two spans are identical and have the same field type. Among the six fields, the results for duration and reason are the lowest. That is because duration and reason are longer phrases than the other four field types and there are fewer strong, reliable cues to signal their presence.

When making the narrative/list distinction, the accuracy of our context_type module is 95.4%. In contrast, the accuracy of the baseline (which treats each medication name as in a list context) is only [...].

Table 5: The performance (exact precision/recall/F-score) of field detection on the development set.

4.3 Performance of the field linking step

In order to evaluate the field linking step, we generated a list of (name, non-name) pairs from the gold standard, where the name and non-name fields appear in the same entry in the gold standard. We then compared these pairs with the ones produced by the field linking step and calculated precision/recall/F-score. Table 6 shows the result of two experiments: in the cheating experiment, the input to the field linking step is the fields from the gold standard; in the non-cheating experiment, the input is the output of the field detection step. These experiments show that, while the heuristic rules used in this step work reasonably well when the input is accurate, the performance deteriorates considerably when the input is noisy, an issue we plan to address in future work.

               Precision   Recall
Non-cheating   87.4        [...]

Table 6: The performance of the field linking step on the development set (cheating: assuming perfect field input; non-cheating: using the output of the field detection step)

4.4 Results on the test data

Table 7 shows the system performance on the i2b2 official test data. The system was trained on the union of the training and development data. Compared with the top five i2b2 systems (see Table 3), our system was second only to the best i2b2 system, which used more resources and more sophisticated rules for field linking (Patrick and Li, 2009).

             Precision   Recall   F-score
Horizontal   88.6        [...]    [...]
Table 7: System performance on the test set when trained on the union of the training and the development sets with F1-F4b features.

5 Conclusion

We present a hybrid system for medication extraction. The system is built around a pipeline of cascading statistical classifiers for field detection. It achieves good performance that is comparable to the top systems in the i2b2 challenge, and incorporating additional resources as features further improves the performance. In the future, we plan to replace the current rule-based field linking module with a statistical module to improve accuracy.

Acknowledgments

This work was supported in part by US DOD grant N00244-091-0081 and Grants 1K99LM010227-0110 and T15LM007442-06. We would also like to thank three anonymous reviewers for helpful comments.

References

Trond Grenager, Dan Klein, and Christopher Manning. 2005. Unsupervised learning of field segmentation models for information extraction. In Proc. of ACL-2005.

Scott Halgrim. 2009. A Pipeline Machine Learning Approach to Biomedical Information Extraction. Master Thesis. University of Washington.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML-2001).

Matthew A. Levin, Marina Krol, Ankur M. Doshi, and David L. Reich. 2007. Extraction and mapping of drug names from free text to a standardized nomenclature. AMIA Symposium Proceedings, pp 438-442.

Andrew McCallum. 2002. Mallet: A Machine Learning for Language Toolkit.

S. Meystre, G. Savova, K. Kipper-Schuler, and J. Hurdle. 2008. Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. IMIA Yearbook of Medical Informatics, Methods Inf Med 2008; 47 Suppl 1:128-

Eric W. Noreen. 1989. Computer intensive methods for testing hypotheses: an introduction. John Wiley & Sons.

Jon Patrick and Min Li. 2009. A Cascade Approach to Extract Medication Event (i2b2 challenge 2009). Presentation at the Third i2b2 Workshop, November 2009, San Francisco, CA.

Hoifung Poon and Pedro Domingos. 2007. Joint inference in information extraction. In Proc. of the National Conference on Artificial Intelligence (AAAI).

Benjamin Rozenfeld and Ronen Feldman. 2008. Self-supervised relation extraction from the web. Knowledge and Information Systems, 17(1):17-33.

Özlem Uzuner, Imre Solti, and Eithon Cadag. 2010a. Extracting Medication Information from Clinical Text. Submitted to JAMIA.

Özlem Uzuner, Imre Solti, Fei Xia, and Eithon Cadag. 2010b. Community Annotation Experiment for Ground Truth Generation for the i2b2 Medication Challenge. Submitted to JAMIA.

Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. 2004. An integrated, conditional model of information extraction and coreference with application to citation matching. In Proc. of the 20th Conference on Uncertainty in AI (UAI-2004).

