Resource Library

Natural Language Processing (NLP) Tools


Apache cTAKES Webinar: August 29, 2013

Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for extracting information from the clinical free text of electronic medical records. The development team is located at Mayo Clinic and Boston Children’s Hospital.

MedEx Webinar: September 5, 2013

MedEx-UIMA is an open-source tool for extracting medication and signature information from clinical text. It is a Java implementation, based on the UIMA framework, of the existing MedEx system (written in Python). The development team is located at Vanderbilt University and the University of Texas Health Science Center at Houston.

Natural Language Processing (NLP) Survey of Tools & Resources


General frameworks

  • Apache Unstructured Information Management Architecture (UIMA): Java framework for developing NLP pipelines, released under the Apache 2 license. UIMA provides Eclipse plug-ins for developing and testing UIMA-based applications. UIMA wrappers exist for a variety of other Java-based NLP component libraries.
  • General Architecture for Text Engineering (GATE): Java framework for developing NLP pipelines, developed at the University of Sheffield (UK). GATE includes a number of rule-based NLP components, and GATE wrappers exist for a variety of other Java-based NLP libraries.
  • Natural Language Toolkit (NLTK): A Python library for developing NLP applications. This framework is accompanied by a book, which is useful for pedagogical purposes.

NLP components, pipelines, and tools

  • clinical Text Analysis and Knowledge Extraction System (cTAKES): cTAKES is built on top of Apache UIMA and is composed of sets of UIMA processors that are assembled into pipelines. Some of the processors are wrappers for Apache OpenNLP components, and some are custom built. cTAKES was developed at the Mayo Clinic and is distributed by the Open Health NLP Consortium.
  • Health Information Text Extraction (HITEx): HITEx was developed as part of the i2b2 project. It is a rule-based NLP pipeline based on the GATE framework.
  • Computational Language and Education Research toolkit (cleartk): cleartk has been developed at the University of Colorado at Boulder, and provides a framework for developing statistical NLP components in Java. It is built on top of Apache UIMA.
  • NegEx (NegEx): NegEx is a tool developed at the University of Pittsburgh to detect negated terms in clinical text. The system uses trigger terms to identify likely negation within a sentence.
  • ConText (ConText): ConText is an extension of NegEx, also developed at the University of Pittsburgh. In addition to detecting negated concepts, ConText determines the temporality of a concept (recent, historical, or hypothetical) and its experiencer (the patient or someone else).
  • National Library of Medicine’s MetaMap (MetaMap): MetaMap is a comprehensive concept tagging system built on top of the Unified Medical Language System (UMLS). It requires an active UMLS Metathesaurus License Agreement for use. The program can run on its own, although some work has been done to create a UIMA wrapper that allows MetaMap to act as a UIMA component.
  • MedEx – a tool for extracting medication information from clinical text (MedEx): MedEx processes free-text clinical records to recognize medication names and signature information, such as drug dose, frequency, route, and duration. Use is free with a UMLS license. It is a standalone application for Linux and Windows.
  • SecTag – section tagging hierarchy (SecTag): SecTag recognizes note section headers using NLP, Bayesian, spelling correction, and scoring techniques. The link here includes the SQL and CSV files for the section terminologies. Use is free with either a UMLS or LOINC license.
  • Stanford Named Entity Recognizer (NER): Stanford’s NER is a Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition in English and German.
  • Stanford CoreNLP (CoreNLP): Stanford CoreNLP is an integrated suite of natural language processing tools for English in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference.
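
The trigger-term approach described for NegEx above can be illustrated with a minimal Python sketch. This is not the NegEx implementation: the trigger phrases, the scope terminators, the five-token window, and the function names are all assumptions chosen for illustration, and the real tool uses a far larger trigger list and additional rules.

```python
import re

# Illustrative trigger phrases only (assumption); NegEx ships a much larger list.
PRE_NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of"]
# Words that end a negation scope in this sketch (assumption).
SCOPE_TERMINATORS = {"but", "however", "although"}
# Number of tokens after a trigger that are treated as in scope (assumption).
WINDOW = 5


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_negated(sentence, concept):
    """Return True if the concept falls inside the scope of a negation trigger."""
    tokens = tokenize(sentence)
    concept_tokens = tokenize(concept)
    n = len(concept_tokens)
    # Every position where the concept occurs in the sentence.
    positions = [i for i in range(len(tokens) - n + 1)
                 if tokens[i:i + n] == concept_tokens]
    if not positions:
        return False
    for trigger in PRE_NEGATION_TRIGGERS:
        trig = trigger.split()
        m = len(trig)
        for t in range(len(tokens) - m + 1):
            if tokens[t:t + m] != trig:
                continue
            scope_start = t + m
            scope_end = min(len(tokens), scope_start + WINDOW)
            # Cut the scope short at the first terminator word.
            for k in range(scope_start, scope_end):
                if tokens[k] in SCOPE_TERMINATORS:
                    scope_end = k
                    break
            if any(scope_start <= p and p + n <= scope_end for p in positions):
                return True
    return False
```

For example, with the sentence "The patient denies chest pain but reports nausea.", this sketch treats "chest pain" as negated and "nausea" as not negated, because the word "but" closes the scope opened by "denies".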

Software and Tools used by eMERGE network sites

  • Cincinnati Children’s Hospital: a custom pipeline built around cTAKES
  • Mayo Clinic: the prevalent tool is the UIMA-based cTAKES. The latest open-source tools being developed at Mayo Clinic in collaboration with the SHARP consortium can be found
  • Northwestern University: Most of our NLP applications have been based on Apache UIMA, though we have also used HITEx (GATE). Our data processing workflows use the Konstanz Information Miner (KNIME) to extract data from our Enterprise Data Warehouse, including both structured data and text. Textual data are fed into a custom KNIME node that executes a UIMA-based NLP application to extract information for a particular phenotype. The output of processing the text is structured data, which are merged with other structured data and fed into a phenotype classification algorithm.
  • Vanderbilt University: We use a combination of SecTag, MedEx, and the KnowledgeMap Concept Identifier, which maps free text to UMLS concepts and includes some statistical disambiguation. We have developed SOAP and REST web service versions of KnowledgeMap, SecTag, and NegEx that have been integrated with KNIME.
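
The Northwestern workflow above (text turned into structured data, merged with other structured data, then classified) can be sketched in a few lines of Python. Everything here is hypothetical: the keyword match stands in for a UIMA-based NLP application, the field names and the threshold are made up, and the two patient records are toy data rather than anything from an eMERGE site.

```python
def extract_from_text(note):
    """Stand-in for the NLP application: turn free text into structured fields."""
    text = note.lower()
    return {"mentions_diabetes": "diabetes" in text}


def merge_records(structured, extracted):
    """Combine warehouse fields with the fields extracted from text."""
    merged = dict(structured)
    merged.update(extracted)
    return merged


def classify_phenotype(record):
    """Hypothetical rule: a text mention plus a lab value at or above a cutoff."""
    return record["mentions_diabetes"] and record["hba1c"] >= 6.5


# Toy records (made up) holding one structured value and one clinical note each.
patients = [
    {"patient_id": 1, "hba1c": 7.2, "note": "History of type 2 diabetes."},
    {"patient_id": 2, "hba1c": 5.4, "note": "No significant medical history."},
]

results = {}
for p in patients:
    structured = {"patient_id": p["patient_id"], "hba1c": p["hba1c"]}
    merged = merge_records(structured, extract_from_text(p["note"]))
    results[p["patient_id"]] = classify_phenotype(merged)
```

Keeping extraction, merging, and classification as separate steps mirrors the node-by-node structure of a workflow tool, so any one step can be replaced without touching the others.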