Adapting a Machine Learning Algorithm to Track Symptoms of COVID-19

May 22, 2020

  • 00:00I'd like to introduce our next speaker,
  • 00:02Doctor Julie Womack.
  • 00:03Doctor Womack is an associate
  • 00:05professor at the Yale School of
  • 00:07Nursing and a health sciences
  • 00:09researcher at the West Haven VA.
  • 00:11She received her PhD in nursing from Yale
  • 00:13University and completed a postdoctoral
  • 00:15fellowship in informatics at the VA.
  • 00:18Doctor Womack, thank you for being here.
  • 00:22Thank you, and can everyone hear me?
  • 00:25I hope, um.
  • 00:27So I'll be talking with you about
  • 00:29the work that colleagues of mine
  • 00:31and I are doing to adapt an NLP
  • 00:34pipeline and machine learning algorithm
  • 00:36to identify symptoms of COVID-19.
  • 00:40Next. Symptoms are of crucial
  • 00:42importance to patients.
  • 00:44They are how an individual
  • 00:46experiences their illness.
  • 00:47For providers, symptoms are markers
  • 00:50that can help to identify disease or
  • 00:53to develop a list of differentials.
  • 00:56Recognition of symptoms has been
  • 00:58an important component of COVID-19.
  • 01:00The symptoms considered as markers
  • 01:02of the disease have changed over time.
  • 01:05Initially it was fever,
  • 01:06cough and shortness of breath.
  • 01:08These are still considered to
  • 01:10be the primary symptoms,
  • 01:12but this list has expanded to include
  • 01:15others such as nasal congestion,
  • 01:17sore throat, anosmia,
  • 01:19dysgeusia, headache,
  • 01:21dizziness, fatigue, muscle aches,
  • 01:23chills and GI symptoms,
  • 01:24including nausea, loss of appetite,
  • 01:26vomiting and diarrhea. Next.
  • 01:30In both SARS and MERS, symptom-based
  • 01:33case detection and subsequent testing
  • 01:35to guide isolation and quarantine
  • 01:36were key, and there was minimal
  • 01:39evidence that asymptomatic cases were
  • 01:41important routes of transmission.
  • 01:43With COVID-19 there is potentially
  • 01:45a sizable percentage of cases
  • 01:47that are asymptomatic,
  • 01:48and these have been shown to be
  • 01:50important players in viral transmission.
  • 01:53So symptoms alone are insufficient to
  • 01:55identify cases of COVID-19 or even
  • 01:58to identify those who should be tested. Next.
  • 02:01Furthermore,
  • 02:01most of the symptoms experienced with
  • 02:04COVID-19 are not unique to COVID-19,
  • 02:07but rather are shared by
  • 02:10other respiratory viruses
  • 02:11and health conditions.
  • 02:12Here's a graph of COVID-19 testing within
  • 02:15the VA Connecticut health care system.
  • 02:18The number of tests is
  • 02:21noted on the vertical axis.
  • 02:23Dates are on the horizontal.
  • 02:25The green line represents
  • 02:27negative COVID-19 tests.
  • 02:28The red line is positive tests,
  • 02:30and the blue line represents
  • 02:32those with pending results.
  • 02:34So out of the 1,600 tests done through May 7th,
  • 02:38214, or 12%, were positive
  • 02:40for COVID-19.
  • 02:41These results suggest that the majority
  • 02:44of those with symptoms may not,
  • 02:46in fact, have COVID-19. Next.
  • 02:52But despite the limitations to using
  • 02:54symptoms to diagnose COVID-19 or to
  • 02:57identify those who need to be tested
  • 02:59and quarantined, symptoms are still an
  • 03:01important component of the pandemic.
  • 03:03I work with a number of investigators
  • 03:06who are interested in using VA electronic
  • 03:08health record, or EHR, data to study
  • 03:11different aspects of symptoms in COVID-19.
  • 03:13The first step for all of these projects
  • 03:15is to develop a reliable approach for
  • 03:18identifying these symptoms in the EHR.
  • 03:21A number of possible approaches exist.
  • 03:23These include looking at problem lists,
  • 03:25ICD codes, and inferring symptoms
  • 03:28from prescription data. However,
  • 03:30all of these approaches underestimate
  • 03:32the number and type of symptoms
  • 03:34discussed at a visit.
  • 03:36Most documentation of symptoms
  • 03:38takes place in clinical notes.
  • 03:40These documented symptoms can be extracted
  • 03:42from text notes using natural language
  • 03:44processing and machine learning algorithms,
  • 03:47and then converted into structured data
  • 03:49for the purposes of analysis.
  • 03:53So today I'm gonna talk a bit about the
  • 03:55symptom extractor pipeline that we will
  • 03:57adapt to identify COVID-19 symptoms in VA
  • 03:59clinical notes.
  • 04:00I'm going to talk a bit about what
  • 04:02that adaptation process will look like,
  • 04:04and then I'm going to briefly describe
  • 04:07projects that will build on this work. Next.
  • 04:10The symptom extractor pipeline that we
  • 04:12will use was originally developed by
  • 04:14Guy Divita and colleagues from the VA
  • 04:17Salt Lake City health care system.
  • 04:19Next. It is a UIMA natural language
  • 04:23processing pipeline that was assembled
  • 04:26using V3NLP Framework components.
  • 04:29Both UIMA and V3NLP
  • 04:31are open source software.
  • 04:33UIMA, short for Unstructured
  • 04:36Information Management Architecture, is
  • 04:37an OASIS standard for content analytics
  • 04:40originally developed at IBM.
  • 04:41The V3NLP Framework is a set of
  • 04:45functionalities and components that
  • 04:47provide Java developers the ability
  • 04:50to create novel annotators,
  • 04:52place annotators into pipelines,
  • 04:54and include applications to extract
  • 04:56concepts from clinical text.
  • 04:58These are scale-up and scale-out
  • 05:01functionalities developed with the
  • 05:03express purpose of processing
  • 05:05large numbers of records.
  • 05:07A machine learning annotator was added
  • 05:09at the tail end of the NLP pipeline
  • 05:12to enhance the pipeline's ability
  • 05:14to identify true symptoms.
  • 05:16This figure depicts the components
  • 05:18of the symptom extractor pipeline.
  • 05:20As is typical of UIMA pipelines,
  • 05:22this one is composed of a series of
  • 05:25annotators where the output of one
  • 05:27becomes the input of the next. Next.
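The chaining idea described here, where each annotator consumes what the previous one produced, can be sketched in a few lines. This is only an illustration in Python (the pipeline in the talk is built from Java UIMA components); the `Document` class, the two toy annotators, and `run_pipeline` are hypothetical names, not part of UIMA or the V3NLP Framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical stand-in for a shared analysis structure: text plus annotations."""
    text: str
    annotations: List[dict] = field(default_factory=list)


Annotator = Callable[[Document], Document]


def sectionizer(doc: Document) -> Document:
    # Toy rule: every line ending in ':' is treated as a section heading.
    for line in doc.text.splitlines():
        if line.strip().endswith(":"):
            doc.annotations.append({"type": "section", "name": line.strip()[:-1]})
    return doc


def tokenizer(doc: Document) -> Document:
    # Toy rule: whitespace tokens; a real tokenizer also marks sentences and phrases.
    for token in doc.text.split():
        doc.annotations.append({"type": "token", "text": token})
    return doc


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    # Each annotator receives the document as enriched by the previous one.
    for annotate in annotators:
        doc = annotate(doc)
    return doc


note = Document("Chief Complaint:\nshortness of breath")
result = run_pipeline(note, [sectionizer, tokenizer])
sections = [a["name"] for a in result.annotations if a["type"] == "section"]
tokens = [a["text"] for a in result.annotations if a["type"] == "token"]
```

Because every stage has the same signature, a new annotator can be slotted in anywhere in the list without changing the others.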
  • 05:32Annotators at the front end of the pipeline
  • 05:35decompose text into document elements.
  • 05:38The sectionizer breaks the notes into
  • 05:40sections, so chief complaint, history,
  • 05:43past medical history, medications, etc.
  • 05:45The tokenizer then breaks up the notes
  • 05:47further into component parts, including,
  • 05:50for example, sentences or phrases. Next.
  • 05:54The next part of the pipeline identifies
  • 05:57templated components of the notes
  • 05:59that require an assertion logic
  • 06:01different from that used in plain text
  • 06:03notes. Next.
  • 06:07So we're all familiar with
  • 06:09the straightforward SOAP note
  • 06:10documentation as shown in this sample,
  • 06:12so the subjective and objective
  • 06:14information from the patient is noted,
  • 06:16and then assessments and plans are made.
  • 06:19The symptom statements here are
  • 06:21fairly straightforward: positive
  • 06:22for shortness of breath and negative
  • 06:24for pain, chest pain, and palpitations.
  • 06:28Next. Checkboxes are one form that
  • 06:31templated text can take.
  • 06:33Obviously this is not natural language,
  • 06:35so the logic used to identify symptoms
  • 06:38here must be very different from
  • 06:41that used for a simple SOAP note.
  • 06:44Here, the condition of interest
  • 06:46is only true if there is a check
  • 06:48next to the concept of interest.
  • 06:50So for example,
  • 06:51in the first section homeless is mentioned,
  • 06:53but the computer needs to recognize
  • 06:55that the individual is only homeless if
  • 06:58there is a check mark next to that box.
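A minimal sketch of that checkbox rule, assuming a made-up template syntax in which `[X]` means checked and `[ ]` means unchecked. The pattern and the `parse_checkboxes` helper are illustrative assumptions, not the logic used in the actual pipeline.

```python
import re
from typing import Dict

# Assumed checkbox syntax: "[X] label" is checked, "[ ] label" is unchecked.
CHECKBOX = re.compile(r"\[(?P<mark>[xX ]?)\]\s*(?P<label>[^\[\n]+)")


def parse_checkboxes(text: str) -> Dict[str, bool]:
    """Map each checkbox label to True only if its box is marked."""
    results = {}
    for match in CHECKBOX.finditer(text):
        label = match.group("label").strip().lower()
        results[label] = match.group("mark").lower() == "x"
    return results


template = "Housing: [ ] Homeless  [X] Lives with family"
status = parse_checkboxes(template)
```

Here the word "homeless" appears in the text, but `status["homeless"]` is false because its box is empty, which is exactly the distinction a plain dictionary lookup would miss.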
  • 07:01Next.
  • 07:04For slots and values there is a
  • 07:06templated request for information. Here,
  • 07:08information requested includes percent service-connected
  • 07:10disability, an individual's
  • 07:12religion, marital status,
  • 07:13living situation, etc. Responses need
  • 07:15to be placed next to the request.
  • 07:18So for example, in line G,
  • 07:20much as in the checkboxes,
  • 07:22the computer needs to recognize
  • 07:25that the individual has children
  • 07:28only if a nonzero number is placed
  • 07:30next to the slot for children.
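The slot-and-value rule can be sketched the same way, assuming each slot sits on its own line as `Name: value`. The line format and the helper names `parse_slots` and `has_children` are assumptions made for illustration.

```python
import re
from typing import Dict

# Assumed slot syntax: "Slot name: value" on one line; the value may be empty.
SLOT = re.compile(r"^\s*(?P<slot>[A-Za-z][A-Za-z ]*?)\s*:\s*(?P<value>.*)$")


def parse_slots(text: str) -> Dict[str, str]:
    """Collect slot/value pairs, keeping empty values so they stay distinguishable."""
    slots = {}
    for line in text.splitlines():
        match = SLOT.match(line)
        if match:
            slots[match.group("slot").lower()] = match.group("value").strip()
    return slots


def has_children(slots: Dict[str, str]) -> bool:
    # True only when the slot holds a nonzero number, not merely when it appears.
    value = slots.get("children", "")
    return value.isdigit() and int(value) > 0
```

For example, `has_children(parse_slots("Children: 2"))` is true, while a note containing `Children: 0` or an unfilled `Children:` slot yields false even though the word "children" is present.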
  • 07:33Next. So again,
  • 07:34this part of the pipeline identifies
  • 07:37templated note sections and flags
  • 07:39them so that the computer can use
  • 07:41the appropriate logic to identify
  • 07:44the presence of symptoms. Next.
  • 07:46The term identification annotator is
  • 07:48the dictionary lookup portion of the
  • 07:52pipeline. A dictionary of 92,000 concepts,
  • 07:55or 122,000 symptom forms,
  • 07:58was created from Unified Medical
  • 08:01Language System, or UMLS, sources.
  • 08:04Terms within this resource are tagged
  • 08:07with a symptom category along with a
  • 08:09set of 15 organ system subcategories.
  • 08:12A dictionary of idiosyncratic symptom
  • 08:14phrases and symptoms not covered by the
  • 08:17symptom dictionary is also employed. Next.
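To make the dictionary lookup step concrete, here is a small sketch that maps surface forms to a concept and an organ-system subcategory, preferring the longest matching phrase. The four-entry dictionary and the greedy matching strategy are illustrative assumptions; the real resource holds roughly 122,000 forms and its matching logic is not described in the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical miniature dictionary: surface form -> (concept, organ-system subcategory).
SYMPTOM_DICTIONARY: Dict[str, Tuple[str, str]] = {
    "shortness of breath": ("dyspnea", "respiratory"),
    "sob": ("dyspnea", "respiratory"),
    "cough": ("cough", "respiratory"),
    "nausea": ("nausea", "gastrointestinal"),
}
MAX_TERM_TOKENS = max(len(term.split()) for term in SYMPTOM_DICTIONARY)


def lookup_terms(text: str) -> List[Tuple[str, str, str]]:
    """Greedy longest-match lookup over lowercase whitespace tokens."""
    tokens = [t.strip(".,;:") for t in text.lower().split()]
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_TERM_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in SYMPTOM_DICTIONARY:
                concept, category = SYMPTOM_DICTIONARY[candidate]
                hits.append((candidate, concept, category))
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return hits


found = lookup_terms("Reports cough and shortness of breath.")
```

Several surface forms ("sob", "shortness of breath") resolve to one concept, which is why the number of forms exceeds the number of concepts.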
  • 08:21An annotator was created specifically
  • 08:23to identify potential symptoms by rules
  • 08:26and patterns formed from annotations
  • 08:28created by the dictionary lookup
  • 08:31and document decomposition. Next.
  • 08:35The context assertion annotator was
  • 08:36included to identify negation,
  • 08:38so "patient denies pain."
  • 08:39It identifies the subject.
  • 08:40So is it the patient who reports
  • 08:43the symptom or someone else?
  • 08:44For example, in the family
  • 08:46history section of the note.
  • 08:48It identifies hypotheticals.
  • 08:50For example,
  • 08:51many medications are prescribed PRN,
  • 08:53PRN pain, or PRN dizziness.
  • 08:56It also identifies whether or not
  • 08:58the symptom is occurring now,
  • 08:59or if it is historical.
  • 09:01So something that occurred in the past,
  • 09:03so a note could say something
  • 09:05like six weeks ago patient
  • 09:06reported a cough. If we were only
  • 09:08looking for current symptoms,
  • 09:09the computer would need to
  • 09:11recognize that this cough is
  • 09:13not current and should not be
  • 09:15flagged as a symptom of interest.
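The four context checks just described (negation, subject, hypothetical, historical) can be sketched as trigger-word rules applied to the sentence containing a symptom. The trigger lists and the `Assertion` class below are hypothetical and far smaller than what a production assertion annotator would use; sentence-wide scope is also a simplification.

```python
import re
from dataclasses import dataclass

# Hypothetical trigger lists; real assertion annotators use far larger lexicons.
NEGATION = re.compile(r"\b(denies|no|without|negative for)\b", re.I)
HYPOTHETICAL = re.compile(r"\b(prn|as needed|if)\b", re.I)
HISTORICAL = re.compile(r"\b(\w+ (days|weeks|months|years) ago|history of)\b", re.I)
OTHER_SUBJECT = re.compile(r"\b(mother|father|brother|sister|family history)\b", re.I)


@dataclass
class Assertion:
    negated: bool
    hypothetical: bool
    historical: bool
    about_patient: bool

    @property
    def is_current_symptom(self) -> bool:
        return (self.about_patient and not self.negated
                and not self.hypothetical and not self.historical)


def assert_context(sentence: str) -> Assertion:
    """Classify the context of a symptom mention from triggers in its sentence."""
    return Assertion(
        negated=bool(NEGATION.search(sentence)),
        hypothetical=bool(HYPOTHETICAL.search(sentence)),
        historical=bool(HISTORICAL.search(sentence)),
        about_patient=not OTHER_SUBJECT.search(sentence),
    )
```

With these rules, "Six weeks ago patient reported a cough." is marked historical and so would not be flagged as a current symptom, while "Patient reports cough." would be.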
  • 09:17Next
  • 09:20Initially, the dictionary and rule-based
  • 09:23mechanisms produced approximately 9
  • 09:24false symptom mentions for each true
  • 09:27symptom identified.
  • 09:28An additional mechanism was needed
  • 09:30to filter down the false positives.
  • 09:33A tail-end annotator that employs a
  • 09:35machine learning model trained on 65
  • 09:38features gleaned from the upstream
  • 09:40annotators was developed for this purpose.
  • 09:43This model uses a support vector machine
  • 09:46coupled with stochastic gradient descent
  • 09:48as the classification algorithm. Next.
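As a rough illustration of a support vector machine trained by gradient steps on one sample at a time, the sketch below minimises an L2-regularised hinge loss in plain Python. It is not the model from the talk: the two features, the five training rows, and the hyperparameters are invented, and samples are visited in a fixed order (rather than shuffled) so the run is reproducible.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],          # +1 = true symptom, -1 = false positive
    epochs: int = 200,
    learning_rate: float = 0.1,
    l2: float = 0.01,
) -> Tuple[List[float], float]:
    """Minimise L2-regularised hinge loss with per-sample gradient steps."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The regulariser always shrinks the weights a little.
            weights = [w * (1 - learning_rate * l2) for w in weights]
            if margin < 1:
                # Hinge loss is active: nudge the boundary toward classifying x correctly.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Two hypothetical binary features per candidate mention:
# [found in a symptom-bearing section, negation trigger nearby]
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

In the pipeline described here, the feature vector would instead hold the 65 features produced by the upstream annotators, and the classifier's job is to discard candidate mentions that are not true symptoms.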
  • 09:51The original performance metrics
  • 09:52for the model were fairly good,
  • 09:55so precision, or positive predictive value,
  • 09:57was 0.8, recall, or sensitivity, was 0.7,
  • 10:00and the F-measure was 0.8.
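For reference, the three metrics named here are computed from counts of true positives, false positives, and false negatives. The sketch below uses the balanced F-measure (F1); the post-filter counts are hypothetical and chosen only to illustrate the formulas, not taken from the original evaluation.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision (PPV), recall (sensitivity), and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Nine false mentions per true symptom corresponds to a precision of about 0.1
# before filtering (recall is unknown here, so fn=0 is only a placeholder).
unfiltered_precision, _, _ = precision_recall_f1(tp=1, fp=9, fn=0)

# Hypothetical post-filter counts, chosen only to illustrate the formulas.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

The first call shows why the machine learning filter was needed: with nine false mentions per true one, only about one in ten flagged mentions was a real symptom.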
  • 10:03Next. So our goal in this initial
  • 10:08project is to adapt this symptom
  • 10:11extraction pipeline to identify COVID-19
  • 10:14symptoms in patients over time. Next.
  • 10:17Our sample will include veterans
  • 10:19from two well-established VA cohorts:
  • 10:21the Women Veterans Cohort Study, or
  • 10:23WVCS, and the VA Birth Cohort.
  • 10:26We will include individuals who tested
  • 10:29positive for COVID-19, and we will include
  • 10:33all of their notes from two weeks before
  • 10:36the diagnosis through two weeks after.
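Selecting notes in that window is a simple date filter. The sketch below assumes each note is a (date, text) pair and that both ends of the window are inclusive; the sample notes and the diagnosis date are made up.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple

WINDOW = timedelta(weeks=2)


def notes_in_window(
    notes: Iterable[Tuple[date, str]], diagnosis_date: date
) -> List[Tuple[date, str]]:
    """Keep notes dated from two weeks before through two weeks after diagnosis."""
    start, end = diagnosis_date - WINDOW, diagnosis_date + WINDOW
    return [(d, text) for d, text in notes if start <= d <= end]


# Hypothetical notes for one patient diagnosed on 15 April 2020.
notes = [
    (date(2020, 3, 20), "routine visit"),
    (date(2020, 4, 2), "reports cough"),
    (date(2020, 4, 29), "follow-up call"),
    (date(2020, 5, 10), "later visit"),
]
selected = notes_in_window(notes, date(2020, 4, 15))
```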
  • 10:39To give you a bit of information
  • 10:41on the two cohorts:
  • 10:43WVCS is a cohort of veterans identified
  • 10:45from the roster of post-9/11 conflicts.
  • 10:48Information from the roster is
  • 10:50available and includes separation date,
  • 10:52birth date, date of last deployment,
  • 10:54and armed forces
  • 10:55branch and component. Roster data
  • 10:57have also been linked to electronic
  • 10:59health record data. WVCS includes
  • 11:02approximately 1.2 million individuals.
  • 11:04It represents a younger cohort.
  • 11:06The mean age for women was
  • 11:0829 and for men 30 years.
  • 11:15As is typical in the VA, this cohort
  • 11:18is primarily male and white.
  • 11:20However, it is important to remember
  • 11:21that within the VA there is richer
  • 11:24racial and ethnic diversity
  • 11:25than in the general population,
  • 11:27particularly among women. Next.
  • 11:30The VA birth cohort is an EHR based cohort.
  • 11:33It includes all veterans
  • 11:35born between 1945 and 1965,
  • 11:37so these are baby boomer veterans,
  • 11:39much older than most of those in
  • 11:42WVCS. The total sample size is 4.2 million.
  • 11:45The age range is 55 to 75 years and
  • 11:48again it is majority white and male,
  • 11:51but it is important to note that even
  • 11:54though women are only 15% of this cohort,
  • 11:57this represents almost half a
  • 12:00million women. Next.
  • 12:01In terms of our sample size,
  • 12:04as of May 16th at 5:41 PM,
  • 12:07the cumulative number of COVID-19
  • 12:08cases within the VA was
  • 12:11approximately 12,000. Next.
  • 12:14So how are we gonna test and adapt our
  • 12:17symptom pipeline? The first step will
  • 12:19be to restrict the symptom dictionary
  • 12:21so that the terms included are only
  • 12:23those pertinent to COVID-19. Next.
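Restricting the dictionary amounts to keeping only the surface forms whose concept is on a COVID-19 list. In the sketch below the allowed concepts follow the symptoms named earlier in the talk, while the five-entry general dictionary and the concept labels are hypothetical.

```python
from typing import Dict, Set

# Symptom concepts named earlier in the talk as associated with COVID-19.
COVID_CONCEPTS: Set[str] = {
    "fever", "cough", "dyspnea", "nasal congestion", "sore throat", "anosmia",
    "headache", "dizziness", "fatigue", "myalgia", "chills",
    "nausea", "loss of appetite", "vomiting", "diarrhea",
}


def restrict_dictionary(full: Dict[str, str], keep: Set[str]) -> Dict[str, str]:
    """Keep only surface forms whose concept is in the allowed set."""
    return {form: concept for form, concept in full.items() if concept in keep}


# Hypothetical excerpt of a general dictionary: surface form -> concept.
general = {
    "shortness of breath": "dyspnea",
    "sob": "dyspnea",
    "muscle aches": "myalgia",
    "knee pain": "arthralgia",
    "itching": "pruritus",
}
covid_dictionary = restrict_dictionary(general, COVID_CONCEPTS)
```

Filtering by concept rather than by surface form keeps every synonym and abbreviation of a retained symptom.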
  • 12:27The next step is to run this
  • 12:30restricted symptom extractor
  • 12:31pipeline on all of the notes and to
  • 12:33have clinicians review the results.
  • 12:35Seven clinicians will review a
  • 12:37random subset of 700 notes.
  • 12:39Clinicians will first create guidelines
  • 12:42for identifying positive and negative
  • 12:44notes based on their clinical knowledge
  • 12:47and an initial review of 100 notes.
  • 12:49The guidelines will be revised
  • 12:51until a kappa of 0.85 for inter-rater
  • 12:55reliability is achieved.
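Kappa measures agreement beyond what chance alone would produce. The sketch below computes Cohen's kappa for one pair of raters; with seven clinicians, a multi-rater statistic such as Fleiss' kappa or an average over pairs would be an option, and the talk does not say which is planned. The labels are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of notes.")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two clinicians on the same ten notes.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]
b = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "neg", "pos"]
kappa = cohens_kappa(a, b)
```

In this made-up example the raters agree on nine of ten notes, yet kappa comes out at 0.8, below the 0.85 target, so the guidelines would go through another round of revision.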
  • 12:57Each clinician will then review
  • 12:59and evaluate 150
  • 13:00notes out of the remaining 600
  • 13:02notes so that each note is reviewed
  • 13:04by at least two clinicians.
  • 13:06We will then compare reviewer assessments;
  • 13:08where the two reviewers disagree,
  • 13:09the PI will make the final decision. Next.
  • 13:13The third step will be to compare
  • 13:16the symptoms identified by the
  • 13:18pipeline with those identified by
  • 13:19the clinicians in these 700 notes,
  • 13:21and we're targeting precision
  • 13:24and recall at 0.8. Next.
  • 13:26If we do not achieve this goal,
  • 13:28there are a number of approaches that we
  • 13:30can use to improve pipeline performance.
  • 13:32The first will be to augment the symptom
  • 13:35terms identified by the dictionary.
  • 13:36To do this,
  • 13:37we will use topic modeling to identify
  • 13:39relevant symptom terms in the notes.
  • 13:42Topic modeling is a machine learning
  • 13:44technique that can be applied to
  • 13:46large corpora to discover themes,
  • 13:48i.e., symptom topics, that are
  • 13:50semantically related.
  • 13:51We can pretrain a Bidirectional
  • 13:54Encoder Representations from
  • 13:55Transformers, or BERT, model on 10,000
  • 13:58documents with keywords to boost the
  • 14:00NLP's ability to recognize synonyms,
  • 14:02related terms, and misspellings.
  • 14:04Finally,
  • 14:04we can target the machine learning
  • 14:07component of the pipeline and train
  • 14:10and test support vector machine models
  • 14:12with different configurations. Next.
  • 14:15We're applying for funding for this
  • 14:17project from the VA rapid response project
  • 14:20calls. We're also submitting a proposal
  • 14:22in response to YSN's call
  • 14:25for intramural pilot grants. Next.
  • 14:27Once we have adapted the pipeline
  • 14:29to accurately identify COVID-19
  • 14:31symptoms in VA EHR text notes,
  • 14:33there are a number of projects that
  • 14:36we are interested in pursuing. Next.
  • 14:39The first project will focus on
  • 14:41evaluating the risk of infection
  • 14:43and death associated with SARS-CoV-2
  • 14:45and influenza in the six months
  • 14:48following the index infection with COVID-19.
  • 14:51So COVID-19 will be defined as a
  • 14:53positive PCR collected at least eight
  • 14:55weeks after the index infection
  • 14:58and by the presence of symptoms.
  • 15:00This project is led by Doctor Rupak
  • 15:03Datta, an instructor in Infectious Diseases
  • 15:05at the West Haven VA and at Yale New Haven.
  • 15:08His mentors include Doctors
  • 15:09Kathleen Akgün,
  • 15:10Cynthia Brandt, and [inaudible]. Next.
  • 15:14We're also interested in looking at
  • 15:16symptoms versus symptom clusters,
  • 15:18and their associations with COVID-19
  • 15:19testing and seropositivity.
  • 15:21In particular,
  • 15:22we are interested in exploring whether
  • 15:24symptoms or symptom clusters differ by age,
  • 15:26sex,
  • 15:27race, and VA region. I'm the PI
  • 15:29on this project,
  • 15:31and I'm working with Doctors Datta,
  • 15:34Akgün, Brandt, and Justice. Next.
  • 15:37Additional projects include
  • 15:39validating an approach to identifying
  • 15:41COVID-19 infection in VA data for
  • 15:43research and QI purposes that includes
  • 15:45the combination of symptoms
  • 15:47or symptom clusters,
  • 15:48results of chest radiographs or CT scans,
  • 15:51and PCR testing. We're also interested
  • 15:53in exploring whether or not we can
  • 15:56use the adapted symptom extractor
  • 15:57as the foundation for an EHR-based
  • 16:00biosurveillance system to identify
  • 16:03the onset of new COVID-19
  • 16:05surges. We're interested in seeing
  • 16:07whether or not this symptom extractor
  • 16:09can be adapted to other electronic
  • 16:11health records, such as Epic,
  • 16:13and to other electronic data
  • 16:15sources, such as Google.
  • 16:17Finally, we're interested in
  • 16:19looking at associations of
  • 16:20symptoms and symptom clusters
  • 16:22with COVID-19 viral load. Next.
  • 16:26All the work that I've described
  • 16:28is the product of team science.
  • 16:30Members of the team are from Yale,
  • 16:32the School of Nursing
  • 16:33and the School of Medicine,
  • 16:35George Washington University, and OHSU. Next.
  • 16:37Thank you much.
  • 16:38Thank you very much for your time.
  • 16:46Thank you very much.