Adapting a Machine Learning Algorithm to Track Symptoms of COVID-19

May 22, 2020

Information

Julie Womack PhD, CNM, FNP (BC) Associate Professor of Nursing

ID5232


00:00I'd like to introduce our next speaker,
00:02Doctor Julie Womack.
00:03Doctor Womack is an associate
00:05professor at the Yale School of
00:07Nursing and a health sciences
00:09researcher at the West Haven VA.
00:11She received her PhD in nursing from Yale
00:13University and completed a postdoctoral
00:15fellowship in informatics at the VA.
00:18Doctor Womack, thank you for being here.
00:22Thank you, and can everyone hear me?
00:25I hope, um.
00:27So I'll be talking with you about
00:29the work that colleagues of mine
00:31and I are doing to adapt an NLP
00:34pipeline machine learning algorithm
00:36to identify symptoms of COVID-19.
00:40Next. Symptoms are of crucial
00:42importance to patients.
00:44They are how an individual
00:46experiences their illness.
00:47For providers, symptoms are markers
00:50that can help to identify disease or
00:53to develop a list of differentials.
00:56Recognition of symptoms has been
00:58an important component of COVID-19.
01:00The symptoms considered as markers
01:02of the disease have changed over time.
01:05Initially it was fever,
01:06cough and shortness of breath.
01:08These are still considered to
01:10be the primary symptoms,
01:12but this list has expanded to include
01:15others such as nasal congestion,
01:17sore throat, anosmia,
01:19ageusia, headache,
01:21dizziness, fatigue, muscle aches,
01:23chills and GI symptoms,
01:24including nausea, loss of appetite,
01:26vomiting and diarrhea. Next.
01:30In both SARS and MERS, symptom-based
01:33case detection and subsequent testing
01:35to guide isolation and quarantine
01:36were key, and there was minimal
01:39evidence that asymptomatic cases were
01:41important routes of transmission.
01:43With COVID-19 there is potentially
01:45a sizable percentage of cases
01:47that are asymptomatic,
01:48and these have been shown to be
01:50important players in viral transmission.
01:53So symptoms alone are insufficient to
01:55identify cases of COVID-19 or even
01:58to identify those who should be tested. Next.
02:01Furthermore,
02:01most of the symptoms experienced with
02:04COVID-19 are not unique to COVID-19,
02:07but rather are shared by
02:10other respiratory viruses
02:11and health conditions.
02:12Here's a graph of COVID-19 testing within
02:15the VA Connecticut health care system.
02:18The number of tests is
02:21noted on the vertical axis.
02:23Dates are on the horizontal,
02:25the Green Line represents
02:27negative COVID-19 tests.
02:28The red line is positive tests
02:30and the blue line represents
02:32those with pending results.
02:34So out of the 1,600 tests done through May 7th,
02:38214, or 12%, were positive
02:40for COVID-19.
02:41These results suggest that the majority
02:44of those with symptoms may not,
02:46in fact, have COVID-19. Next.
02:52But despite the limitations to using
02:54symptoms to diagnose COVID-19 or to
02:57identify those who need to be tested
02:59and quarantined, symptoms are still an
03:01important component of the pandemic.
03:03I work with a number of investigators
03:06who are interested in using VA electronic
03:08health record, or EHR, data to study
03:11different aspects of symptoms in COVID-19.
03:13The first step for all of these projects
03:15is to develop a reliable approach for
03:18identifying these symptoms in the EHR.
03:21A number of possible approaches exist.
03:23These include looking at problem lists,
03:25ICD codes, and inferring symptoms
03:28from prescription data. However,
03:30all of these approaches underestimate
03:32the number and type of symptoms
03:34discussed at a visit.
03:36Most documentation of symptoms
03:38takes place in clinical notes.
03:40These documented symptoms can be extracted
03:42from text notes using natural language
03:44processing and machine learning algorithms,
03:47and then converted into structured data
03:49for the purposes of analysis.
03:53So today I'm gonna talk a bit about the
03:55symptom extractor pipeline that we will
03:57adapt to identify COVID-19 symptoms in VA
03:59clinical notes.
04:00I'm going to talk a bit about what
04:02that adaptation process will look like,
04:04and then I'm going to briefly describe
04:07projects that will build on this work. Next.
04:10The symptom extractor pipeline that we
04:12will use was originally developed by
04:14Guy Divita and colleagues from the VA
04:17Salt Lake City Health Care System.
04:19Next. It is a UIMA natural language
04:23processing pipeline that was assembled
04:26using v3NLP Framework components.
04:29Both UIMA and v3NLP
04:31are open source software.
04:33UIMA, short for Unstructured
04:36Information Management Architecture, is
04:37an OASIS standard for content analytics,
04:40originally developed at IBM.
04:41The v3NLP Framework is a set of
04:45functionalities and components that
04:47provide Java developers the ability
04:50to create novel annotators,
04:52place annotators into pipelines,
04:54and include applications to extract
04:56concepts from clinical text.
04:58There are scale-up and scale-out
05:01functionalities developed with the
05:03express purpose of processing
05:05large numbers of records.
05:07A machine learning annotator was added
05:09at the tail end of the NLP pipeline
05:12to enhance the pipeline's ability
05:14to identify true symptoms.
05:16This figure depicts the components
05:18of the symptom extractor pipeline.
05:20As is typical of UIMA pipelines,
05:22this one is composed of a series of
05:25annotators where the output of one
05:27becomes the input of the next. Next.
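The chaining idea described here, where each annotator's output is the next one's input, can be sketched in a few lines. This is a minimal illustration in Python rather than the Java UIMA API; the two annotators and the dictionary-based document are hypothetical stand-ins, not components of the actual pipeline.

```python
from typing import Callable, Dict, List

# A "document" here is just a dict carrying the text plus accumulated annotations.
Document = Dict[str, object]
Annotator = Callable[[Document], Document]

def lowercase_annotator(doc: Document) -> Document:
    # Hypothetical first stage: add a normalized copy of the text.
    return {**doc, "normalized": str(doc["text"]).lower()}

def word_annotator(doc: Document) -> Document:
    # Hypothetical second stage: consumes what the previous stage produced.
    return {**doc, "words": str(doc["normalized"]).split()}

def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    # Each annotator receives the document produced by the one before it.
    for annotate in annotators:
        doc = annotate(doc)
    return doc

result = run_pipeline({"text": "Patient reports COUGH"},
                      [lowercase_annotator, word_annotator])
```

Order matters in this design: `word_annotator` would fail if it ran before `lowercase_annotator`, which mirrors why downstream annotators depend on upstream ones.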
05:32Annotators at the front end of the pipeline
05:35decompose text into document elements.
05:38The sectionizer breaks the notes into
05:40sections, so chief complaint, history,
05:43past medical history, medications, etc.
05:45The tokenizer then breaks up the notes
05:47further into component parts, including,
05:50for example, sentences or phrases. Next.
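As a rough sketch of document decomposition, the snippet below splits a note into sections and then into sentences. The header list, the `HEADER:` convention, and the naive sentence rule are all assumptions made for illustration; the real sectionizer and tokenizer are considerably more involved.

```python
import re
from typing import Dict, List

# Hypothetical section headers; a real sectionizer would recognize many more.
SECTION_HEADERS = ["CHIEF COMPLAINT", "PAST MEDICAL HISTORY", "MEDICATIONS"]

def split_sections(note: str) -> Dict[str, str]:
    # Match "HEADER:" at the start of a line; a section runs up to the next header.
    pattern = re.compile(r"^(%s):" % "|".join(SECTION_HEADERS), re.MULTILINE)
    matches = list(pattern.finditer(note))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(note)
        sections[m.group(1)] = note[m.end():end].strip()
    return sections

def split_sentences(text: str) -> List[str]:
    # Naive sentence splitter: break after ., ?, or ! followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

note = "CHIEF COMPLAINT: Cough for 3 days. Denies fever.\nMEDICATIONS: Albuterol PRN."
sections = split_sections(note)
sentences = split_sentences(sections["CHIEF COMPLAINT"])
```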
05:54The next part of the pipeline identifies
05:57templated components of the notes
05:59that require an assertion logic
06:01different from that used in plain text
06:03notes. Next.
06:07So we're all familiar with
06:09the straightforward SOAP note
06:10documentation as shown in this sample,
06:12so the subjective and objective
06:14information from the patient is noted,
06:16and then assessments and plans are made.
06:19The symptom statements here are
06:21fairly straightforward: positive
06:22for shortness of breath and negative
06:24for pain, chest pain, and palpitations.
06:28Next. Checkboxes are one form that
06:31templated text can take.
06:33Obviously this is not natural language,
06:35so the logic used to identify symptoms
06:38here must be very different from
06:41that used for a simple SOAP note.
06:44Here, the condition of interest
06:46is only true if there is a check
06:48next to the concept of interest.
06:50So for example,
06:51in the first section homeless is mentioned,
06:53but the computer needs to recognize
06:55that the individual is only homeless if
06:58there is a check mark next to that box.
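The checkbox logic can be illustrated with a small parser. The `[X]` / `[ ]` syntax below is an assumed rendering of a checkbox in note text; actual templates vary, and this is only a sketch of the rule that a label counts as asserted only when its box is checked.

```python
import re
from typing import Dict

# Assumed checkbox syntax: "[X] label" means checked, "[ ] label" means unchecked.
CHECKBOX = re.compile(r"\[(?P<mark>[xX ])\]\s*(?P<label>[^\[\n]+)")

def parse_checkboxes(text: str) -> Dict[str, bool]:
    # Map each label to True only when its box contains a check mark.
    return {m.group("label").strip().lower(): m.group("mark").strip() != ""
            for m in CHECKBOX.finditer(text)}

boxes = parse_checkboxes("Housing: [ ] Homeless  [X] Lives with family")
```

Here the word "Homeless" appears in the text, yet `boxes["homeless"]` is false because its box is empty, which is exactly the distinction a plain keyword search would miss.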
07:01Next.
07:04For slots and values, there is a
07:06templated request for information. Here,
07:08information requested includes percent
07:10service-connected disability, an individual's
07:12religion, marital status,
07:13living situation, etc. Responses need
07:15to be placed next to the request.
07:18So for example, in line G,
07:20much as in the checkboxes,
07:22the computer needs to recognize
07:25that the individual has children
07:28only if a nonzero number is placed
07:30next to the slot for children.
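A slot-and-value rule of this kind might look like the sketch below. The `Slot: value` layout and the helper names are assumptions for illustration; the point is that the concept is asserted from the value beside the slot, not from the slot name appearing in the text.

```python
import re
from typing import Optional

def slot_value(text: str, slot: str) -> Optional[str]:
    # Find "slot: value" on one line; return the value, or None if absent or blank.
    m = re.search(r"^\s*%s\s*:[ \t]*(.*)$" % re.escape(slot), text,
                  re.MULTILINE | re.IGNORECASE)
    if not m or not m.group(1).strip():
        return None
    return m.group(1).strip()

def has_children(text: str) -> bool:
    # The concept is asserted only when the slot holds a nonzero number.
    value = slot_value(text, "Children")
    return value is not None and value.isdigit() and int(value) > 0

# Made-up form text for the example.
form = "Marital status: Married\nChildren: 0\nReligion:"
```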
07:33Next. So again,
07:34this part of the pipeline identifies
07:37templated note sections and flags
07:39them so that the computer can use
07:41the appropriate logic to identify
07:44the presence of symptoms. Next.
07:46The term identification annotator is
07:48the dictionary look up portion of the
07:52pipeline. A dictionary of 92,000 concepts,
07:55or 122,000 symptom forms,
07:58was created from Unified Medical
08:01Language System, or UMLS, sources.
08:04Terms within this resource are tagged
08:07with a symptom category along with a
08:09set of 15 organ system subcategories.
08:12A dictionary of idiosyncratic symptom
08:14phrases and symptoms not covered by the
08:17symptom dictionary is also employed. Next.
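A dictionary lookup over tokens can be sketched as below. The four entries are invented for the example and are not drawn from the UMLS-derived resource, and the greedy longest-match strategy is an assumption; the talk does not say how the term identification annotator resolves overlapping forms.

```python
from typing import Dict, List, Tuple

# Tiny illustrative dictionary: surface form -> (concept, organ-system category).
# These entries are made up, not taken from the UMLS-derived resource.
SYMPTOM_DICTIONARY: Dict[str, Tuple[str, str]] = {
    "shortness of breath": ("dyspnea", "respiratory"),
    "sob": ("dyspnea", "respiratory"),
    "cough": ("cough", "respiratory"),
    "loss of appetite": ("anorexia", "gastrointestinal"),
}
MAX_WORDS = max(len(form.split()) for form in SYMPTOM_DICTIONARY)

def lookup_terms(tokens: List[str]) -> List[Tuple[int, int, str, str]]:
    # Greedy longest-match lookup: try the longest window first at each position.
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in SYMPTOM_DICTIONARY:
                concept, category = SYMPTOM_DICTIONARY[phrase]
                hits.append((i, i + n, concept, category))
                i += n
                break
        else:
            i += 1  # no dictionary form starts at this token
    return hits

hits = lookup_terms("Reports cough and shortness of breath".split())
```

Mapping several surface forms ("sob", "shortness of breath") to one concept is what lets a dictionary of many forms resolve to a smaller number of concepts.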
08:21An annotator was created specifically
08:23to identify potential symptoms by rules
08:26and patterns formed from annotations
08:28created by the dictionary look up
08:31and document decomposition. Next.
08:35The context assertion annotator was
08:36included to identify negation,
08:38so "patient denies pain."
08:39It identifies the subject,
08:40so is it the patient who reports
08:43the symptom or someone else?
08:44For example, in the family
08:46history section of the note.
08:48It identifies hypotheticals.
08:50For example,
08:51many medications are prescribed PRN,
08:53PRN pain, or PRN dizziness.
08:56It also identifies whether or not
08:58the symptom is occurring now,
08:59or if it is historical.
09:01so something that occurred in the past.
09:03So a note could say something
09:05like "six weeks ago patient
09:06reported a cough." If we were only
09:08looking for current symptoms,
09:09the computer would need to
09:11recognize that this cough is
09:13not current and should not be
09:15flagged as a symptom of interest.
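The four context checks described here (negation, subject, hypothetical, historical) can be sketched with trigger phrases. The trigger lists are hypothetical and far smaller than any production rule set, and restricting the search to the text before the symptom is a simplification, not the annotator's actual scoping logic.

```python
import re
from typing import Dict

# Hypothetical trigger phrases; production rule sets are much larger.
NEGATION = re.compile(r"\b(denies|no|without|negative for)\b", re.I)
HISTORICAL = re.compile(
    r"\b(\d+|one|two|three|four|five|six)\s+(day|week|month|year)s?\s+ago\b", re.I)
HYPOTHETICAL = re.compile(r"\b(prn|as needed|if)\b", re.I)
OTHER_SUBJECT = re.compile(r"\b(mother|father|brother|sister|family history)\b", re.I)

def assert_context(sentence: str, symptom: str) -> Dict[str, bool]:
    # Look only at the text preceding the symptom mention (a simplification).
    idx = sentence.lower().find(symptom.lower())
    scope = sentence[:idx] if idx >= 0 else sentence
    return {
        "negated": bool(NEGATION.search(scope)),
        "historical": bool(HISTORICAL.search(scope)),
        "hypothetical": bool(HYPOTHETICAL.search(scope)),
        "patient_is_subject": not OTHER_SUBJECT.search(scope),
    }

def is_current_patient_symptom(sentence: str, symptom: str) -> bool:
    # A mention counts only if the patient is the subject and no modifier applies.
    ctx = assert_context(sentence, symptom)
    return ctx["patient_is_subject"] and not (
        ctx["negated"] or ctx["historical"] or ctx["hypothetical"])
```

With these rules, "Six weeks ago patient reported a cough." is marked historical and so is not flagged as a current symptom.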
09:17Next.
09:20Initially, the dictionary and rule-based
09:23mechanisms produced approximately 9
09:24false symptom mentions for each true
09:27symptom identified.
09:28An additional mechanism was needed
09:30to filter down the false positives.
09:33A tail-end annotator that employs a
09:35machine learning model trained on 65
09:38features gleaned from the upstream
09:40annotators was developed for this purpose.
09:43This model uses a support vector machine
09:46coupled with stochastic gradient descent
09:48as the classification algorithm. Next.
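To make "support vector machine coupled with stochastic gradient descent" concrete, here is a minimal pure-Python sketch of a linear SVM trained by SGD on the regularized hinge loss. The two toy features, the data, and the hyperparameters are invented for illustration; the actual model uses 65 features whose definitions are not given in the talk.

```python
import random
from typing import List, Tuple

def train_linear_svm(X: List[List[float]], y: List[int], epochs: int = 50,
                     lr: float = 0.1, lam: float = 0.01,
                     seed: int = 0) -> Tuple[List[float], float]:
    # Minimize L2-regularized hinge loss with stochastic gradient descent.
    # Labels must be +1 (true symptom) or -1 (false mention).
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The regularization term shrinks the weights on every step.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:  # inside the margin or misclassified
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w: List[float], b: float, x: List[float]) -> int:
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy features (made up): [in_symptom_section, negated]
X = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]
y = [1, 1, -1, -1, -1, 1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

In this toy data a mention is a true symptom exactly when it is not negated, so the learned weight on the second feature ends up strongly negative.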
09:51The original performance metrics
09:52for the model were fairly good,
09:55so precision or positive predictive value
09:57was 0.8, recall or sensitivity was 0.7,
10:00and the F-measure was 0.8.
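For reference, the three metrics named here can be computed from counts of true positives, false positives, and false negatives as sketched below. The counts are hypothetical and chosen only to show the arithmetic. One figure from the talk can be checked directly: nine false mentions for each true symptom corresponds to a precision of 1 / (1 + 9) = 0.1.

```python
def precision(tp: int, fp: int) -> float:
    # Positive predictive value: share of flagged mentions that are true symptoms.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Sensitivity: share of true symptoms that the pipeline flagged.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    # Weighted harmonic mean of precision and recall (beta = 1 gives F1).
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Hypothetical counts chosen only to illustrate the arithmetic.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=20)
f1 = f_measure(p, r)

# Nine false mentions per true symptom, as before the machine learning filter.
baseline_precision = precision(tp=1, fp=9)
```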
10:03Next. So our goal in this initial
10:08project is to adapt this symptom
10:11extraction pipeline to identify COVID-19
10:14symptoms in patients over time. Next.
10:17Our sample will include veterans
10:19from two well-established VA cohorts:
10:21the Women Veterans Cohort Study, or
10:23WVCS, and the VA Birth Cohort.
10:26We will include individuals who tested
10:29positive for COVID-19 and we will include
10:33all of their notes from 2 weeks before
10:36the diagnosis through two weeks after.
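The note-selection rule (two weeks before diagnosis through two weeks after) is a simple date filter. The sketch below assumes each note is a `(date, text)` pair and treats both ends of the window as inclusive; the notes and diagnosis date are made up.

```python
from datetime import date, timedelta
from typing import List, Tuple

WINDOW = timedelta(weeks=2)

def notes_in_window(notes: List[Tuple[date, str]],
                    diagnosis: date) -> List[Tuple[date, str]]:
    # Keep notes dated from two weeks before through two weeks after diagnosis.
    return [(d, text) for d, text in notes
            if diagnosis - WINDOW <= d <= diagnosis + WINDOW]

# Made-up notes for one hypothetical patient.
notes = [(date(2020, 3, 1), "routine visit"), (date(2020, 4, 10), "cough, fever"),
         (date(2020, 4, 28), "follow-up"), (date(2020, 6, 1), "refill")]
selected = notes_in_window(notes, diagnosis=date(2020, 4, 15))
```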
10:39To give you a bit of information
10:41on the two cohorts.
10:43WVCS is a cohort of veterans identified
10:45from the roster of post-9/11 conflicts.
10:48Information from the roster is
10:50available and includes separation date,
10:52birth date, date of last deployment,
10:54and armed forces
10:55branch and component. Roster data
10:57have also been linked to electronic
10:59health record data. WVCS includes
11:02approximately 1.2 million individuals.
11:04It represents a younger cohort.
11:06The mean age for women was
11:0829 and for men 30 years.
11:11As is typical in the VA,
11:15this cohort
11:18is primarily male and white.
11:20However, it is important to remember
11:21that within the VA there is richer
11:24racial and ethnic diversity
11:25than in the general population,
11:27particularly among women. Next.
11:30The VA Birth Cohort is an EHR-based cohort.
11:33It includes all veterans
11:35born between 1945 and 1965,
11:37so these are baby boomer veterans,
11:39much older than most of those in
11:42WVCS. The total sample size is 4.2 million.
11:45The age range is 55 to 75 years and
11:48again it is majority white and male,
11:51but it is important to note that even
11:54though women are only 15% of this cohort,
11:57this represents almost half a
12:00million women. Next.
12:01In terms of our sample size,
12:04as of May 16th at 5:41 PM,
12:07the cumulative number of COVID-19
12:08cases within the VA was
12:11approximately 12,000. Next.
12:14So how are we gonna test and adapt our
12:17symptom pipeline? The first step will
12:19be to restrict the symptom dictionary
12:21so that the terms included are only
12:23those pertinent to COVID-19. Next.
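Restricting the dictionary amounts to filtering surface forms by an allow-list of concepts. In the sketch below the allow-list is built from symptoms named earlier in the talk, while the concept labels and the small dictionary are assumptions made for illustration.

```python
from typing import Dict, Set

# Symptoms named earlier in the talk; the concept labels themselves are assumed.
COVID_CONCEPTS: Set[str] = {
    "fever", "cough", "dyspnea", "nasal congestion", "sore throat", "headache",
    "dizziness", "fatigue", "myalgia", "chills", "nausea", "vomiting", "diarrhea",
}

def restrict_dictionary(full: Dict[str, str], keep: Set[str]) -> Dict[str, str]:
    # Keep only surface forms whose concept is on the allow-list.
    return {form: concept for form, concept in full.items() if concept in keep}

# Made-up slice of a larger symptom dictionary: surface form -> concept.
full_dictionary = {"sob": "dyspnea", "shortness of breath": "dyspnea",
                   "knee pain": "arthralgia", "muscle aches": "myalgia"}
covid_dictionary = restrict_dictionary(full_dictionary, COVID_CONCEPTS)
```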
12:27The next step is to run this
12:30restricted symptom extractor
12:31pipeline on all of the notes and to
12:33have clinicians review the results.
12:35Seven clinicians will review a
12:37random subset of 700 notes.
12:39Clinicians will first create guidelines
12:42for identifying positive and negative
12:44notes based on their clinical knowledge
12:47and an initial review of 100 notes.
12:49The guidelines will be revised
12:51until a kappa of 0.85 for inter-rater
12:55reliability is achieved.
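The talk does not say which kappa statistic will be used. As one possibility, the sketch below computes Cohen's kappa for a single pair of raters; with more than two raters a multi-rater statistic such as Fleiss' kappa would be the usual choice. The labels are invented for the example.

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    # Agreement between two raters, corrected for agreement expected by chance.
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for ten notes from two hypothetical reviewers.
r1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]
r2 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "neg", "pos"]
kappa = cohens_kappa(r1, r2)
```

In this example the reviewers agree on 9 of 10 notes, chance agreement is 0.5, and kappa works out to 0.8, which would fall just short of a 0.85 threshold.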
12:57Each clinician will then review
12:59and evaluate 150
13:00notes out of the remaining 600
13:02notes so that each note is reviewed
13:04by at least two clinicians.
13:06We will then compare reviewer assessments.
13:08Where the two reviewers disagree,
13:09the PI will make the final decision. Next.
13:13The third step will be to compare
13:16the symptoms identified by the
13:18pipeline with those identified by
13:19the clinicians in these 700 notes,
13:21and we're targeting precision
13:24and recall at 0.8. Next.
13:26If we do not achieve this goal,
13:28there are a number of approaches that we
13:30can use to improve pipeline performance.
13:32The first will be to augment the symptom
13:35terms identified by the dictionary.
13:36To do this,
13:37we will use topic modeling to identify
13:39relevant symptom terms in the notes.
13:42Topic modeling is a machine learning
13:44technique that can be applied to
13:46large corpora to discover themes,
13:48i.e., symptom topics, that are
13:50semantically related.
13:51We can pretrain a Bidirectional
13:54Encoder Representations from
13:55Transformers, or BERT, model on 10,000
13:58documents with keywords to boost the
14:00NLP's ability to recognize synonyms,
14:02related terms, and misspellings.
14:04Finally,
14:04we can target the machine learning
14:07component of the pipeline and train
14:10and test support vector machine models
14:12with different configurations. Next.
14:15We're applying for funding for this
14:17project from the VA rapid response project
14:20calls. We're also submitting a proposal
14:22in response to YSN's call
14:25for intramural pilot grants. Next.
14:27Once we have adapted the pipeline
14:29to accurately identify COVID-19
14:31symptoms in VA EHR text notes,
14:33there are a number of projects that
14:36we are interested in pursuing. Next.
14:39The first project will focus on
14:41evaluating the risk of infection
14:43and death associated with SARS-CoV-2
14:45and influenza in the six months
14:48following the index infection with COVID-19.
14:51COVID-19 will be defined as a
14:53positive PCR collected at least eight
14:55weeks after the index infection
14:58and by the presence of symptoms.
15:00This project is led by Doctor Rupak Datta,
15:03an instructor in Infectious Diseases
15:05at the West Haven VA and at Yale New Haven.
15:08His mentors include Doctors
15:09Kathleen Akgün,
15:10Cynthia Brandt, and Amy Justice. Next.
15:14We're also interested in looking at
15:16symptoms versus symptom clusters,
15:18and their associations with COVID-19
15:19testing and seropositivity.
15:21In particular,
15:22we are interested in exploring whether
15:24symptoms or symptom clusters differ by age,
15:26sex,
15:27race, and VA region. I'm the PI
15:29on this project,
15:31and I'm working with Doctors Akgün,
15:34Brandt, and Justice. Next.
15:37Additional projects include
15:39validating an approach to identifying
15:41COVID-19 infection in VA data for
15:43research and QI purposes that includes
15:45the combination of symptoms
15:47or symptom clusters,
15:48results of chest radiographs or CT scans,
15:51and PCR testing. We're also interested
15:53in exploring whether or not we can
15:56use the adapted symptom extractor
15:57as the foundation for an EHR-based
16:00biosurveillance system to identify
16:03the onset of new COVID-19
16:05surges. We're interested in seeing
16:07whether or not this symptom extractor
16:09can be adapted to other electronic
16:11health records, such as Epic,
16:13and to other electronic data
16:15sources such as Google.
16:17Finally, we're interested in
16:19looking at associations between
16:20symptoms and symptom clusters
16:22with COVID-19 viral load. Next.
16:26All the work that I've described
16:28is the product of team science.
16:30Members of the team are from Yale,
16:32the School of Nursing
16:33and the School of Medicine,
16:35George Washington University, and OHSU. Next.
16:37Thank you much.
16:38Thank you very much for your time.
16:46Thank you very much.