## Preliminaries {-}
library(magrittr);
library(janitor); library(naniar)
library(rms)
library(mice)
# mice = multiple imputation through chained equations
library(broom); library(yardstick)
library(tableone)
library(pscl)
library(knitr)
library(tidyverse)
# library(cutpointr)
# library(OptimalCutpoints)
theme_set(theme_bw()) 
Prevalence of AIDS defining illness in HIV patients with Substance Abuse
- This retrospective cohort study using 2018 HCUP-NIS data (N=24,119) compared the burden of AIDS defining opportunistic infections (OIs) in HIV+ patients and length of associated hospital stay among those with substance use disorder (SUD) versus those without 
- Analyses were adjusted for demographics and hospital characteristics. Multiple imputation was used for missing covariate data. Burden of OIs was compared using logistic regression and length of stay was compared using Poisson regression models. 
- Contrary to the expectation that hospitalizations due to AIDS defining OIs would be higher in HIV+ patients with a SUD. Holding all other predictors constant, the model predicted that in PLWH, the odds of AIDS in is those with SUD is 0.726 (95% CI 0.668, 0.789) the odds of those without SUD (N=16,884) 
- Data cleaning and analyses were originally conducted using SAS 9.4 on a Linux/Unix based server. Analyses were later adjusted for covariates and R software was used 
- Presented at Case Western Reserve 
Background
It is estimated that almost have of people living with HIV in the US have a substance use disorder. This is a major public health concern because it has been found that people living with HIV and a comorbid substance use disorder (SUD) have lower retention to care and decreased medication adherence. Several previous studies have demonstrated that the incidence of opportunistic infections, a marker of disease progression, is higher in this population. There have been no recent studies that evaluate the relationship between a SUD and the presence of an opportunistic infection (i.e. AIDS defining illness) in PLWH. Furthermore, there has yet to be a study which evaluates this syndemic nationally.
Research Questions
- In people living with HIV (PLWH) in 2018, how does hospitalization due to an AIDS defining illnesses in those with a SUD compare to those without a SUD? 
- PLWH in 2018, how does the length of hospital stay for people with SUD compare to those without SUD? 
Data
Healthcare Cost and Utilization Project, Nationwide Inpatient Sample (HCUP-NIS) is the largest available all-payer inpatient healthcare administrative data set. It approximates a 20-percent stratified sample of all discharges from United States hospitals. It constitutes data from 48 states and 10,000 community hospitals, representing 95% of the United States population. Data from each record contains information regarding patient demographics, diagnoses, procedures, and other information associated with a hospital admission.
The data can be purchased by the public with at the following link: https://www.hcup-us.ahrq.gov/nisoverview.jsp
Strengths of how this data set relates to my research question:
- Nationally representative 
- Inpatient hospitalizations 
- Collects information on sociodemographic factors which I can adjust for 
- Collects information on up to 40 diagnoses, thus I can capture both the exposure and outcome 
- Collects information on length of stay 
Limitations of the data set:
- the data quality of secondary databases is not perfect as the diagnoses codes may not necessarily be accurate, granular, or complete - The HIV population tends to have many co-occurring conditions, and thus it is possible that not all SUD conditions were not recorded. Therefore, there may be some people in the unexposed group who should be in the exposed group
 
- The latest data available is from 2018. Drug therapy has dramatically changed with integrase inhibitors becoming first line and having drugs with longer half-lives and easier to adhere too. This is especially important for the SUD population who are less likely to be adherent. Thus, with with all of the advances in care, the 2018 analysis may not actually reflect 2021’s gaps in care. 
Preliminaries
Data Ingest
Below I am ingesting my data hiv_raw.
hiv_raw <- read.csv("hiv_oi.csv") %>% 
  clean_names()
hiv_raw <- hiv_raw %>% 
  haven::zap_label()  %>% 
  mutate(key_nis = as.character(key_nis))
dim(hiv_raw)[1] 24203    20Tidying, Data Cleaning and Data Management
Below I am cleaning the data according to the HCUP_NIS code book found at https://www.hcup-us.ahrq.gov/db/nation/nis/nisdde.jsp
In summary I have:
1.) converted all variables to factors except for age and key_nis
2.) coded the variable levels for factors with more descriptive names rather than numbers
3.) reordered according to frequency with fct_infreq
4.) selected only the variables that I will use
5.) converted los to a number
hiv <- hiv_raw %>% 
mutate(female=as.numeric(female)) %>% 
  mutate(sex = fct_recode(factor(female),
"male" = "0", "female" = "1"),
        race = fct_recode(factor(race),
                "White" = "1", 
                "Black" = "2", 
                "Hispanic"= "3",
                "Asian" = "4",
                "NativeA"= "5",
                "Other" = "6"),
    race = fct_infreq(race), 
    zipinc_qrtl = fct_recode(factor(zipinc_qrtl),
                            "<48K"= "1",
                            "48-61K" = "2",
                            "61-82K"= "3",
                            "82K+" = "4")) %>% 
mutate(pay1=as.numeric(pay1)) %>% 
  mutate(insurance  = fct_recode(factor(pay1),
                  "Medicare" = "1",
                  "Medicaid" = "2",
                  "Private" = "3",
                  "Self_pay" = "4",
                  "Other" = "5",
                  "Other" = "6"),
insurance = fct_infreq(insurance), 
    patient_loc =  fct_recode(factor(pl_nchs),
                  "Central" = "1",
                  "Fringe" = "2",
                  "metro>250K" = "3",
                  "metro>50K" = "4",
                  "micro" = "5",
                  "Other" = "6" ),
patient_loc = fct_infreq(patient_loc)) %>% 
  mutate(region = fct_recode(factor(hosp_division),
                  "Northeast" = "1",
                  "Northeast" = "2",
                  "Midwest" = "3",
                  "Midwest" = "4",
                  "South_Atlantic" = "5",
                  "South" = "6",
                  "South" = "7",
                  "West" = "8",
                  "West" = "9"),
        region = fct_infreq(region)) %>% 
  mutate(ED_record = fct_recode(factor(hcup_ed),
          "no" = "0",
          "yes" = "1", "yes" = "2", "yes" ="3", "yes"="4")) %>% 
  mutate(subst_abuse=fct_recode(factor(sa),
         "yes" = "1",
         "no"= "0"),
         subst_abuse = fct_relevel(subst_abuse, "no")) %>% 
  mutate(los=as.numeric(los)) %>% 
  
 mutate(AIDS_f = fct_recode(factor(oi),
                           "yes"= "1",
                           "no" ="0")) %>% 
  mutate(AIDS_f=fct_relevel(AIDS_f, "no")) %>% 
  rename(AIDS= oi) %>% 
  select(key_nis, subst_abuse, AIDS, AIDS_f, los, age, sex, race, region, zipinc_qrtl, insurance, patient_loc, ED_record) Below we can see that we have all of the variables in the form that they should be in:
1.) character: key_nis
2.) factor: subst_abuse, AIDS, sex, race, region, zipinc_qrtl, insurance, patient_loc, ED_record
3.) numeric: oi (will be an indicator in logistic regression), los,
head(hiv)   key_nis subst_abuse AIDS AIDS_f los age    sex     race    region
1 10317955         yes    0     no   2  68   male    White Northeast
2 10160121         yes    0     no   8  31 female    White Northeast
3 10137460          no    0     no   3  50   male    White Northeast
4 10324914          no    0     no   5  50   male    White Northeast
5 10029616         yes    0     no   2  41 female Hispanic Northeast
6 10204914          no    0     no   1  46 female    White Northeast
  zipinc_qrtl insurance patient_loc ED_record
1        <48K  Medicare       Other        no
2      61-82K  Medicaid  metro>250K       yes
3        <48K  Medicare       Other        no
4        <48K  Medicare       Other        no
5        <48K  Medicaid  metro>250K       yes
6        <48K  Medicare  metro>250K       yesEligibiity
I am only going to include adults in this study (age>19)
hiv <- hiv %>% 
  filter(age>19)Checking the variables
Categorical Variables
Here I am making sure that all of the factor levels work out (subst_abuse, AIDS, sex, race, region, zipinc_qrtl, insurance, patient_loc,ED_record)
hiv %>% count(subst_abuse)  subst_abuse     n
1          no 12329
2         yes 11791The two categories of subst_abuse look good.
hiv %>% count(AIDS_f)  AIDS_f     n
1     no 19720
2    yes  4400The two categories of AIDS_f look good
hiv %>% count(sex)     sex     n
1   male 16847
2 female  7266
3   <NA>     7The two categories of sex look good
hiv %>% count(race)      race     n
1    Black 12202
2    White  6812
3 Hispanic  3633
4    Other   848
5    Asian   224
6  NativeA   117
7     <NA>   284The 6 categories of race look good
hiv %>% count(region)          region    n
1 South_Atlantic 7337
2      Northeast 5641
3          South 4282
4           West 4007
5        Midwest 2853The 5 categories of region look good
The 4 categories of zipinc_qrtl look good
hiv %>% count(insurance)  insurance    n
1  Medicaid 9511
2  Medicare 8490
3   Private 3654
4  Self_pay 1753
5     Other  687
6      <NA>   25The 5 categories of insurance look good
hiv %>% count(ED_record)  ED_record     n
1        no  4791
2       yes 19329The 2 categories of ED_record look good
hiv %>% count(patient_loc)  patient_loc     n
1     Central 12920
2      Fringe  4043
3  metro>250K  3914
4   metro>50K  1276
5       micro   912
6       Other   535
7        <NA>   5206 categories for patient_loc look good
Quantitative variables
I have two quantitative variables: length of stay (los) and age
age
Below are numeric and plotted summaries of the age distribution.
mosaic::favstats(~age, data=hiv)Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2 min Q1 median Q3 max     mean       sd     n missing
  20 40     51 58  90 49.48238 12.72655 24120       0ggstatsplot::gghistostats(
  data = hiv,
  x = age,
  type = "np",
  xlab = "age",
  bar.fill = "lightblue")
It looks like the age range is fine.
length of stay
Our length of stays range from 0 to 294. Those all look plausible.
mosaic::favstats(~los, data=hiv) min Q1 median Q3 max     mean       sd     n missing
   0  2      4  8 294 7.027739 9.720441 24118       2ggstatsplot::gghistostats(
  data = hiv,
  x = los,
  type = "np",
  xlab = "length of stay",
  bar.fill = "lightblue")
I just want to say how awesome it is that we are seeing PLWH living into their 90’s. Gives me the chills. This is a miracle that really shows how far we have come in treating this disease (unimaginable 30 years ago)
Missingness
I have 1685 missing observations in the hiv data set.
Below we can see that we have the most missingness for zipinc_qrtl (3.5%), patient_loc (2.2%), and race (1.2%)
gg_miss_var(hiv) 
This means that if I do multiple imputation, I should do at least 4 iterations.
miss_var_summary(hiv) # A tibble: 13 × 3
   variable    n_miss pct_miss
   <chr>        <int>    <dbl>
 1 zipinc_qrtl    847  3.51   
 2 patient_loc    520  2.16   
 3 race           284  1.18   
 4 insurance       25  0.104  
 5 sex              7  0.0290 
 6 los              2  0.00829
 7 key_nis          0  0      
 8 subst_abuse      0  0      
 9 AIDS             0  0      
10 AIDS_f           0  0      
11 age              0  0      
12 region           0  0      
13 ED_record        0  0      About 95% of the cases aren’t missing any data.
miss_case_table(hiv)# A tibble: 4 × 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0   22971   95.2   
2              1     619    2.57  
3              2     524    2.17  
4              3       6    0.0249Removing observations with missing outcome
There were 2 people missing data on the outcome, los. I will remove them.
hiv <- hiv %>% filter(complete.cases(los))Missingness mechanism
I am assuming that the data are missing at random. I had originally thought it was MNAR, but the issue here is why the data are missing. Although my prediction will not be perfect, the covariates I will use to predict the missing values may do a reasonable job.
Tidied Tibble
Our tibble hiv contains 24118 rows (patients) and 13 columns (variables). Each variable is contained in a column, and each row represents a single key_nis. All variables now have appropriate types.
head(hiv) %>% kable()| key_nis | subst_abuse | AIDS | AIDS_f | los | age | sex | race | region | zipinc_qrtl | insurance | patient_loc | ED_record | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10317955 | yes | 0 | no | 2 | 68 | male | White | Northeast | <48K | Medicare | Other | no | 
| 10160121 | yes | 0 | no | 8 | 31 | female | White | Northeast | 61-82K | Medicaid | metro>250K | yes | 
| 10137460 | no | 0 | no | 3 | 50 | male | White | Northeast | <48K | Medicare | Other | no | 
| 10324914 | no | 0 | no | 5 | 50 | male | White | Northeast | <48K | Medicare | Other | no | 
| 10029616 | yes | 0 | no | 2 | 41 | female | Hispanic | Northeast | <48K | Medicaid | metro>250K | yes | 
| 10204914 | no | 0 | no | 1 | 46 | female | White | Northeast | <48K | Medicare | metro>250K | yes | 
Code Book and Clean Data Summary
1. Sample Size The data in our complete hiv sample consist of 24118 subjects from HCUP-NIS with a diagnosis of HIV and between the ages of 20 and 90 in whom our outcome variable was measured.
2. Missingness Of the 24118 subjects, 22971 have complete data on all variables listed below.
3. Our outcome variables are los and AIDS.
a. los is the number of days that the patient was hospitalized for. HCUP-NIS calculated it by subtracting the admission date from the discharge date
b. AIDS is if the person had a diagnosis for an opportunistic infection, which were AIDS defining.
NOTE the definition of AIDS is either (1) CD4 <200 OR presence of an AIDS defining opportunistic infection. According to the CDC, AIDS defining illnesses include candidiasis of the esophagus (B37.81), bronchi , trachea, or lungs (B371); invasive cervical cancer (C53); coccidiomycosis (B38); cryptococcosis (B45); cryptosporidiosis(A07.2); cytomegalovirus disease or CMV(B25); histoplasmosis (B39); isosporiasis (A07.3); Kaposi sarcoma (C46); Burkitt’s, immunoblastic, Hodgkin’s, and Non- Hodgkin’s lymphoma (Burkitt’s, immunoblastic); mycobacterium avium complex (A31.2,A31.8); mycobacterium tuberculosis (A15); pneumocystis pneumonia (B59); recurrent pneumonia (Z87.01); progressive multifocal leukoencephalopathy (A81.2), salmonella septicemia (A02.1) and toxoplasmosis of brain (B58.2)
4. All other variables listed below will serve as candidate predictors for our models.
5. The other variable contained in my tidy tibble is key_nis which is the key_nis identifying code.
| Variable | Type | Description | 
|---|---|---|
| key_nis | character | key_nis identifier | 
| subst_abuse | binary | main predictor:whether or not somebody has of substance use disorder. Patients were classified as having a history of SUD if they had an ICD-10 code for abuse of alcohol (F10), opioids, sedatives, hypnotics, anxiolytics (F13), cocaine (F14), other stimulants (F15), hallucinogens (F16), inhalants (F18), or other psychoactive substances/multiple drug use (F19) (yes/no) | 
| age | quant | age in years | 
| sex | binary | male, female | 
| race | 5-cat | Black, White, Hispanic, Other, Asian, Native American | 
| region | 5-cat | Northeast, Midwest, South, South_Atlantic, West | 
| zipinc_qrtl | 4-cat | Median household income for patient’s ZIP Code (based on current year). Values include <48K, 48-61K, 61-82K, 82K+. | 
| insurance | 5-cat | expected primary payer (Medicare, Medicaid, private insurance, self pay, other) | 
| patient_loc | 6-cat | Patient location (“Central” counties of metro areas of >=1 million population, “Fringe” counties of metro areas of >=1 million population, Counties in metro areas of 250,000-999,999 population, Counties in metro areas of 50,000-249,999 population, Micropolitan counties, Not metropolitan or micropolitan counties) | 
| ED_record | binary | records that have evidence of emergency department (ED) services reported on the HCUP record (yes/no) | 
age | quant | age in years.
sex | binary | male, female.
race | 5-cat | Black, White, Hispanic, Other, Asian, Native American
Data Distribution Description
I have used the html function applied to an Hmisc::describe() to provide a clean data summary
hiv %>% 
    select(-key_nis) %>%
    Hmisc::describe() %>% Hmisc::html()12 Variables 24118 Observations
subst_abuse
| n | missing | distinct | 
|---|---|---|
| 24118 | 0 | 2 | 
Value no yes Frequency 12327 11791 Proportion 0.511 0.489
AIDS
| n | missing | distinct | Info | Sum | Mean | Gmd | 
|---|---|---|---|---|---|---|
| 24118 | 0 | 2 | 0.447 | 4399 | 0.1824 | 0.2983 | 
AIDS_f
| n | missing | distinct | 
|---|---|---|
| 24118 | 0 | 2 | 
Value no yes Frequency 19719 4399 Proportion 0.818 0.182
los
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24118 | 0 | 122 | 0.99 | 7.028 | 7.125 | 1 | 1 | 2 | 4 | 8 | 15 | 21 | 
age
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24118 | 0 | 71 | 0.999 | 49.48 | 14.45 | 27 | 31 | 40 | 51 | 58 | 65 | 69 | 
sex
| n | missing | distinct | 
|---|---|---|
| 24111 | 7 | 2 | 
Value male female Frequency 16846 7265 Proportion 0.699 0.301
race
| n | missing | distinct | 
|---|---|---|
| 23834 | 284 | 6 | 
Value Black White Hispanic Other Asian NativeA Frequency 12202 6812 3632 847 224 117 Proportion 0.512 0.286 0.152 0.036 0.009 0.005
region
| n | missing | distinct | 
|---|---|---|
| 24118 | 0 | 5 | 
| lowest : | South_Atlantic | Northeast | South | West | Midwest | 
| highest: | South_Atlantic | Northeast | South | West | Midwest | 
 Value      South_Atlantic      Northeast          South           West
 Frequency            7337           5640           4282           4006
 Proportion          0.304          0.234          0.178          0.166
                          
 Value             Midwest
 Frequency            2853
 Proportion          0.118
 
 zipinc_qrtl
| n | missing | distinct | 
|---|---|---|
| 23272 | 846 | 4 | 
Value <48K 48-61K 61-82K 82K+ Frequency 11397 5408 3840 2627 Proportion 0.490 0.232 0.165 0.113
insurance
| n | missing | distinct | 
|---|---|---|
| 24093 | 25 | 5 | 
Value Medicaid Medicare Private Self_pay Other Frequency 9509 8490 3654 1753 687 Proportion 0.395 0.352 0.152 0.073 0.029
patient_loc
| n | missing | distinct | 
|---|---|---|
| 23599 | 519 | 6 | 
| lowest : | Central | Fringe | metro>250K | metro>50K | micro | 
| highest: | Fringe | metro>250K | metro>50K | micro | Other | 
Value Central Fringe metro>250K metro>50K micro Other Frequency 12919 4043 3914 1276 912 535 Proportion 0.547 0.171 0.166 0.054 0.039 0.023
ED_record
| n | missing | distinct | 
|---|---|---|
| 24118 | 0 | 2 | 
Value no yes Frequency 4790 19328 Proportion 0.199 0.801
Analysis 1 AIDS defining Illness (Logistic Regression)
Plans
Binary Outcome
- My binary outcome is AIDS
- There are no missing cases on this outcome
hiv %>% select(AIDS) %>% miss_case_table() %>% kable()| n_miss_in_case | n_cases | pct_cases | 
|---|---|---|
| 0 | 24118 | 100 | 
Planned Predictors
1. subst_abuse
- main predictor 
- binary 
- 0% missing 
2. zipinc_qrtl
- 4 category 
- 3.5% missing 
3. patient_loc
- 6 categories 
- 2.15% missing 
4. race
- 5 categories 
- 1.2% missing 
5. insurance
- 5 categories 
- 0.10% missing 
6. sex
- binary 
- 0.03% missing 
7. age
- continuous 
- 0% missing 
8. region
- 5 categories 
- 0% missing 
Describing the data (visualization)
Two by two table
I have made a two by two table
- The exposure is subst_abuse, which is represented in the rows of my table below. It is a factor that has two levels: yes and no, which represents patients who have a substance use disorder (abuse of alcohol, opioids, sedatives, hypnotics, anxiolytics, cocaine, other stimulants, hallucinogens, inhalants, or other psychoactive substances/multiple drug use).
- The outcome is AIDS_f, which is represented in the columns of my table below. It is a factor that has two levels: yes and no, which represents patients who have an opportunistic infection (AIDS defining illness)
- according to the CDC is an AIDS defining illness (candidiasis of the esophagus (B37.81), bronchi , trachea, or lungs (B371); invasive cervical cancer (C53); coccidiomycosis (B38); cryptococcosis (B45); cryptosporidiosis(A07.2); cytomegalovirus disease or CMV(B25); histoplasmosis (B39); isosporiasis (A07.3); Kaposi sarcoma (C46); Burkitt’s, immunoblastic, Hodgkin’s, and Non- Hodgkin’s lymphoma (Burkitt’s, immunoblastic); mycobacterium avium complex (A31.2, A31.8); mycobacterium tuberculosis (A15); pneumocystis pneumonia (B59); recurrent pneumonia (Z87.01); progressive multifocal leukoencephalopathy (A81.2), salmonella septicemia (A02.1) and toxoplasmosis of brain (B58.2))
hiv_reordered <- hiv %>% 
mutate(subst_abuse=fct_relevel(subst_abuse, "yes" )) %>% 
  mutate(AIDS_f=fct_relevel(AIDS_f, "yes" ))
tableE <- hiv_reordered %>%  
 tabyl (subst_abuse, AIDS_f) 
tableE %>% 
  adorn_totals(where = c("row", "col")) %>% 
  adorn_percentages(denom = "row") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns(position = "front") %>% knitr::kable()| subst_abuse | yes | no | Total | 
|---|---|---|---|
| yes | 1929 (16.4%) | 9862 (83.6%) | 11791 (100.0%) | 
| no | 2470 (20.0%) | 9857 (80.0%) | 12327 (100.0%) | 
| Total | 4399 (18.2%) | 19719 (81.8%) | 24118 (100.0%) | 
Among those who had a substance abuse disorder, 16.4%% (1929 /11791) had an AIDS defining illness. In those who did not have a substance use disorder, 20.0%% (2471/12329) had an and AIDS defining illness
This lower prevalence of AIDS defining illness in people with SUD is surprising given previous literature. After adjustment, we will see if this relationship still holds or if it is meaningfully lower. I’m excited to see what happens!
Table one
This table has the covariates that I will be adjusting for as I explore the relationship between subst_abuse and AIDS.
vars <- c("region", "age", "sex", "insurance", "race", "patient_loc", "zipinc_qrtl")
factorvars <- c("region", "sex", "insurance", "race", "patient_loc", "zipinc_qrtl")
trt <- c("subst_abuse")
table01 <- CreateTableOne(data = hiv_reordered,  
                       vars = vars, 
                       factorVars= factorvars,
                       strata = trt)
print(table01, verbose=TRUE)                    Stratified by subst_abuse
                    yes           no            p      test
  n                 11791         12327                    
  region (%)                                    <0.001     
     South_Atlantic  3392 (28.8)   3945 (32.0)             
     Northeast       2944 (25.0)   2696 (21.9)             
     South           1956 (16.6)   2326 (18.9)             
     West            2001 (17.0)   2005 (16.3)             
     Midwest         1498 (12.7)   1355 (11.0)             
  age (mean (SD))   48.22 (11.58) 50.69 (13.62) <0.001     
  sex = female (%)   3413 (29.0)   3852 (31.3)  <0.001     
  insurance (%)                                 <0.001     
     Medicaid        5773 (49.0)   3736 (30.3)             
     Medicare        3595 (30.5)   4895 (39.7)             
     Private         1185 (10.1)   2469 (20.0)             
     Self_pay         892 ( 7.6)    861 ( 7.0)             
     Other            330 ( 2.8)    357 ( 2.9)             
  race (%)                                      <0.001     
     Black           5990 (51.4)   6212 (51.0)             
     White           3534 (30.3)   3278 (26.9)             
     Hispanic        1590 (13.6)   2042 (16.8)             
     Other            402 ( 3.4)    445 ( 3.7)             
     Asian             65 ( 0.6)    159 ( 1.3)             
     NativeA           73 ( 0.6)     44 ( 0.4)             
  patient_loc (%)                               <0.001     
     Central         6453 (56.7)   6466 (52.9)             
     Fringe          1642 (14.4)   2401 (19.7)             
     metro>250K      1990 (17.5)   1924 (15.8)             
     metro>50K        604 ( 5.3)    672 ( 5.5)             
     micro            440 ( 3.9)    472 ( 3.9)             
     Other            258 ( 2.3)    277 ( 2.3)             
  zipinc_qrtl (%)                               <0.001     
     <48K            6028 (53.7)   5369 (44.6)             
     48-61K          2499 (22.3)   2909 (24.2)             
     61-82K          1669 (14.9)   2171 (18.0)             
     82K+            1032 ( 9.2)   1595 (13.2)             Caption: Baseline characteristics for 24,120 hospitalized HIV patients in 2018 (HCUP-NIS data) in those with substance use disorder versus those without. NOTE: percent of missing values: zipinc_qrtl (3.5%), patient_loc (2.15%), race (1.2%), insurance (0.10%), sex (0.03%), age (0%), region(0%)
We can see that there is definitely an imbalance between some of these baseline characteristics. The most meaningful differences appear in:
- median income based on zip for the <48K group
- more medicaid insurance in the SUD group
- the people in the SUD group are slightly younger
These demographic factors have been adjusted for in other papers which have shown AIDS defining illness to be higher in those with SUD. So we will see what happens!
Splitting the data
- I will start by splitting the data into separate training and testing samples
- I have used the option strata to ensure that the same percentage of people in both samples have the outcome of an AIDS defining illness
hiv1 <- hiv %>% select(subst_abuse, AIDS, region, age, sex, insurance, race, patient_loc, zipinc_qrtl )
set.seed(30)
hiv_split1 <- rsample::initial_split(hiv1, prop = 0.7,
strata = AIDS)
hiv_train1 <- rsample::training(hiv_split1) 
hiv_test1 <- rsample::testing(hiv_split1)
dim(hiv_test1)[1] 7236    9Below we can see that 18.2% of people in each group had the outcome of an AIDS defining illness
hiv_train1 %>% tabyl(AIDS) %>% kable(dig=3)| AIDS | n | percent | 
|---|---|---|
| 0 | 13803 | 0.818 | 
| 1 | 3079 | 0.182 | 
hiv_test1 %>% tabyl(AIDS) %>% kable(dig=3)| AIDS | n | percent | 
|---|---|---|
| 0 | 5916 | 0.818 | 
| 1 | 1320 | 0.182 | 
Multiple Imputation
training sample multiple imputation
I will impute values for the predictors zipinc_qrtl, patient_loc, race, and insurance
Since the highest percent of missingness I have is 3.5% (zipinc_qrtl), I will do 4 imputed data sets (m=4)
hiv1_mice<- mice(hiv_train1, m = 4, printFlag = F)Below is a summary of the multiple imputation process
summary(hiv1_mice)Class: mids
Number of multiple imputations:  4 
Imputation methods:
subst_abuse        AIDS      region         age         sex   insurance 
         ""          ""          ""          ""    "logreg"   "polyreg" 
       race patient_loc zipinc_qrtl 
  "polyreg"   "polyreg"   "polyreg" 
PredictorMatrix:
            subst_abuse AIDS region age sex insurance race patient_loc
subst_abuse           0    1      1   1   1         1    1           1
AIDS                  1    0      1   1   1         1    1           1
region                1    1      0   1   1         1    1           1
age                   1    1      1   0   1         1    1           1
sex                   1    1      1   1   0         1    1           1
insurance             1    1      1   1   1         0    1           1
            zipinc_qrtl
subst_abuse           1
AIDS                  1
region                1
age                   1
sex                   1
insurance             1we can see that I did 4 imputations, which variables had missing, and how those variables were imputed.
testing sample single imputatoin
I am using mice, but just pulling out one imputed data set which I will call imp_test. I will use this when I am validating.
hivtest_mice<- mice(hiv_test1, m = 1, printFlag = F)
imp_test <- complete(hivtest_mice, 1) %>% tibble() 
dim(imp_test)[1] 7236    9Below I am just checking to make sure that I have no more missing
n_miss(imp_test)[1] 0no more missing!
Model 1
Fitting Model 1
- mod1 predicts the log odds of AIDS using the predictors subst_abuse, region, age, sex, insurance, race, patient_loc, zipinc_qrtl
- I chose these predictors based on previous literature and reasoning
glm model with multiple imputation
First I am running mod1 on each of the 4 imputed data sets
m1_mods <- with(hiv1_mice, 
                glm(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl,
                    family = binomial))lrm & glm with single imputation
Because lrm does not work with mice, I will build an lrm model from one of the 4 imputation sets
(lrm requires areg_impute)
The code below stores the 4th imputed data set in imp_4
imp_4 <- complete(hiv1_mice, 4) %>% tibble() 
dim(imp_4)[1] 16882     9zz <- datadist(imp_4) 
options(datadist = "zz")
mod1_lrm <- lrm(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)Also, to use tools like augment for the confusion matrix, I will also need to build a glm model with a single imputed data set.
mod1_glm <- glm(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl, data = imp_4)tidied table of regression coefficients
Below is a table of the exponentiated coefficients so that they can be interpreted as odds ratios.
NOTE: I have first pooled the results from the 4 analyses with the 4 imputed data sets (`m1_pool`)
m1_pool <- pool(m1_mods)
sum1 <- summary(m1_pool, exponentiate = TRUE, 
        conf.int = TRUE, conf.level = 0.95) %>% 
    select(-df) 
sum1 %>% kable(digits = c(3, 3, 2, 2, 3, 3))| term | estimate | std.error | statistic | p.value | 2.5 % | 97.5 % | 
|---|---|---|---|---|---|---|
| (Intercept) | 1.136 | 0.10 | 1.28 | 0.201 | 0.934 | 1.381 | 
| subst_abuseyes | 0.718 | 0.04 | -7.77 | 0.000 | 0.661 | 0.781 | 
| regionNortheast | 0.758 | 0.06 | -4.35 | 0.000 | 0.669 | 0.859 | 
| regionSouth | 1.218 | 0.06 | 3.27 | 0.001 | 1.082 | 1.371 | 
| regionWest | 1.300 | 0.07 | 3.93 | 0.000 | 1.141 | 1.481 | 
| regionMidwest | 0.947 | 0.07 | -0.76 | 0.447 | 0.822 | 1.090 | 
| age | 0.971 | 0.00 | -16.50 | 0.000 | 0.968 | 0.975 | 
| sexfemale | 0.870 | 0.05 | -2.99 | 0.003 | 0.794 | 0.953 | 
| insuranceMedicare | 0.670 | 0.05 | -7.46 | 0.000 | 0.603 | 0.745 | 
| insurancePrivate | 0.927 | 0.06 | -1.24 | 0.215 | 0.822 | 1.045 | 
| insuranceSelf_pay | 1.068 | 0.08 | 0.86 | 0.387 | 0.920 | 1.241 | 
| insuranceOther | 0.999 | 0.12 | -0.01 | 0.991 | 0.794 | 1.256 | 
| raceWhite | 0.955 | 0.05 | -0.87 | 0.386 | 0.861 | 1.060 | 
| raceHispanic | 1.180 | 0.06 | 2.76 | 0.006 | 1.049 | 1.328 | 
| raceOther | 1.195 | 0.11 | 1.62 | 0.105 | 0.964 | 1.483 | 
| raceAsian | 1.508 | 0.19 | 2.13 | 0.033 | 1.034 | 2.199 | 
| raceNativeA | 1.149 | 0.29 | 0.48 | 0.635 | 0.648 | 2.038 | 
| patient_locFringe | 1.090 | 0.06 | 1.40 | 0.160 | 0.966 | 1.229 | 
| patient_locmetro>250K | 0.995 | 0.06 | -0.08 | 0.932 | 0.882 | 1.122 | 
| patient_locmetro>50K | 0.968 | 0.10 | -0.34 | 0.735 | 0.802 | 1.168 | 
| patient_locmicro | 0.890 | 0.12 | -1.01 | 0.312 | 0.710 | 1.115 | 
| patient_locOther | 1.035 | 0.14 | 0.24 | 0.813 | 0.781 | 1.371 | 
| zipinc_qrtl48-61K | 1.046 | 0.05 | 0.84 | 0.400 | 0.941 | 1.163 | 
| zipinc_qrtl61-82K | 1.053 | 0.06 | 0.85 | 0.398 | 0.934 | 1.188 | 
| zipinc_qrtl82K+ | 1.028 | 0.07 | 0.37 | 0.708 | 0.888 | 1.192 | 
Among hospitalized PLWH (N=16884), after adjusting for region, age,sex, insurance, race, patient_loc, zipinc_qrtl,mod1 predicts that the odds of having and AIDS defining illness in those with SUD is 0.726 (95% CI 0.668, 0.789) times those without SUD
- given that the 95% CI is entirely below 1, the model suggests that having a SUD is associated with a lower odds of an AIDS defining illness
key fit summary statistics
Below are the key fit summary statistics like the Nagelkerke R-square and the area under the ROC curve as they are presented in the `lrm` output
The r2 is very low (0.066) as well as the C statistic (0.650)
mod1_lrmLogistic Regression Model
 
 lrm(formula = AIDS ~ subst_abuse + region + age + sex + insurance + 
     race + patient_loc + zipinc_qrtl, data = imp_4, x = TRUE, 
     y = TRUE)
 
                        Model Likelihood        Discrimination    Rank Discrim.    
                              Ratio Test               Indexes          Indexes    
 Obs         16882    LR chi2     712.92        R2       0.067    C       0.650    
  0          13803    d.f.            24     R2(24,16882)0.040    Dxy     0.300    
  1           3079    Pr(> chi2) <0.0001    R2(24,7552.3)0.087    gamma   0.300    
 max |deriv| 6e-10                              Brier    0.143    tau-a   0.089    
 
                        Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept               0.1255 0.0995   1.26 0.2069  
 subst_abuse=yes        -0.3308 0.0426  -7.76 <0.0001 
 region=Northeast       -0.2742 0.0637  -4.30 <0.0001 
 region=South            0.2001 0.0603   3.32 0.0009  
 region=West             0.2679 0.0664   4.04 <0.0001 
 region=Midwest         -0.0524 0.0720  -0.73 0.4672  
 age                    -0.0289 0.0018 -16.48 <0.0001 
 sex=female             -0.1400 0.0466  -3.01 0.0027  
 insurance=Medicare     -0.4020 0.0536  -7.49 <0.0001 
 insurance=Private      -0.0706 0.0611  -1.16 0.2476  
 insurance=Self_pay      0.0663 0.0763   0.87 0.3848  
 insurance=Other         0.0019 0.1169   0.02 0.9870  
 race=White             -0.0486 0.0528  -0.92 0.3574  
 race=Hispanic           0.1631 0.0596   2.74 0.0062  
 race=Other              0.1671 0.1092   1.53 0.1259  
 race=Asian              0.3896 0.1923   2.03 0.0428  
 race=NativeA            0.0863 0.2867   0.30 0.7635  
 patient_loc=Fringe      0.0994 0.0595   1.67 0.0945  
 patient_loc=metro>250K -0.0055 0.0589  -0.09 0.9259  
 patient_loc=metro>50K  -0.0383 0.0959  -0.40 0.6897  
 patient_loc=micro      -0.1119 0.1148  -0.98 0.3295  
 patient_loc=Other       0.0249 0.1412   0.18 0.8601  
 zipinc_qrtl=48-61K      0.0400 0.0524   0.76 0.4455  
 zipinc_qrtl=61-82K      0.0531 0.0611   0.87 0.3849  
 zipinc_qrtl=82K+        0.0125 0.0722   0.17 0.8627  
 Confusion Matrix
Below is the code to augment `mod1_glm` in order to get the predicted values (still within the training sample)
hiv1_aug <- augment(mod1_glm, imp_4, type.predict = "response")I have plotted `mod1_glm` fits by observed `AIDS` status.
ggplot(hiv1_aug, aes(x = factor(AIDS), y = .fitted, col = factor(AIDS))) + geom_boxplot() +
geom_jitter(width = 0.1) + guides(col = FALSE) +
  labs(title = "mod1 fits by observed AIDS status (n=16,884)", footnote= "Highest predicted value <0.5", x= "mod1 fitted probabilities")Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
Overall, the predicted probabilities is higher for those who actually had an AIDS defining illness. It is important to note that our highest predicted probability does not reach 0.5, so we cannot use that as our cutoff for the confusion matrix. I must make the cutoff something lower.
Below is the confusion matrix (`caret `package). Rather than setting the cutoff at 0.5, I set it at 0.27 after evaluating the plot above.
cmatrix <- hiv1_aug %$%
  caret::confusionMatrix(
    data = factor(.fitted >= 0.27), 
    reference = factor(AIDS == 1), 
    positive = "TRUE"
  ) 
cmatrix Confusion Matrix and Statistics
          Reference
Prediction FALSE  TRUE
     FALSE 12159  2270
     TRUE   1644   809
                                          
               Accuracy : 0.7682          
                 95% CI : (0.7617, 0.7745)
    No Information Rate : 0.8176          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.156           
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.26275         
            Specificity : 0.88090         
         Pos Pred Value : 0.32980         
         Neg Pred Value : 0.84268         
             Prevalence : 0.18238         
         Detection Rate : 0.04792         
   Detection Prevalence : 0.14530         
      Balanced Accuracy : 0.57182         
                                          
       'Positive' Class : TRUE            
                                          Key results of the confusion matrix include:
- sensitivity: 0.25
- specificity: 0.88
- positive predictive value: 0.32
Nonlinearity
Below is the Spearman rho squared plot to evaluate the predictive punch of each of my variables in `mod1`
spear_mod1 <- spearman2(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl, data = imp_4)
plot(spear_mod1)
The Spearman rho-squared plot suggests that I have the most predictive punch with age and insurance. I will:
(1) create an interaction between age and insurance
(2) add a restricted cubic spline for age (4 knots)
mod1 comparison to nonlinear models
**modifications to mod1**
- mod1b:
- restricted cubic spline of 4 knots with - age
- interaction between - ageand- insurance
- mod1c
- restricted cubic spline of 4 knots with age
- `mod1d `
- interaction between ageandinsurance
I am fitting these models with both glm and lrm as seen below:
mod1b_lrm <- lrm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)
mod1c_lrm <- lrm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)
mod1d_lrm <- lrm(AIDS ~ subst_abuse + region + age + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)mod1b_glm <- glm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4)
mod1c_glm <- glm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance  + sex + race + patient_loc + zipinc_qrtl, data = imp_4)
mod1d_glm <- glm(AIDS ~ subst_abuse + region + age + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4)Summary Statistics for in sample fit
The AIC & BIC are the best for mod1c
bind_rows(glance(mod1_glm), glance(mod1b_glm), glance(mod1c_glm), glance(mod1d_glm)) %>% 
  mutate(model = c("1", "1b", "1c", "1d")) %>%
select(model, nobs, deviance, df.residual, AIC, BIC) %>%
  kable(digits = 0)| model | nobs | deviance | df.residual | AIC | BIC | 
|---|---|---|---|---|---|
| 1 | 16882 | 2412 | 16857 | 15115 | 15316 | 
| 1b | 16882 | 2410 | 16851 | 15109 | 15357 | 
| 1c | 16882 | 2410 | 16855 | 15104 | 15321 | 
| 1d | 16882 | 2412 | 16853 | 15118 | 15350 | 
Comparison of Models With ANOVA
Since `mod1` is a subset of `mod1b`, `mod1c`, and `mod1d`, I can compare these models with ANOVA tests.
mod1 vs mod1b
anova(mod1_glm, mod1b_glm)Analysis of Deviance Table
Model 1: AIDS ~ subst_abuse + region + age + sex + insurance + race + 
    patient_loc + zipinc_qrtl
Model 2: AIDS ~ subst_abuse + region + rcs(age, 4) + insurance + age %ia% 
    insurance + sex + race + patient_loc + zipinc_qrtl
  Resid. Df Resid. Dev Df Deviance
1     16857     2412.4            
2     16851     2409.9  6   2.5186The addition of a restricted cubic spline with 4 knots for age and an interaction between age and insurance reduces the lack of fit by 4.1426 points, at a cost of 6 degrees of freedom. Thus, the fuller model may not be an improvement.
mod1 vs mod1c
anova(mod1_glm, mod1c_glm)Analysis of Deviance Table
Model 1: AIDS ~ subst_abuse + region + age + sex + insurance + race + 
    patient_loc + zipinc_qrtl
Model 2: AIDS ~ subst_abuse + region + rcs(age, 4) + insurance + sex + 
    race + patient_loc + zipinc_qrtl
  Resid. Df Resid. Dev Df Deviance
1     16857     2412.4            
2     16855     2410.3  2   2.1222The addition of a restricted cubic spline with 4 knots for age and reduced the lack of fit by 3.7793 points, at a cost of 2 degrees of freedom. Thus, this nonlinear model may not be an improvement.
mod1 vs mod1d
anova(mod1_glm, mod1d_glm)Analysis of Deviance Table
Model 1: AIDS ~ subst_abuse + region + age + sex + insurance + race + 
    patient_loc + zipinc_qrtl
Model 2: AIDS ~ subst_abuse + region + age + insurance + age %ia% insurance + 
    sex + race + patient_loc + zipinc_qrtl
  Resid. Df Resid. Dev Df Deviance
1     16857     2412.4            
2     16853     2411.8  4  0.66891The addition of an interaction between `age` and insurance reduced the lack of fit by 0.85066 points, at a cost of 2 degrees of freedom. Thus, the linear model may not be an improvement.
Comparing Validated Nagelkerke R-square and C statistic
I used the validate command on each of the models, which provided me with the following results:
- R2: higher better 
- brier score: *lower* = better (calibration) 
- C statistic: higher=better 
Note: the validated results (table 2) holds more weight when choosing models because it predicts how the model would perform out of sample.
Table 1: Index Fit Statistics Comparing `mod1` to Nonlinear Models
| Index Summary | mod1 | mod1b | mod1c | mod1d | 
|---|---|---|---|---|
| Index Nagelkerke R2 | 0.066 | 0.069 | 0.069 | 0.066 | 
| Index Brier Score | 0.143 | 0.143 | 0.143 | 0.143 | 
| Index C | 0.64955 | 0.6521 | 0.6519 | 0.6499 | 
- mod1b, mod1c, and mod1d have equal index rsquared (slightly better than mod1 though)
- Brier score not useful (all equal)
- mod1 has the best C statistic, but negligible
Table 2: Validated Fit Statistics Comparing `mod1` to Nonlinear Models
| Corrected Summary | mod1 | mod1b | mod1c | mod1d | 
|---|---|---|---|---|
| Corrected Nagelkerke R2 | 0.0614 | 0.0630 | 0.0634 | 0.0623 | 
| Corrected Brier Score | 0.143 | 0.143 | 0.143 | 0.143 | 
| Corrected C | 0.64495 | 0.64625 | 0.64635 | 0.6458 | 
- Corrected Nagelkerke R2: mod1c
- Corrected Brier Score: all equal
- Corrected C mod1c (but mod1b, mod1c, and mod1d are equal to 3 decimal points)
Overall winner: mod1c
- although mod1cis negligibly better, its rsquared is very close to 0 and its C statistic shows that the model does not perform much better than guessing. Thus, this is an extremely weak model.
The results shown in table 1 and table 2 were obtained from:
validate(mod1_lrm)          index.orig training    test optimism index.corrected  n
Dxy           0.3000   0.3025  0.2956   0.0069          0.2931 40
R2            0.0674   0.0688  0.0655   0.0033          0.0641 40
Intercept     0.0000   0.0000 -0.0371   0.0371         -0.0371 40
Slope         1.0000   1.0000  0.9755   0.0245          0.9755 40
Emax          0.0000   0.0000  0.0123   0.0123          0.0123 40
D             0.0422   0.0431  0.0409   0.0021          0.0400 40
U            -0.0001  -0.0001  0.0000  -0.0001          0.0000 40
Q             0.0423   0.0432  0.0409   0.0023          0.0400 40
B             0.1426   0.1426  0.1428  -0.0002          0.1428 40
g             0.6222   0.6277  0.6115   0.0162          0.6060 40
gp            0.0891   0.0900  0.0878   0.0022          0.0869 40validate(mod1b_lrm)          index.orig training    test optimism index.corrected  n
Dxy           0.3045   0.3099  0.2992   0.0107          0.2938 40
R2            0.0696   0.0722  0.0669   0.0054          0.0642 40
Intercept     0.0000   0.0000 -0.0553   0.0553         -0.0553 40
Slope         1.0000   1.0000  0.9599   0.0401          0.9599 40
Emax          0.0000   0.0000  0.0192   0.0192          0.0192 40
D             0.0435   0.0453  0.0418   0.0034          0.0401 40
U            -0.0001  -0.0001  0.0001  -0.0002          0.0001 40
Q             0.0437   0.0454  0.0418   0.0036          0.0401 40
B             0.1424   0.1421  0.1427  -0.0006          0.1429 40
g             0.6500   0.6630  0.6352   0.0277          0.6223 40
gp            0.0912   0.0928  0.0893   0.0034          0.0878 40validate(mod1c_lrm)          index.orig training    test optimism index.corrected  n
Dxy           0.3041   0.3099  0.2993   0.0106          0.2935 40
R2            0.0694   0.0724  0.0671   0.0053          0.0641 40
Intercept     0.0000   0.0000 -0.0518   0.0518         -0.0518 40
Slope         1.0000   1.0000  0.9608   0.0392          0.9608 40
Emax          0.0000   0.0000  0.0183   0.0183          0.0183 40
D             0.0434   0.0453  0.0419   0.0033          0.0401 40
U            -0.0001  -0.0001  0.0001  -0.0002          0.0001 40
Q             0.0435   0.0454  0.0419   0.0035          0.0400 40
B             0.1424   0.1418  0.1426  -0.0008          0.1432 40
g             0.6472   0.6605  0.6331   0.0273          0.6199 40
gp            0.0911   0.0927  0.0894   0.0033          0.0878 40validate(mod1d_lrm)          index.orig training    test optimism index.corrected  n
Dxy           0.3011   0.3071  0.2961   0.0109          0.2901 40
R2            0.0679   0.0711  0.0656   0.0055          0.0624 40
Intercept     0.0000   0.0000 -0.0599   0.0599         -0.0599 40
Slope         1.0000   1.0000  0.9587   0.0413          0.9587 40
Emax          0.0000   0.0000  0.0204   0.0204          0.0204 40
D             0.0425   0.0445  0.0410   0.0035          0.0390 40
U            -0.0001  -0.0001  0.0001  -0.0002          0.0001 40
Q             0.0426   0.0446  0.0409   0.0038          0.0388 40
B             0.1426   0.1424  0.1428  -0.0005          0.1430 40
g             0.6340   0.6478  0.6204   0.0275          0.6065 40
gp            0.0897   0.0916  0.0880   0.0036          0.0861 40Metrics for test sample
Below I am fitting the 4 models to the testing sample, imp_test.
mod1_aug_test <- augment(mod1_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))
mod1b_aug_test <- augment(mod1b_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))
mod1c_aug_test <- augment(mod1c_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))
mod1d_aug_test <- augment(mod1d_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))NOTE: Rather than setting the cutoff at 0.5, I set it at 0.27 (based on plot in the confusion matrix section). We do not have any predicted values as high as 0.5, so I needed to make it something lower. This subjectivity is a major limitation
Below is a table that compares the kappa and accuracy for each of the models in the holdout sample.
 comp <- bind_cols(
yardstick::metrics(data = mod1_aug_test,
        truth = obs, estimate = pred)%>% 
  select(.metric, mod1 = .estimate),
yardstick::metrics(data = mod1b_aug_test,
        truth = obs, estimate = pred) %>%
   select(mod1b = .estimate), 
yardstick::metrics(data = mod1c_aug_test,
        truth = obs, estimate = pred) %>% 
  select(mod1c = .estimate), 
  yardstick::metrics(data = mod1d_aug_test,
        truth = obs, estimate = pred) %>% 
  select(mod1d = .estimate))
comp %>% kable(dig=3)| .metric | mod1 | mod1b | mod1c | mod1d | 
|---|---|---|---|---|
| accuracy | 0.766 | 0.763 | 0.762 | 0.761 | 
| kap | 0.142 | 0.160 | 0.152 | 0.138 | 
Accuracy:
- Best for mod1 (which makes sense because the cutoff was set based on this model)
- only 77% of estimates correct is not very good
- Second best for mod1d
Kappa (measure of inter-rater reliability, perfect agreement=1)
- best for mod1b
- Kappa is basically the strength of the correlation between what we predicted and what the actual was. 0.161 is a really weak correlation
Final Model
I prefer mod1 based on the similar results for fit quality and its lower complexity:
1. overall assessment of fit quality:
- In sample fit statistics
- AIC, BIC: - mod1cwins
- ANOVA: none of the models resulted in a drop in deviance in comparison to - mod1(this finding doesn’t hold much weight)
- R2: all were equal when rounding to 2 decimal points (0.07) 
- Brier score: all identical to 3 decimal points (0.143) 
- C: all identical when rounding to 2 decimal points (0.65) 
- Validated fit statistics
- R2: all were equal when rounding to 2 decimal points (0.06) 
- Brier score: all identical to 3 decimal points (0.143) 
- C: all very similar 
- mod1band- mod1cwere the highest: 0.646
- mod1and- mod1dwere slightly lower: 0.645
2. Not worth the complication of adding non-linear terms
- According to the ANOVA tests, each of the models with nonlinear terms reduced the lack of fit in comparison to the original model
- There was no improvement in predicting as we can see by the lack of substantial improvement in the accuracy or kappa when evaluating the model out of sample
3. Non statistical considerations: we want simple models.
Model Parameters
Below is a listing of the model parameters for `mod1` fit to the entire data set (after multiple imputation) in terms of odds ratios, with 95% confidence intervals
m1_pool <- pool(m1_mods)
sum1 <- summary(m1_pool, exponentiate = TRUE, 
        conf.int = TRUE, conf.level = 0.95) %>% 
    select(-df) 
sum1 %>% kable(digits = c(3, 3, 2, 2, 3, 3))| term | estimate | std.error | statistic | p.value | 2.5 % | 97.5 % | 
|---|---|---|---|---|---|---|
| (Intercept) | 1.136 | 0.10 | 1.28 | 0.201 | 0.934 | 1.381 | 
| subst_abuseyes | 0.718 | 0.04 | -7.77 | 0.000 | 0.661 | 0.781 | 
| regionNortheast | 0.758 | 0.06 | -4.35 | 0.000 | 0.669 | 0.859 | 
| regionSouth | 1.218 | 0.06 | 3.27 | 0.001 | 1.082 | 1.371 | 
| regionWest | 1.300 | 0.07 | 3.93 | 0.000 | 1.141 | 1.481 | 
| regionMidwest | 0.947 | 0.07 | -0.76 | 0.447 | 0.822 | 1.090 | 
| age | 0.971 | 0.00 | -16.50 | 0.000 | 0.968 | 0.975 | 
| sexfemale | 0.870 | 0.05 | -2.99 | 0.003 | 0.794 | 0.953 | 
| insuranceMedicare | 0.670 | 0.05 | -7.46 | 0.000 | 0.603 | 0.745 | 
| insurancePrivate | 0.927 | 0.06 | -1.24 | 0.215 | 0.822 | 1.045 | 
| insuranceSelf_pay | 1.068 | 0.08 | 0.86 | 0.387 | 0.920 | 1.241 | 
| insuranceOther | 0.999 | 0.12 | -0.01 | 0.991 | 0.794 | 1.256 | 
| raceWhite | 0.955 | 0.05 | -0.87 | 0.386 | 0.861 | 1.060 | 
| raceHispanic | 1.180 | 0.06 | 2.76 | 0.006 | 1.049 | 1.328 | 
| raceOther | 1.195 | 0.11 | 1.62 | 0.105 | 0.964 | 1.483 | 
| raceAsian | 1.508 | 0.19 | 2.13 | 0.033 | 1.034 | 2.199 | 
| raceNativeA | 1.149 | 0.29 | 0.48 | 0.635 | 0.648 | 2.038 | 
| patient_locFringe | 1.090 | 0.06 | 1.40 | 0.160 | 0.966 | 1.229 | 
| patient_locmetro>250K | 0.995 | 0.06 | -0.08 | 0.932 | 0.882 | 1.122 | 
| patient_locmetro>50K | 0.968 | 0.10 | -0.34 | 0.735 | 0.802 | 1.168 | 
| patient_locmicro | 0.890 | 0.12 | -1.01 | 0.312 | 0.710 | 1.115 | 
| patient_locOther | 1.035 | 0.14 | 0.24 | 0.813 | 0.781 | 1.371 | 
| zipinc_qrtl48-61K | 1.046 | 0.05 | 0.84 | 0.400 | 0.941 | 1.163 | 
| zipinc_qrtl61-82K | 1.053 | 0.06 | 0.85 | 0.398 | 0.934 | 1.188 | 
| zipinc_qrtl82K+ | 1.028 | 0.07 | 0.37 | 0.708 | 0.888 | 1.192 | 
- After adjusting for - region,- age,- sex,- insurance,- race,- zipinc_qrtl, the odds of an AIDS defining illness in PLWH with a substance disorder is 0.726 95% CI ( 0.668 to 0.789) time the odds in those without a substance use disorder.
- given that the 95% CI is entirely below 1, the model suggests that having a SUD is associated with a lower odds of an AIDS defining illness 
Effect sizes
Below are the effect sizes for all elements of mod1 both numerically and graphically.
plot(summary(mod1_lrm))
kable(summary(mod1_lrm, conf.int=0.95), digits=3) | Low | High | Diff. | Effect | S.E. | Lower 0.95 | Upper 0.95 | Type | |
|---|---|---|---|---|---|---|---|---|
| age | 40 | 58 | 18 | -0.520 | 0.032 | -0.582 | -0.459 | 1 | 
| Odds Ratio | 40 | 58 | 18 | 0.594 | NA | 0.559 | 0.632 | 2 | 
| subst_abuse - yes:no | 1 | 2 | NA | -0.331 | 0.043 | -0.414 | -0.247 | 1 | 
| Odds Ratio | 1 | 2 | NA | 0.718 | NA | 0.661 | 0.781 | 2 | 
| region - Northeast:South_Atlantic | 1 | 2 | NA | -0.274 | 0.064 | -0.399 | -0.149 | 1 | 
| Odds Ratio | 1 | 2 | NA | 0.760 | NA | 0.671 | 0.861 | 2 | 
| region - South:South_Atlantic | 1 | 3 | NA | 0.200 | 0.060 | 0.082 | 0.318 | 1 | 
| Odds Ratio | 1 | 3 | NA | 1.222 | NA | 1.085 | 1.375 | 2 | 
| region - West:South_Atlantic | 1 | 4 | NA | 0.268 | 0.066 | 0.138 | 0.398 | 1 | 
| Odds Ratio | 1 | 4 | NA | 1.307 | NA | 1.148 | 1.489 | 2 | 
| region - Midwest:South_Atlantic | 1 | 5 | NA | -0.052 | 0.072 | -0.194 | 0.089 | 1 | 
| Odds Ratio | 1 | 5 | NA | 0.949 | NA | 0.824 | 1.093 | 2 | 
| sex - female:male | 1 | 2 | NA | -0.140 | 0.047 | -0.231 | -0.049 | 1 | 
| Odds Ratio | 1 | 2 | NA | 0.869 | NA | 0.794 | 0.952 | 2 | 
| insurance - Medicare:Medicaid | 1 | 2 | NA | -0.402 | 0.054 | -0.507 | -0.297 | 1 | 
| Odds Ratio | 1 | 2 | NA | 0.669 | NA | 0.602 | 0.743 | 2 | 
| insurance - Private:Medicaid | 1 | 3 | NA | -0.071 | 0.061 | -0.190 | 0.049 | 1 | 
| Odds Ratio | 1 | 3 | NA | 0.932 | NA | 0.827 | 1.050 | 2 | 
| insurance - Self_pay:Medicaid | 1 | 4 | NA | 0.066 | 0.076 | -0.083 | 0.216 | 1 | 
| Odds Ratio | 1 | 4 | NA | 1.069 | NA | 0.920 | 1.241 | 2 | 
| insurance - Other:Medicaid | 1 | 5 | NA | 0.002 | 0.117 | -0.227 | 0.231 | 1 | 
| Odds Ratio | 1 | 5 | NA | 1.002 | NA | 0.797 | 1.260 | 2 | 
| race - White:Black | 1 | 2 | NA | -0.049 | 0.053 | -0.152 | 0.055 | 1 | 
| Odds Ratio | 1 | 2 | NA | 0.953 | NA | 0.859 | 1.056 | 2 | 
| race - Hispanic:Black | 1 | 3 | NA | 0.163 | 0.060 | 0.046 | 0.280 | 1 | 
| Odds Ratio | 1 | 3 | NA | 1.177 | NA | 1.047 | 1.323 | 2 | 
| race - Other:Black | 1 | 4 | NA | 0.167 | 0.109 | -0.047 | 0.381 | 1 | 
| Odds Ratio | 1 | 4 | NA | 1.182 | NA | 0.954 | 1.464 | 2 | 
| race - Asian:Black | 1 | 5 | NA | 0.390 | 0.192 | 0.013 | 0.767 | 1 | 
| Odds Ratio | 1 | 5 | NA | 1.476 | NA | 1.013 | 2.152 | 2 | 
| race - NativeA:Black | 1 | 6 | NA | 0.086 | 0.287 | -0.476 | 0.648 | 1 | 
| Odds Ratio | 1 | 6 | NA | 1.090 | NA | 0.622 | 1.912 | 2 | 
| patient_loc - Fringe:Central | 1 | 2 | NA | 0.099 | 0.059 | -0.017 | 0.216 | 1 | 
| Odds Ratio | 1 | 2 | NA | 1.105 | NA | 0.983 | 1.241 | 2 | 
| patient_loc - metro>250K:Central | 1 | 3 | NA | -0.005 | 0.059 | -0.121 | 0.110 | 1 | 
| Odds Ratio | 1 | 3 | NA | 0.995 | NA | 0.886 | 1.116 | 2 | 
| patient_loc - metro>50K:Central | 1 | 4 | NA | -0.038 | 0.096 | -0.226 | 0.150 | 1 | 
| Odds Ratio | 1 | 4 | NA | 0.962 | NA | 0.798 | 1.161 | 2 | 
| patient_loc - micro:Central | 1 | 5 | NA | -0.112 | 0.115 | -0.337 | 0.113 | 1 | 
| Odds Ratio | 1 | 5 | NA | 0.894 | NA | 0.714 | 1.120 | 2 | 
| patient_loc - Other:Central | 1 | 6 | NA | 0.025 | 0.141 | -0.252 | 0.302 | 1 | 
| Odds Ratio | 1 | 6 | NA | 1.025 | NA | 0.777 | 1.352 | 2 | 
| zipinc_qrtl - 48-61K:<48K | 1 | 2 | NA | 0.040 | 0.052 | -0.063 | 0.143 | 1 | 
| Odds Ratio | 1 | 2 | NA | 1.041 | NA | 0.939 | 1.153 | 2 | 
| zipinc_qrtl - 61-82K:<48K | 1 | 3 | NA | 0.053 | 0.061 | -0.067 | 0.173 | 1 | 
| Odds Ratio | 1 | 3 | NA | 1.054 | NA | 0.936 | 1.189 | 2 | 
| zipinc_qrtl - 82K+:<48K | 1 | 4 | NA | 0.012 | 0.072 | -0.129 | 0.154 | 1 | 
| Odds Ratio | 1 | 4 | NA | 1.013 | NA | 0.879 | 1.166 | 2 | 
Conclusion for Analysis 1
My first research question was, “In PLWH in 2018, how does hospitalization due to an AIDS defining illnesses in those with a SUD compare to those without a SUD?” This is an important question because the prevalence of SUD is high in PLWH (estimated 48%) and it has shown to have deleterious impacts on medication adherence, retention to services, time to diagnosis, and care linkage. AIDS defining illnesses are an indication of disease progression, thus it would be valuable to quantify the effect SUD has on disease progression. There have been 3 cohort studies (post HAART era) that have evaluated this relationship, each of which found a higher burden of AIDS defining illnesses in those with SUD. However, these cohort use data from years 2003 and 2004. Thus, these studies were conducted before the introduction of integrase strand inhibitors and single tablet regimens, both of which have greatly impacted adherence and viral load suppression. Furthermore, none of these studies were nationally representative.
My model indicated that after adjusting for demographics and socioeconomic factors, the odds of hospitalization due to an AIDS defining illness in those with SUD was 0.726 95% CI (0.668 to 0.789). Clearly these results do not match the previous findings. However, my study differs from the previous studies in some important ways. The previous studies:
1. Adjusted for information that I did not have access to such as lab values ( i.e. CD4 count, viral load), opportunistic infection prophylaxis, and sharing needles
2. Had different definitions of AIDS defining illness (eg Lucas et al required that PCP and candida esophagitis had to be recurrent in order to be considered an AIDS defining opportunistic infection)
3. Had different definitions of SUD: the other study’s definitions of substance abuse disorder only included cocaine/crack or heroin. My definition of substance use disorder was much broader and included abuse of alcohol, opioids, sedatives, hypnotics, anxiolytics, cocaine, other stimulants, hallucinogens, inhalants, or other psychoactive substances/multiple drug use.
4. Were not nationally representative: the study by Cook et al (n=1,686) and Anastos et al (n=961) only included women, and the study by Lucas et al. only included people from Maryland (n=1,851). My study was nationally representative of all U.S. hospitalizations in 2018 (n=24,118)
Limitations
1. This data was collected for the purpose of reimbursement, not research. Thus, conditions may not have been coded if there was no reimbursement associated with them.
2. I could not adjust for the patients HIV medication regimen, previous history of opportunistic infections, or opportunistic infection prophylaxis
3. Lack of granularity of the ICD10 codes
4. I could have adjusted for immunocompromising comorbidities
5. My final model was very weak (r2=0.07, C statistic=0.65)
6. The assessment of the model’s accuracy and kappa value were based on an arbitrary cut off of 0.27
7. I had a lot of trouble determining the Youden Index. A major limitation to my confusion matrix was the arbitrary cutoff point of 0.27. Determining the Youden Index would allow me to find the point where both sensitivity and specificity are maximized.
**Strengths of my study**:
1. Nationally representative with a large sample size
2. Very low percentage of missing values (maximum was 3.5%) and I used multiple imputation to deal with the missing values
3. Evaluated multiple different models with different combinations of nonlinear terms
4. Adjusted for demographics and socioeconomic status
My next steps are:
1. adjust for immunocompromising conditions
2. include propensity score matching
- age, gender, race, total comorbidity number (not sure how to get that though),# of procedures, admission type, insurance, income quartile, hospital bed size, location, hospital teaching status (HCUP-NIS paper matched on https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5514640/)
3. refine the definition of substance use disorder to stimulant use only (I read that stimulant use is associated with increased HIV viral replication and there are a multitude of studies showing the relationship between increased viral load and stimulant use)
4. Do a stability analysis where I just do complete cases