Preliminaries

library(magrittr);
library(janitor); library(naniar)
library(rms)
library(mice)
# mice = multiple imputation through chained equations
library(broom); library(yardstick)
library(tableone)
library(pscl)
library(tidyverse)
# library(cutpointr)
# library(OptimalCutpoints)
theme_set(theme_bw())

1 Background

It is estimated that almost have of people living with HIV in the US have a substance use disorder. This is a major public health concern because it has been found that people living with HIV and a comorbid substance use disorder (SUD) have lower retention to care and decreased medication adherence. Several previous studies have demonstrated that the incidence of opportunistic infections, a marker of disease progression, is higher in this population. There have been no recent studies that evaluate the relationship between a SUD and the presence of an opportunistic infection (i.e. AIDS defining illness) in PLWH. Furthermore, there has yet to be a study which evaluates this syndemic nationally.

2 Research Questions

In people living with HIV (PLWH) in 2018, how does hospitalization due to an AIDS defining illnesses in those with a SUD compare to those without a SUD?
PLWH in 2018, how does the length of hospital stay for people with SUD compare to those without SUD?

3 My Data

Healthcare Cost and Utilization Project, Nationwide Inpatient Sample (HCUP-NIS) is the largest available all-payer inpatient healthcare administrative data set. It approximates a 20-percent stratified sample of all discharges from United States hospitals. It constitutes data from 48 states and 10,000 community hospitals, representing 95% of the United States population. Data from each record contains information regarding patient demographics, diagnoses, procedures, and other information associated with a hospital admission.

The data can be purchased by the public with at the following link: https://www.hcup-us.ahrq.gov/nisoverview.jsp

Strengths of how this data set relates to my research question:

Nationally representative
Inpatient hospitalizations
Collects information on sociodemographic factors which I can adjust for
Collects information on up to 40 diagnoses, thus I can capture both the exposure and outcome
Collects information on length of stay

Limitations of the data set:

the data quality of secondary databases is not perfect as the diagnoses codes may not necessarily be accurate, granular, or complete
- The HIV population tends to have many co-occurring conditions, and thus it is possible that not all SUD conditions were not recorded. Therefore, there may be some people in the unexposed group who should be in the exposed group
The latest data available is from 2018. Drug therapy has dramatically changed with integrase inhibitors becoming first line and having drugs with longer half-lives and easier to adhere too. This is especially important for the SUD population who are less likely to be adherent. Thus, with with all of the advances in care, the 2018 analysis may not actually reflect 2021’s gaps in care.

3.1 Data Ingest

Below I am ingesting my data hiv_raw.

hiv_raw <- read.csv("hiv_oi.csv") %>% 
  clean_names()

hiv_raw <- hiv_raw %>% 
  haven::zap_label()  %>% 
  mutate(key_nis = as.character(key_nis))

dim(hiv_raw)

[1] 24203    20

As originally loaded, the hiv_raw data contain 24203 rows and 20 columns.

3.2 Tidying, Data Cleaning and Data Management

Below I am cleaning the data according to the HCUP_NIS code book found at https://www.hcup-us.ahrq.gov/db/nation/nis/nisdde.jsp

In summary I have:

1.) converted all variables to factors except for age and key_nis

2.) coded the variable levels for factors with more descriptive names rather than numbers

3.) reordered according to frequency with fct_infreq

4.) selected only the variables that I will use

5.) converted los to a number

hiv <- hiv_raw %>% 
mutate(female=as.numeric(female)) %>% 
  mutate(sex = fct_recode(factor(female),
"male" = "0", "female" = "1"),
        race = fct_recode(factor(race),
                "White" = "1", 
                "Black" = "2", 
                "Hispanic"= "3",
                "Asian" = "4",
                "NativeA"= "5",
                "Other" = "6"),
    race = fct_infreq(race), 
    zipinc_qrtl = fct_recode(factor(zipinc_qrtl),
                            "<48K"= "1",
                            "48-61K" = "2",
                            "61-82K"= "3",
                            "82K+" = "4")) %>% 
mutate(pay1=as.numeric(pay1)) %>% 
  mutate(insurance  = fct_recode(factor(pay1),
                  "Medicare" = "1",
                  "Medicaid" = "2",
                  "Private" = "3",
                  "Self_pay" = "4",
                  "Other" = "5",
                  "Other" = "6"),
insurance = fct_infreq(insurance), 
    patient_loc =  fct_recode(factor(pl_nchs),
                  "Central" = "1",
                  "Fringe" = "2",
                  "metro>250K" = "3",
                  "metro>50K" = "4",
                  "micro" = "5",
                  "Other" = "6" ),
patient_loc = fct_infreq(patient_loc)) %>% 
  mutate(region = fct_recode(factor(hosp_division),
                  "Northeast" = "1",
                  "Northeast" = "2",
                  "Midwest" = "3",
                  "Midwest" = "4",
                  "South_Atlantic" = "5",
                  "South" = "6",
                  "South" = "7",
                  "West" = "8",
                  "West" = "9"),
        region = fct_infreq(region)) %>% 
  mutate(ED_record = fct_recode(factor(hcup_ed),
          "no" = "0",
          "yes" = "1", "yes" = "2", "yes" ="3", "yes"="4")) %>% 
  mutate(subst_abuse=fct_recode(factor(sa),
         "yes" = "1",
         "no"= "0"),
         subst_abuse = fct_relevel(subst_abuse, "no")) %>% 
  mutate(los=as.numeric(los)) %>% 
  
 mutate(AIDS_f = fct_recode(factor(oi),
                           "yes"= "1",
                           "no" ="0")) %>% 
  mutate(AIDS_f=fct_relevel(AIDS_f, "no")) %>% 
  rename(AIDS= oi) %>% 
  select(key_nis, subst_abuse, AIDS, AIDS_f, los, age, sex, race, region, zipinc_qrtl, insurance, patient_loc, ED_record)

Note: I had to do as.numeric (convert to numeric) for female and insurance because there were null values that weren’t capturing and instead just under a blank space, so they automatically got converted to NA when I made them a number first.

Below we can see that we have all of the variables in the form that they should be in:

1.) character:key_nis

2.) factor: subst_abuse, AIDS, sex, race, region, zipinc_qrtl, insurance, patient_loc,ED_record

3.) numeric: oi (will be an indicator in logistic regression), los,

head(hiv)

   key_nis subst_abuse AIDS AIDS_f los age    sex     race    region
1 10317955         yes    0     no   2  68   male    White Northeast
2 10160121         yes    0     no   8  31 female    White Northeast
3 10137460          no    0     no   3  50   male    White Northeast
4 10324914          no    0     no   5  50   male    White Northeast
5 10029616         yes    0     no   2  41 female Hispanic Northeast
6 10204914          no    0     no   1  46 female    White Northeast
  zipinc_qrtl insurance patient_loc ED_record
1        <48K  Medicare       Other        no
2      61-82K  Medicaid  metro>250K       yes
3        <48K  Medicare       Other        no
4        <48K  Medicare       Other        no
5        <48K  Medicaid  metro>250K       yes
6        <48K  Medicare  metro>250K       yes

3.3 eligibiity

I am only going to include adults in this study (age>19)

hiv <- hiv %>% 
  filter(age>19)

3.4 Checking the variables

3.4.1 Categorical Variables

Here I am making sure that all of the factor levels work out (subst_abuse, AIDS, sex, race, region, zipinc_qrtl, insurance, patient_loc,ED_record)

hiv %>% count(subst_abuse)

  subst_abuse     n
1          no 12329
2         yes 11791

The two categories of subst_abuse look good.

hiv %>% count(AIDS_f)

  AIDS_f     n
1     no 19720
2    yes  4400

The two categories of AIDS_f look good

hiv %>% count(sex)

     sex     n
1   male 16847
2 female  7266
3   <NA>     7

The two categories of sex look good

hiv %>% count(race)

      race     n
1    Black 12202
2    White  6812
3 Hispanic  3633
4    Other   848
5    Asian   224
6  NativeA   117
7     <NA>   284

The 6 categories of race look good

hiv %>% count(region)

          region    n
1 South_Atlantic 7337
2      Northeast 5641
3          South 4282
4           West 4007
5        Midwest 2853

The 5 categories of region look good

hiv %>% count(zipinc_qrtl)

  zipinc_qrtl     n
1        <48K 11397
2      48-61K  5409
3      61-82K  3840
4        82K+  2627
5        <NA>   847

The 4 categories of zipinc_qrtl look good

hiv %>% count(insurance)

  insurance    n
1  Medicaid 9511
2  Medicare 8490
3   Private 3654
4  Self_pay 1753
5     Other  687
6      <NA>   25

The 5 categories of insurance look good

hiv %>% count(ED_record)

  ED_record     n
1        no  4791
2       yes 19329

The 2 categories of ED_record look good

hiv %>% count(patient_loc)

  patient_loc     n
1     Central 12920
2      Fringe  4043
3  metro>250K  3914
4   metro>50K  1276
5       micro   912
6       Other   535
7        <NA>   520

6 categories for patient_loc look good

3.4.2 Quantitative variables

I have two quantitative variables: length of stay (los) and age

3.4.2.1 age

Below are numeric and plotted summaries of the age distribution.

mosaic::favstats(~age, data=hiv)

 min Q1 median Q3 max     mean       sd     n missing
  20 40     51 58  90 49.48238 12.72655 24120       0

ggstatsplot::gghistostats(
  data = hiv,
  x = age,
  type = "np",
  xlab = "age",
  bar.fill = "lightblue")

It looks like the age range is fine.

3.4.2.2 length of stay

Our length of stays range from 0 to 294. Those all look plausible.

mosaic::favstats(~los, data=hiv)

 min Q1 median Q3 max     mean       sd     n missing
   0  2      4  8 294 7.027739 9.720441 24118       2

ggstatsplot::gghistostats(
  data = hiv,
  x = los,
  type = "np",
  xlab = "length of stay",
  bar.fill = "lightblue")

I just want to say how awesome it is that we are seeing PLWH living into their 90’s. Gives me the chills. This is a miracle that really shows how far we have come in treating this disease (unimaginable 30 years ago)

3.5 Missingness

I have 1685 missing observations in the hiv data set.

Below we can see that we have the most missingness for zipinc_qrtl (3.5%), patient_loc (2.2%), and race (1.2%)

gg_miss_var(hiv)

This means that if I do multiple imputation, I should do at least 4 iterations.

miss_var_summary(hiv)

# A tibble: 13 x 3
   variable    n_miss pct_miss
   <chr>        <int>    <dbl>
 1 zipinc_qrtl    847  3.51   
 2 patient_loc    520  2.16   
 3 race           284  1.18   
 4 insurance       25  0.104  
 5 sex              7  0.0290 
 6 los              2  0.00829
 7 key_nis          0  0      
 8 subst_abuse      0  0      
 9 AIDS             0  0      
10 AIDS_f           0  0      
11 age              0  0      
12 region           0  0      
13 ED_record        0  0

About 95% of the cases aren’t missing any data.

miss_case_table(hiv)

# A tibble: 4 x 3
  n_miss_in_case n_cases pct_cases
           <int>   <int>     <dbl>
1              0   22971   95.2   
2              1     619    2.57  
3              2     524    2.17  
4              3       6    0.0249

3.5.1 Removing observations with missing outcome

There were 2 people missing data on the outcome, los. I will remove them.

hiv <- hiv %>% filter(complete.cases(los))

3.5.2 Missingness mechanism

I am assuming that the data are missing at random. I had originally thought it was MNAR, but the issue here is why the data are missing. Although my prediction will not be perfect, the covariates I will use to predict the missing values may do a reasonable job.

3.6 Tidied Tibble

Our tibble hiv contains 24118 rows (patients) and 13 columns (variables). Each variable is contained in a column, and each row represents a single key_nis. All variables now have appropriate types.

head(hiv) %>% kable()

key_nis	subst_abuse	AIDS_f	los	age	sex	race	region	zipinc_qrtl	insurance	patient_loc	ED_record
10317955	yes	no	2	68	male	White	Northeast	<48K	Medicare	Other	no
10160121	yes	no	8	31	female	White	Northeast	61-82K	Medicaid	metro>250K	yes
10137460	no	no	3	50	male	White	Northeast	<48K	Medicare	Other	no
10324914	no	no	5	50	male	White	Northeast	<48K	Medicare	Other	no
10029616	yes	no	2	41	female	Hispanic	Northeast	<48K	Medicaid	metro>250K	yes
10204914	no	no	1	46	female	White	Northeast	<48K	Medicare	metro>250K	yes

I have also saved the tidied tibble as an R data set

saveRDS(hiv, "hiv.Rds")

4 Code Book and Clean Data Summary

Sample Size The data in our complete hiv sample consist of 24118 subjects from HCUP-NIS with a diagnosis of HIV and between the ages of 20 and 90 in whom our outcome variable was measured.
Missingness Of the 24118 subjects, 22971 have complete data on all variables listed below.
Our outcome variables are los and AIDS.

los is the number of days that the patient was hospitalized for. HCUP-NIS calculated it by subtracting the admission date from the discharge date
AIDS is if the person had a diagnosis for an opportunistic infection, which were AIDS defining. NOTE the definition of AIDS is either (1) CD4 <200 OR presence of an AIDS defining opportunistic infection. According to the CDC, AIDS defining illnesses include candidiasis of the esophagus (B37.81), bronchi , trachea, or lungs (B371); invasive cervical cancer (C53); coccidiomycosis (B38); cryptococcosis (B45); cryptosporidiosis(A07.2); cytomegalovirus disease or CMV(B25); histoplasmosis (B39); isosporiasis (A07.3); Kaposi sarcoma (C46); Burkitt’s, immunoblastic, Hodgkin’s, and Non- Hodgkin’s lymphoma (Burkitt’s, immunoblastic); mycobacterium avium complex (A31.2,A31.8); mycobacterium tuberculosis (A15); pneumocystis pneumonia (B59); recurrent pneumonia (Z87.01); progressive multifocal leukoencephalopathy (A81.2), salmonella septicemia (A02.1) and toxoplasmosis of brain (B58.2)

All other variables listed below will serve as candidate predictors for our models.
The other variable contained in my tidy tibble is key_nis which is the key_nis identifying code.

paste(colnames(hiv), collapse = " | ")

[1] "key_nis | subst_abuse | AIDS | AIDS_f | los | age | sex | race | region | zipinc_qrtl | insurance | patient_loc | ED_record"

Variable	Type	Description
key_nis	character	key_nis identifier
subst_abuse	binary	main predictor whether or not somebody has of substance use disorder. Patients were classified as having a history of SUD if they had an ICD-10 code for abuse of alcohol (F10), opioids, sedatives, hypnotics, anxiolytics (F13), cocaine (F14), other stimulants (F15), hallucinogens (F16), inhalants (F18), or other psychoactive substances/multiple drug use (F19) (yes/no)
age	quant	age in years.
sex	binary	male, female.
race	5-cat	Black, White, Hispanic, Other, Asian, Native American
region	5-cat	Northeast, Midwest, South, South_Atlantic, West
zipinc_qrtl	4-cat	Median household income for patient’s ZIP Code (based on current year). Values include <48K, 48-61K, 61-82K, 82K+.
insurance	5-cat	expected primary payer (Medicare, Medicaid, private insurance, self pay, other)
patient_loc	6-cat	Patient Location (“Central” counties of metro areas of >=1 million population, “Fringe” counties of metro areas of >=1 million population, Counties in metro areas of 250,000-999,999 population, Counties in metro areas of 50,000-249,999 population, Micropolitan counties, Not metropolitan or micropolitan counties)
ED_record	binary	records that have evidence of emergency department (ED) services reported on the HCUP record (yes/no)

4.1 Data Distribution Description

I have used the html function applied to an Hmisc::describe() to provide a clean data summary

hiv %>% 
    select(-key_nis) %>%
    Hmisc::describe() %>% Hmisc::html()

.

12 Variables 24118 Observations

subst_abuse

n	missing	distinct
24118	0	2

 Value         no   yes
 Frequency  12327 11791
 Proportion 0.511 0.489

AIDS

n	missing	distinct	Info	Sum	Mean	Gmd
24118	0	2	0.447	4399	0.1824	0.2983

AIDS_f

n	missing	distinct
24118	0	2

 Value         no   yes
 Frequency  19719  4399
 Proportion 0.818 0.182

los

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
24118	0	122	0.99	7.028	7.125	1	1	2	4	8	15	21

lowest : 0 1 2 3 4 , highest: 185 196 204 247 294

age

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
24118	0	71	0.999	49.48	14.45	27	31	40	51	58	65	69

lowest : 20 21 22 23 24 , highest: 86 87 88 89 90

sex

n	missing	distinct
24111	7	2

 Value        male female
 Frequency   16846   7265
 Proportion  0.699  0.301

race

n	missing	distinct
23834	284	6

lowest : Black White Hispanic Other Asian , highest: White Hispanic Other Asian NativeA

 Value         Black    White Hispanic    Other    Asian  NativeA
 Frequency     12202     6812     3632      847      224      117
 Proportion    0.512    0.286    0.152    0.036    0.009    0.005

region

n	missing	distinct
24118	0	5

lowest :	South_Atlantic	Northeast	South	West	Midwest
highest:	South_Atlantic	Northeast	South	West	Midwest

 Value      South_Atlantic      Northeast          South           West
 Frequency            7337           5640           4282           4006
 Proportion          0.304          0.234          0.178          0.166
                          
 Value             Midwest
 Frequency            2853
 Proportion          0.118

zipinc_qrtl

n	missing	distinct
23272	846	4

 Value        <48K 48-61K 61-82K   82K+
 Frequency   11397   5408   3840   2627
 Proportion  0.490  0.232  0.165  0.113

insurance

n	missing	distinct
24093	25	5

lowest : Medicaid Medicare Private Self_pay Other , highest: Medicaid Medicare Private Self_pay Other

 Value      Medicaid Medicare  Private Self_pay    Other
 Frequency      9509     8490     3654     1753      687
 Proportion    0.395    0.352    0.152    0.073    0.029

patient_loc

n	missing	distinct
23599	519	6

lowest :	Central	Fringe	metro>250K	metro>50K	micro
highest:	Fringe	metro>250K	metro>50K	micro	Other

 Value         Central     Fringe metro>250K  metro>50K      micro      Other
 Frequency       12919       4043       3914       1276        912        535
 Proportion      0.547      0.171      0.166      0.054      0.039      0.023

ED_record

n	missing	distinct
24118	0	2

 Value         no   yes
 Frequency   4790 19328
 Proportion 0.199 0.801

5 Analysis 1 AIDS defining Illness (Logistic Regression)

5.1 Plans

5.1.1 Binary Outcome

My binary outcome is AIDS
There are no missing cases on this outcome

hiv %>% select(AIDS) %>% miss_case_table() %>% kable()

n_miss_in_case	n_cases	pct_cases
0	24118	100

5.1.2 Planned Predictors

subst_abuse

main predictor
binary
0% missing

zipinc_qrtl

4 category
3.5% missing

patient_loc

6 categories
2.15% missing

race

5 categories
1.2% missing

insurance

5 categories
0.10% missing

sex

binary
0.03% missing

age

continuous
0% missing

region

5 categories
0% missing

5.2 Describing the data (visualization)

5.2.1 Two by two table

I have made a two by two table

The exposure is subst_abuse, which is represented in the rows of my table below. It is a factor that has two levels: yes and no, which represents patients who have a substance use disorder (abuse of alcohol, opioids, sedatives, hypnotics, anxiolytics, cocaine, other stimulants, hallucinogens, inhalants, or other psychoactive substances/multiple drug use).
The outcome is AIDS_f, which is represented in the columns of my table below. It is a factor that has two levels: yes and no, which represents patients who have an opportunistic infection (AIDS defining illness)
- according to the CDC is an AIDS defining illness (candidiasis of the esophagus (B37.81), bronchi , trachea, or lungs (B371); invasive cervical cancer (C53); coccidiomycosis (B38); cryptococcosis (B45); cryptosporidiosis(A07.2); cytomegalovirus disease or CMV(B25); histoplasmosis (B39); isosporiasis (A07.3); Kaposi sarcoma (C46); Burkitt’s, immunoblastic, Hodgkin’s, and Non- Hodgkin’s lymphoma (Burkitt’s, immunoblastic); mycobacterium avium complex (A31.2, A31.8); mycobacterium tuberculosis (A15); pneumocystis pneumonia (B59); recurrent pneumonia (Z87.01); progressive multifocal leukoencephalopathy (A81.2), salmonella septicemia (A02.1) and toxoplasmosis of brain (B58.2))

hiv_reordered <- hiv %>% 
mutate(subst_abuse=fct_relevel(subst_abuse, "yes" )) %>% 
  mutate(AIDS_f=fct_relevel(AIDS_f, "yes" ))

tableE <- hiv_reordered %>%  
 tabyl (subst_abuse, AIDS_f) 

tableE %>% 
  adorn_totals(where = c("row", "col")) %>% 
  adorn_percentages(denom = "row") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns(position = "front") %>% kable()

subst_abuse	yes	no	Total
yes	1929 (16.4%)	9862 (83.6%)	11791 (100.0%)
no	2470 (20.0%)	9857 (80.0%)	12327 (100.0%)
Total	4399 (18.2%)	19719 (81.8%)	24118 (100.0%)

Among those who had a substance abuse disorder, 16.4%% (1929 /11791) had an AIDS defining illness. In those who did not have a substance use disorder, 20.0%% (2471/12329) had an and AIDS defining illness

This lower prevalence of AIDS defining illness in people with SUD is surprising given previous literature. After adjustment, we will see if this relationship still holds or if it is meaningfully lower. I’m excited to see what happens!

5.2.2 Table one

This table has the covariates that I will be adjusting for as I explore the relationship between subst_abuse and AIDS.

vars <- c("region", "age", "sex", "insurance", "race", "patient_loc", "zipinc_qrtl")

factorvars <- c("region", "sex", "insurance", "race", "patient_loc", "zipinc_qrtl")

trt <- c("subst_abuse")

table01 <- CreateTableOne(data = hiv_reordered,  
                       vars = vars, 
                       factorVars= factorvars,
                       strata = trt)
print(table01, verbose=TRUE)

                   Stratified by subst_abuse
                    yes           no            p      test
  n                 11791         12327                    
  region (%)                                    <0.001     
     South_Atlantic  3392 (28.8)   3945 (32.0)             
     Northeast       2944 (25.0)   2696 (21.9)             
     South           1956 (16.6)   2326 (18.9)             
     West            2001 (17.0)   2005 (16.3)             
     Midwest         1498 (12.7)   1355 (11.0)             
  age (mean (SD))   48.22 (11.58) 50.69 (13.62) <0.001     
  sex = female (%)   3413 (29.0)   3852 (31.3)  <0.001     
  insurance (%)                                 <0.001     
     Medicaid        5773 (49.0)   3736 (30.3)             
     Medicare        3595 (30.5)   4895 (39.7)             
     Private         1185 (10.1)   2469 (20.0)             
     Self_pay         892 ( 7.6)    861 ( 7.0)             
     Other            330 ( 2.8)    357 ( 2.9)             
  race (%)                                      <0.001     
     Black           5990 (51.4)   6212 (51.0)             
     White           3534 (30.3)   3278 (26.9)             
     Hispanic        1590 (13.6)   2042 (16.8)             
     Other            402 ( 3.4)    445 ( 3.7)             
     Asian             65 ( 0.6)    159 ( 1.3)             
     NativeA           73 ( 0.6)     44 ( 0.4)             
  patient_loc (%)                               <0.001     
     Central         6453 (56.7)   6466 (52.9)             
     Fringe          1642 (14.4)   2401 (19.7)             
     metro>250K      1990 (17.5)   1924 (15.8)             
     metro>50K        604 ( 5.3)    672 ( 5.5)             
     micro            440 ( 3.9)    472 ( 3.9)             
     Other            258 ( 2.3)    277 ( 2.3)             
  zipinc_qrtl (%)                               <0.001     
     <48K            6028 (53.7)   5369 (44.6)             
     48-61K          2499 (22.3)   2909 (24.2)             
     61-82K          1669 (14.9)   2171 (18.0)             
     82K+            1032 ( 9.2)   1595 (13.2)

Caption: Baseline characteristics for 24,120 hospitalized HIV patients in 2018 (HCUP-NIS data) in those with substance use disorder versus those without. NOTE: percent of missing values: zipinc_qrtl (3.5%), patient_loc (2.15%), race (1.2%), insurance (0.10%), sex (0.03%), age (0%), region (0%)

We can see that there is definitely an imbalance between some of these baseline characteristics. The most meaningful differences appear in:

median income based on zip for the <48K group
more medicaid insurance in the SUD group
the people in the SUD group are slightly younger

These demographic factors have been adjusted for in other papers which have shown AIDS defining illness to be higher in those with SUD. So we will see what happens!

5.3 Splitting the data

I will start by splitting the data into separate training and testing samples
I have used the option strata to ensure that the same percentage of people in both samples have the outcome of an AIDS defining illness

hiv1 <- hiv %>% select(subst_abuse, AIDS, region, age, sex, insurance, race, patient_loc, zipinc_qrtl )

set.seed(30)
hiv_split1 <- rsample::initial_split(hiv1, prop = 0.7,
strata = AIDS)
hiv_train1 <- rsample::training(hiv_split1) 
hiv_test1 <- rsample::testing(hiv_split1)
dim(hiv_test1)

[1] 7234    9

Below we can see that 18.2% of people in each group had the outcome of an AIDS defining illness

hiv_train1 %>% tabyl(AIDS) %>% kable(dig=3)

AIDS	n	percent
0	13804	0.818
1	3080	0.182

hiv_test1 %>% tabyl(AIDS) %>% kable(dig=3)

AIDS	n	percent
0	5915	0.818
1	1319	0.182

5.4 Multiple Imputation

5.4.1 training sample multiple imputation

I will impute values for the predictors zipinc_qrtl, patient_loc, race, and insurance

Since the highest percent of missingness I have is 3.5% (zipinc_qrtl), I will do 4 imputed data sets (m=4)

hiv1_mice<- mice(hiv_train1, m = 4, printFlag = F)

Below is a summary of the multiple imputation process

summary(hiv1_mice)

Class: mids
Number of multiple imputations:  4 
Imputation methods:
subst_abuse        AIDS      region         age         sex   insurance 
         ""          ""          ""          ""    "logreg"   "polyreg" 
       race patient_loc zipinc_qrtl 
  "polyreg"   "polyreg"   "polyreg" 
PredictorMatrix:
            subst_abuse AIDS region age sex insurance race patient_loc
subst_abuse           0    1      1   1   1         1    1           1
AIDS                  1    0      1   1   1         1    1           1
region                1    1      0   1   1         1    1           1
age                   1    1      1   0   1         1    1           1
sex                   1    1      1   1   0         1    1           1
insurance             1    1      1   1   1         0    1           1
            zipinc_qrtl
subst_abuse           1
AIDS                  1
region                1
age                   1
sex                   1
insurance             1

we can see that I did 4 imputations, which variables had missing, and how those variables were imputed.

5.4.2 testing sample single imputatoin

I am using mice, but just pulling out one imputed data set which I will call imp_test. I will use this when I am validating.

hivtest_mice<- mice(hiv_test1, m = 1, printFlag = F)
imp_test <- complete(hivtest_mice, 1) %>% tibble() 

dim(imp_test)

[1] 7234    9

Below I am just checking to make sure that I have no more missing

n_miss(imp_test)

[1] 0

no more missing!

5.5 Model 1

5.5.1 Fitting Model 1

mod1predicts the log odds of AIDS using the predictors subst_abuse, region, age, sex, insurance, race, patient_loc, zipinc_qrtl
I chose these predictors based on previous literature and reasoning

5.5.1.1 glm model with multiple imputation

First I am running mod1 on each of the 4 imputed data sets

m1_mods <- with(hiv1_mice, 
                glm(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl,
                    family = binomial))

5.5.1.2 lrm & glm with single imputation

Because lrm does not work with mice, I will build an lrm model from one of the 4 imputation sets (lrm requires areg_impute)

The code below stores the 4th imputed data set in imp_4

imp_4 <- complete(hiv1_mice, 4) %>% tibble() 

dim(imp_4)

[1] 16884     9

zz <- datadist(imp_4) 
options(datadist = "zz")

mod1_lrm <- lrm(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)

Also, to use tools like augment for the confusion matrix, I will also need to build a glm model with a single imputed data set.

mod1_glm <- glm(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl, data = imp_4)

5.5.2 tidied table of regression coefficients

Below is a table of the exponentiated coefficients so that they can be interpreted as odds ratios.

NOTE: I have first pooled the results from the 4 analyses with the 4 imputed data sets (m1_pool)

m1_pool <- pool(m1_mods)
sum1 <- summary(m1_pool, exponentiate = TRUE, 
        conf.int = TRUE, conf.level = 0.95) %>% 
    select(-df) 

sum1 %>% kable(digits = c(3, 3, 2, 2, 3, 3))

term	estimate	std.error	statistic	p.value	2.5 %	97.5 %
(Intercept)	1.121	0.10	1.15	0.250	0.922	1.363
subst_abuseyes	0.726	0.04	-7.54	0.000	0.668	0.789
regionNortheast	0.781	0.06	-3.94	0.000	0.691	0.883
regionSouth	1.197	0.06	2.98	0.003	1.063	1.348
regionWest	1.290	0.07	3.84	0.000	1.133	1.470
regionMidwest	0.878	0.07	-1.79	0.074	0.761	1.013
age	0.971	0.00	-16.63	0.000	0.968	0.975
sexfemale	0.884	0.05	-2.67	0.008	0.807	0.968
insuranceMedicare	0.682	0.05	-7.13	0.000	0.614	0.758
insurancePrivate	0.986	0.06	-0.23	0.820	0.876	1.111
insuranceSelf_pay	1.017	0.08	0.22	0.826	0.874	1.184
insuranceOther	0.959	0.12	-0.35	0.725	0.761	1.209
raceWhite	1.000	0.05	0.00	0.997	0.902	1.109
raceHispanic	1.175	0.06	2.70	0.007	1.045	1.322
raceOther	1.145	0.11	1.22	0.224	0.920	1.426
raceAsian	1.496	0.18	2.19	0.029	1.043	2.146
raceNativeA	1.328	0.28	1.02	0.308	0.769	2.291
patient_locFringe	1.170	0.06	2.66	0.008	1.042	1.313
patient_locmetro>250K	0.986	0.06	-0.23	0.815	0.878	1.108
patient_locmetro>50K	0.981	0.10	-0.20	0.838	0.813	1.183
patient_locmicro	0.942	0.11	-0.53	0.594	0.756	1.174
patient_locOther	1.029	0.14	0.20	0.842	0.779	1.358
zipinc_qrtl48-61K	1.025	0.05	0.45	0.651	0.921	1.141
zipinc_qrtl61-82K	1.028	0.06	0.45	0.651	0.911	1.160
zipinc_qrtl82K+	1.005	0.07	0.07	0.944	0.873	1.158

Among hospitalized PLWH (N=16884), after adjusting for region, age,sex, insurance, race, patient_loc, zipinc_qrtl,mod1 predicts that the odds of having and AIDS defining illness in those with SUD is 0.726 (95% CI 0.668, 0.789) times those without SUD

given that the 95% CI is entirely below 1, the model suggests that having a SUD is associated with a lower odds of an AIDS defining illness

5.5.3 key fit summary statistics

Below are the key fit summary statistics like the Nagelkerke R-square and the area under the ROC curve as they are presented in the lrm output

The r2 is very low (0.066) as well as the C statistic (0.650)

mod1_lrm

Logistic Regression Model
 
 lrm(formula = AIDS ~ subst_abuse + region + age + sex + insurance + 
     race + patient_loc + zipinc_qrtl, data = imp_4, x = TRUE, 
     y = TRUE)
 
                        Model Likelihood    Discrimination    Rank Discrim.    
                              Ratio Test           Indexes          Indexes    
 Obs         16884    LR chi2     697.23    R2       0.066    C       0.650    
  0          13804    d.f.            24    g        0.615    Dxy     0.299    
  1           3080    Pr(> chi2) <0.0001    gr       1.850    gamma   0.299    
 max |deriv| 2e-12                          gp       0.088    tau-a   0.089    
                                            Brier    0.143                     
 
                        Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept               0.1123 0.0996   1.13 0.2594  
 subst_abuse=yes        -0.3204 0.0424  -7.56 <0.0001 
 region=Northeast       -0.2455 0.0627  -3.91 <0.0001 
 region=South            0.1800 0.0605   2.98 0.0029  
 region=West             0.2565 0.0663   3.87 0.0001  
 region=Midwest         -0.1298 0.0730  -1.78 0.0754  
 age                    -0.0290 0.0017 -16.60 <0.0001 
 sex=female             -0.1239 0.0463  -2.68 0.0075  
 insurance=Medicare     -0.3858 0.0537  -7.19 <0.0001 
 insurance=Private      -0.0152 0.0605  -0.25 0.8015  
 insurance=Self_pay      0.0186 0.0775   0.24 0.8103  
 insurance=Other        -0.0424 0.1179  -0.36 0.7193  
 race=White             -0.0042 0.0525  -0.08 0.9360  
 race=Hispanic           0.1574 0.0597   2.64 0.0084  
 race=Other              0.1369 0.1113   1.23 0.2185  
 race=Asian              0.4061 0.1838   2.21 0.0272  
 race=NativeA            0.2713 0.2816   0.96 0.3354  
 patient_loc=Fringe      0.1544 0.0587   2.63 0.0085  
 patient_loc=metro>250K -0.0189 0.0589  -0.32 0.7485  
 patient_loc=metro>50K  -0.0123 0.0948  -0.13 0.8967  
 patient_loc=micro      -0.0679 0.1119  -0.61 0.5438  
 patient_loc=Other       0.0478 0.1398   0.34 0.7326  
 zipinc_qrtl=48-61K      0.0360 0.0525   0.69 0.4930  
 zipinc_qrtl=61-82K      0.0380 0.0607   0.63 0.5311  
 zipinc_qrtl=82K+       -0.0073 0.0717  -0.10 0.9186

5.5.4 Confusion Matrix

Below is the code to augment mod1_glm in order to get the predicted values (still within the training sample)

hiv1_aug <- augment(mod1_glm, imp_4, type.predict = "response")

I have plotted mod1_glm fits by observed AIDS status.

ggplot(hiv1_aug, aes(x = factor(AIDS), y = .fitted, col = factor(AIDS))) + geom_boxplot() +
geom_jitter(width = 0.1) + guides(col = FALSE) +
  labs(title = "mod1 fits by observed AIDS status (n=16,884)", footnote= "Highest predicted value <0.5", x= "mod1 fitted probabilities")

Overall, the predicted probabilities is higher for those who actually had an AIDS defining illness. It is important to note that our highest predicted probability does not reach 0.5, so we cannot use that as our cutoff for the confusion matrix. I must make the cutoff something lower.

Below is the confusion matrix (caretpackage). Rather than setting the cutoff at 0.5, I set it at 0.27 after evaluating the plot above.

cmatrix <- hiv1_aug %$%
  caret::confusionMatrix(
    data = factor(.fitted >= 0.27), 
    reference = factor(AIDS == 1), 
    positive = "TRUE"
  ) 

cmatrix

Confusion Matrix and Statistics

          Reference
Prediction FALSE  TRUE
     FALSE 12165  2301
     TRUE   1639   779
                                         
               Accuracy : 0.7666         
                 95% CI : (0.7602, 0.773)
    No Information Rate : 0.8176         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.1464         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.25292        
            Specificity : 0.88127        
         Pos Pred Value : 0.32217        
         Neg Pred Value : 0.84094        
             Prevalence : 0.18242        
         Detection Rate : 0.04614        
   Detection Prevalence : 0.14321        
      Balanced Accuracy : 0.56709        
                                         
       'Positive' Class : TRUE

Key results of the confusion matrix include:

sensitivity: 0.25
specificity: 0.88
positive predictive value: 0.32

NOTE: I tried determining the optimal cutoff using the Youden index. This is used when maximizing specificity and sensitivity are equally desirable. The index is the point that has minimum distance from ROC curve’s (1, 1) point.I attempted approaches from the OptimalCutpoint, and cutpointr packages. I also attempted to find it visually on a curve where sensitivity and specificity intersected. None of the attempts worked, so I moved on.

Of the many sources, I liked this example code the most: https://rpubs.com/harshaash/logistic_regression

# sens_spec_plot <- function(actual_value, positive_class_name, negitive_class_name, hiv1_aug ){
#   # Initialising Variables
#   specificity <- c()
#   sensitivity <- c()
#   cutoff <- c()
#   
#   for (i in 1:100) {
#     predList <- as.factor(ifelse(hiv1_aug  >= i/100, positive_class_name, negitive_class_name))
#     specificity[i] <- specificity(predList, actual_value)
#     sensitivity[i] <- sensitivity(predList, actual_value)
#     cutoff[i] <- i/100
#   }
#   df.sens.spec <- as.data.frame(cbind(cutoff, specificity, sensitivity))
#   
#   ggplot(df.sens.spec, aes(x = cutoff)) +
#     geom_line(aes(y = specificity, color = 'Specificity')) +
#     geom_line(aes(y = sensitivity, color = 'Sensitivity'))+
#     labs(x = 'Cutoff p value', y='Sens/Spec',  title = 'Sensitivity-Specificity plot',fill = 'Plot') +
#       theme_minimal()+ theme(legend.position="bottom")
# }
# 
# sens_spec_plot(actual_value = imp_4$AIDS, positive_class_name = '', negitive_class_name = 'O', pred_probability = imp_4$pred_probability_I)

# AIDS_f <- imp_4 %>% 
#   mutate(AIDS_F = fct_recode(factor(AIDS),
# "no" = "0", "yes" = "1")) %>% 
#   mutate(SUD = fct_recode(factor(subst_abuse),
#                 "1" = "yes",
#                 "0" = "no")) %>% 
#   mutate(SUD = as.numeric(SUD))

# opt_cut2 <- cutpointr(AIDS_f, SUD, AIDS_f, direction = ">=", pos_class = "yes",
#                      neg_class = "no", method = maximize_metric, metric = youden)

# opt_cut <- cutpointr(imp_4, subst_abuse, AIDS_f, direction = ">=", pos_class = "yes",
#                      neg_class = "no", method = maximize_metric, metric = youden)

5.5.5 Nonlinearity

Below is the Spearman rho squared plot to evaluate the predictive punch of each of my variables in mod1

spear_mod1 <- spearman2(AIDS ~ subst_abuse + region + age + sex + insurance + race + patient_loc + zipinc_qrtl, data = imp_4)

plot(spear_mod1)

The Spearman rho-squared plot suggests that I have the most predictive punch with age and insurance. I will:

create an interaction between age and insurance
add a restricted cubic spline for age (4 knots)

5.6 `mod1` comparison to nonlinear models

modifications to mod1

mod1b:
- restricted cubic spline of 4 knots with age
- interaction between age and insurance
mod1c
- restricted cubic spline of 4 knots with age
mod1d
- interaction between age and insurance

I am fitting these models with both glm and lrmas seen below:

mod1b_lrm <- lrm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)

mod1c_lrm <- lrm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)

mod1d_lrm <- lrm(AIDS ~ subst_abuse + region + age + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4 , x = TRUE, y = TRUE)

mod1b_glm <- glm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4)

mod1c_glm <- glm(AIDS ~ subst_abuse + region + rcs(age,4) + insurance  + sex + race + patient_loc + zipinc_qrtl, data = imp_4)

mod1d_glm <- glm(AIDS ~ subst_abuse + region + age + insurance + age%ia%insurance + sex + race + patient_loc + zipinc_qrtl, data = imp_4)

5.6.1 Summary Statistics for in sample fit

The AIC & BIC are the best for mod1c

bind_rows(glance(mod1_glm), glance(mod1b_glm), glance(mod1c_glm), glance(mod1d_glm)) %>% 
  mutate(model = c("1", "1b", "1c", "1d")) %>%
select(model, nobs, deviance, df.residual, AIC, BIC) %>%
  kable(digits = 0)

model	nobs	deviance	df.residual	AIC	BIC
1	16884	2415	16859	15135	15336
1b	16884	2411	16853	15118	15366
1c	16884	2412	16857	15113	15329
1d	16884	2414	16855	15137	15369

5.6.2 Comparison of Models With ANOVA

Since mod1 is a subset of mod1b, mod1c, and mod1d, I can compare these models with ANOVA tests.

5.6.2.1 `mod1` vs `mod1b`

anova(mod1_glm, mod1b_glm)

Analysis of Deviance Table

Model 1: AIDS ~ subst_abuse + region + age + sex + insurance + race + 
    patient_loc + zipinc_qrtl
Model 2: AIDS ~ subst_abuse + region + rcs(age, 4) + insurance + age %ia% 
    insurance + sex + race + patient_loc + zipinc_qrtl
  Resid. Df Resid. Dev Df Deviance
1     16859     2415.3            
2     16853     2411.2  6   4.1426

The addition of a restricted cubic spline with 4 knots for age and an interaction between age and insurance reduces the lack of fit by 4.1426 points, at a cost of 6 degrees of freedom. Thus, the fuller model may not be an improvement.

5.6.2.2 mod1 vs mod1c

anova(mod1_glm, mod1c_glm)

Analysis of Deviance Table

Model 1: AIDS ~ subst_abuse + region + age + sex + insurance + race + 
    patient_loc + zipinc_qrtl
Model 2: AIDS ~ subst_abuse + region + rcs(age, 4) + insurance + sex + 
    race + patient_loc + zipinc_qrtl
  Resid. Df Resid. Dev Df Deviance
1     16859     2415.3            
2     16857     2411.6  2   3.7793

The addition of a restricted cubic spline with 4 knots for age and reduced the lack of fit by 3.7793 points, at a cost of 2 degrees of freedom. Thus, this nonlinear model may not be an improvement.

5.6.2.3 mod1 vs mod1d

anova(mod1_glm, mod1d_glm)

Analysis of Deviance Table

Model 1: AIDS ~ subst_abuse + region + age + sex + insurance + race + 
    patient_loc + zipinc_qrtl
Model 2: AIDS ~ subst_abuse + region + age + insurance + age %ia% insurance + 
    sex + race + patient_loc + zipinc_qrtl
  Resid. Df Resid. Dev Df Deviance
1     16859     2415.3            
2     16855     2414.5  4  0.85066

The addition of an interaction between age and insurance reduced the lack of fit by 0.85066 points, at a cost of 2 degrees of freedom. Thus, the linear model may not be an improvement.

5.6.2.4 Comparing Validated Nagelkerke R-square and C statistic

I used the validate command on each of the models, which provided me with the following results: - R2: higher better - brier score: lower = better (calibration) - C statistic: higher=better

Note: the validated results (table 2) holds more weight when choosing models because it predicts how the model would perform out of sample.

Table 1: Index Fit Statistics Comparing mod1 to Nonlinear Models

Index Summary	`mod1`	`mod1b`	`mod1c`	`mod1d`
Index Nagelkerke \(R^2\)	0.066	0.069	0.069	0.066
Index Brier Score	0.143	0.143	0.143	0.143
Index C	`0.64955	0.6521	0.6519	0.6499

mod1b, mod1c, and mod1d have equal index rsquared (slightly better than mod1 though)
Brier score not useful (all equal)
mod1 has the best C statistic, but negligible

Table 2: Validated Fit Statistics Comparing mod1 to Nonlinear Models

Corrected Summary	`mod1`	`mod1b`	`mod1c`	`mod1d`
Corrected Nagelkerke \(R^2\)	0.0614	0.0630	0.0634	0.0623
Corrected Brier Score	0.143	0.143	0.143	0.143
Corrected C	0.64495	0.64625	0.64635	0.6458

Corrected Nagelkerke \(R^2\) : mod1c
Corrected Brier Score: all equal
Corrected C: mod1c (but mod1b, mod1c, and mod1d are equal to 3 decimal points)

Overall winner: mod1c - although mod1c is negligibly better, its rsquared is very close to 0 and its C statistic shows that the model does not perform much better than guessing. Thus, this is an extremely weak model.

The results shown in table 1 and table 2 were obtained from:

validate(mod1_lrm)

          index.orig training    test optimism index.corrected  n
Dxy           0.2991   0.3043  0.2950   0.0093          0.2898 40
R2            0.0660   0.0683  0.0641   0.0042          0.0617 40
Intercept     0.0000   0.0000 -0.0467   0.0467         -0.0467 40
Slope         1.0000   1.0000  0.9690   0.0310          0.9690 40
Emax          0.0000   0.0000  0.0156   0.0156          0.0156 40
D             0.0412   0.0428  0.0400   0.0028          0.0385 40
U            -0.0001  -0.0001  0.0000  -0.0002          0.0000 40
Q             0.0414   0.0429  0.0400   0.0029          0.0384 40
B             0.1428   0.1430  0.1430  -0.0001          0.1429 40
g             0.6150   0.6272  0.6062   0.0210          0.5940 40
gp            0.0883   0.0899  0.0870   0.0029          0.0853 40

validate(mod1b_lrm)

          index.orig training    test optimism index.corrected  n
Dxy           0.3042   0.3091  0.2991   0.0100          0.2942 40
R2            0.0689   0.0718  0.0664   0.0053          0.0636 40
Intercept     0.0000   0.0000 -0.0612   0.0612         -0.0612 40
Slope         1.0000   1.0000  0.9616   0.0384          0.9616 40
Emax          0.0000   0.0000  0.0201   0.0201          0.0201 40
D             0.0431   0.0450  0.0415   0.0035          0.0396 40
U            -0.0001  -0.0001  0.0001  -0.0002          0.0001 40
Q             0.0432   0.0452  0.0415   0.0037          0.0396 40
B             0.1425   0.1428  0.1428   0.0000          0.1425 40
g             0.6437   0.6554  0.6293   0.0262          0.6175 40
gp            0.0910   0.0930  0.0893   0.0037          0.0873 40

validate(mod1c_lrm)

          index.orig training    test optimism index.corrected  n
Dxy           0.3038   0.3096  0.2988   0.0107          0.2931 40
R2            0.0687   0.0717  0.0663   0.0054          0.0633 40
Intercept     0.0000   0.0000 -0.0534   0.0534         -0.0534 40
Slope         1.0000   1.0000  0.9601   0.0399          0.9601 40
Emax          0.0000   0.0000  0.0187   0.0187          0.0187 40
D             0.0430   0.0449  0.0415   0.0035          0.0395 40
U            -0.0001  -0.0001  0.0001  -0.0002          0.0001 40
Q             0.0431   0.0450  0.0414   0.0037          0.0395 40
B             0.1425   0.1421  0.1428  -0.0007          0.1432 40
g             0.6408   0.6544  0.6272   0.0272          0.6136 40
gp            0.0909   0.0926  0.0891   0.0034          0.0874 40

validate(mod1d_lrm)

          index.orig training    test optimism index.corrected  n
Dxy           0.2998   0.3064  0.2949   0.0115          0.2883 40
R2            0.0664   0.0697  0.0640   0.0057          0.0607 40
Intercept     0.0000   0.0000 -0.0586   0.0586         -0.0586 40
Slope         1.0000   1.0000  0.9572   0.0428          0.9572 40
Emax          0.0000   0.0000  0.0204   0.0204          0.0204 40
D             0.0415   0.0436  0.0400   0.0036          0.0379 40
U            -0.0001  -0.0001  0.0001  -0.0002          0.0001 40
Q             0.0416   0.0437  0.0399   0.0038          0.0378 40
B             0.1427   0.1423  0.1430  -0.0008          0.1435 40
g             0.6243   0.6394  0.6104   0.0290          0.5953 40
gp            0.0887   0.0906  0.0870   0.0036          0.0851 40

5.6.3 Metrics for test sample

Below I am fitting the 4 models to the testing sample, imp_test.

mod1_aug_test <- augment(mod1_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))

mod1b_aug_test <- augment(mod1b_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))

mod1c_aug_test <- augment(mod1c_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))

mod1d_aug_test <- augment(mod1d_glm,
newdata = imp_test,
type.predict = "response") %>% 
  mutate(obs = factor(AIDS),
pred = factor(ifelse(.fitted >= 0.27, 1, 0)))

NOTE: Rather than setting the cutoff at 0.5, I set it at 0.27 (based on plot in the confusion matrix section). We do not have any predicted values as high as 0.5, so I needed to make it something lower. This subjectivity is a major limitation

Below is a table that compares the kappa and accuracy for each of the models in the holdout sample.

 comp <- bind_cols(
yardstick::metrics(data = mod1_aug_test,
        truth = obs, estimate = pred)%>% 
  select(.metric, mod1 = .estimate),
yardstick::metrics(data = mod1b_aug_test,
        truth = obs, estimate = pred) %>%
   select(mod1b = .estimate), 
yardstick::metrics(data = mod1c_aug_test,
        truth = obs, estimate = pred) %>% 
  select(mod1c = .estimate), 
  yardstick::metrics(data = mod1d_aug_test,
        truth = obs, estimate = pred) %>% 
  select(mod1d = .estimate))

comp %>% kable(dig=3)

.metric	mod1	mod1b	mod1c	mod1d
accuracy	0.769	0.762	0.762	0.767
kap	0.152	0.164	0.163	0.154

Accuracy:

Best for mod1 (which makes sense because the cutoff was set based on this model)
- only 77% of estimates correct is not very good
Second best for mod1d

Kappa (measure of inter-rater reliability, perfect agreement=1)

best for mod1b
- Kappa is basically the strength of the correlation between what we predicted and what the actual was. 0.161 is a really weak correlation

5.6.4 Fit Statistics in test sample

5.7 Final Model

I prefer mod1 based on the similar results for fit quality and its lower complexity:

overall assessment of fit quality:

In sample fit statistics
- AIC, BIC: mod1c wins
- ANOVA: none of the models resulted in a drop in deviance in comparison to mod1 (this finding doesn’t hold much weight)
- R2: all were equal when rounding to 2 decimal points (0.07)
- Brier score: all identical to 3 decimal points (0.143)
- C: all identical when rounding to 2 decimal points (0.65)
Validated fit statistics
- R2: all were equal when rounding to 2 decimal points (0.06)
- Brier score: all identical to 3 decimal points (0.143)
- C: all very similar
  - mod1b and mod1c were the highest: 0.646
  - mod1and mod1d were slightly lower: 0.645

Not worth the complication of adding non-linear terms

According to the ANOVA tests, each of the models with nonlinear terms reduced the lack of fit in comparison to the original model
There was no improvement in predicting as we can see by the lack of substantial improvement in the accuracy or kappa when evaluating the model out of sample

Non statistical considerations: we want simple models.

5.8 Model Parameters

Below is a listing of the model parameters for mod1 fit to the entire data set (after multiple imputation) in terms of odds ratios, with 95% confidence intervals

m1_pool <- pool(m1_mods)
sum1 <- summary(m1_pool, exponentiate = TRUE, 
        conf.int = TRUE, conf.level = 0.95) %>% 
    select(-df) 

sum1 %>% kable(digits = c(3, 3, 2, 2, 3, 3))

term	estimate	std.error	statistic	p.value	2.5 %	97.5 %
(Intercept)	1.121	0.10	1.15	0.250	0.922	1.363
subst_abuseyes	0.726	0.04	-7.54	0.000	0.668	0.789
regionNortheast	0.781	0.06	-3.94	0.000	0.691	0.883
regionSouth	1.197	0.06	2.98	0.003	1.063	1.348
regionWest	1.290	0.07	3.84	0.000	1.133	1.470
regionMidwest	0.878	0.07	-1.79	0.074	0.761	1.013
age	0.971	0.00	-16.63	0.000	0.968	0.975
sexfemale	0.884	0.05	-2.67	0.008	0.807	0.968
insuranceMedicare	0.682	0.05	-7.13	0.000	0.614	0.758
insurancePrivate	0.986	0.06	-0.23	0.820	0.876	1.111
insuranceSelf_pay	1.017	0.08	0.22	0.826	0.874	1.184
insuranceOther	0.959	0.12	-0.35	0.725	0.761	1.209
raceWhite	1.000	0.05	0.00	0.997	0.902	1.109
raceHispanic	1.175	0.06	2.70	0.007	1.045	1.322
raceOther	1.145	0.11	1.22	0.224	0.920	1.426
raceAsian	1.496	0.18	2.19	0.029	1.043	2.146
raceNativeA	1.328	0.28	1.02	0.308	0.769	2.291
patient_locFringe	1.170	0.06	2.66	0.008	1.042	1.313
patient_locmetro>250K	0.986	0.06	-0.23	0.815	0.878	1.108
patient_locmetro>50K	0.981	0.10	-0.20	0.838	0.813	1.183
patient_locmicro	0.942	0.11	-0.53	0.594	0.756	1.174
patient_locOther	1.029	0.14	0.20	0.842	0.779	1.358
zipinc_qrtl48-61K	1.025	0.05	0.45	0.651	0.921	1.141
zipinc_qrtl61-82K	1.028	0.06	0.45	0.651	0.911	1.160
zipinc_qrtl82K+	1.005	0.07	0.07	0.944	0.873	1.158

After adjusting for region, age, sex, insurance, race, zipinc_qrtl, the odds of an AIDS defining illness in PLWH with a substance disorder is 0.726 95% CI ( 0.668 to 0.789) time the odds in those without a substance use disorder.
- given that the 95% CI is entirely below 1, the model suggests that having a SUD is associated with a lower odds of an AIDS defining illness

5.9 Effect sizes

Below are the effect sizes for all elements of mod1 both numerically and graphically.

plot(summary(mod1_lrm))

kable(summary(mod1_lrm, conf.int=0.95), digits=3)

	Low	High	Diff.	Effect	S.E.	Lower 0.95	Upper 0.95	Type
age	40	58	18	-0.523	0.031	-0.584	-0.461	1
Odds Ratio	40	58	18	0.593	NA	0.557	0.631	2
subst_abuse - yes:no	1	2	NA	-0.320	0.042	-0.404	-0.237	1
Odds Ratio	1	2	NA	0.726	NA	0.668	0.789	2
region - Northeast:South_Atlantic	1	2	NA	-0.245	0.063	-0.368	-0.123	1
Odds Ratio	1	2	NA	0.782	NA	0.692	0.885	2
region - South:South_Atlantic	1	3	NA	0.180	0.060	0.061	0.299	1
Odds Ratio	1	3	NA	1.197	NA	1.063	1.348	2
region - West:South_Atlantic	1	4	NA	0.257	0.066	0.127	0.386	1
Odds Ratio	1	4	NA	1.292	NA	1.135	1.472	2
region - Midwest:South_Atlantic	1	5	NA	-0.130	0.073	-0.273	0.013	1
Odds Ratio	1	5	NA	0.878	NA	0.761	1.013	2
sex - female:male	1	2	NA	-0.124	0.046	-0.215	-0.033	1
Odds Ratio	1	2	NA	0.883	NA	0.807	0.967	2
insurance - Medicare:Medicaid	1	2	NA	-0.386	0.054	-0.491	-0.281	1
Odds Ratio	1	2	NA	0.680	NA	0.612	0.755	2
insurance - Private:Medicaid	1	3	NA	-0.015	0.060	-0.134	0.103	1
Odds Ratio	1	3	NA	0.985	NA	0.875	1.109	2
insurance - Self_pay:Medicaid	1	4	NA	0.019	0.077	-0.133	0.170	1
Odds Ratio	1	4	NA	1.019	NA	0.875	1.186	2
insurance - Other:Medicaid	1	5	NA	-0.042	0.118	-0.274	0.189	1
Odds Ratio	1	5	NA	0.959	NA	0.761	1.208	2
race - White:Black	1	2	NA	-0.004	0.052	-0.107	0.099	1
Odds Ratio	1	2	NA	0.996	NA	0.898	1.104	2
race - Hispanic:Black	1	3	NA	0.157	0.060	0.040	0.275	1
Odds Ratio	1	3	NA	1.171	NA	1.041	1.316	2
race - Other:Black	1	4	NA	0.137	0.111	-0.081	0.355	1
Odds Ratio	1	4	NA	1.147	NA	0.922	1.426	2
race - Asian:Black	1	5	NA	0.406	0.184	0.046	0.766	1
Odds Ratio	1	5	NA	1.501	NA	1.047	2.152	2
race - NativeA:Black	1	6	NA	0.271	0.282	-0.281	0.823	1
Odds Ratio	1	6	NA	1.312	NA	0.755	2.278	2
patient_loc - Fringe:Central	1	2	NA	0.154	0.059	0.039	0.269	1
Odds Ratio	1	2	NA	1.167	NA	1.040	1.309	2
patient_loc - metro>250K:Central	1	3	NA	-0.019	0.059	-0.134	0.097	1
Odds Ratio	1	3	NA	0.981	NA	0.874	1.101	2
patient_loc - metro>50K:Central	1	4	NA	-0.012	0.095	-0.198	0.174	1
Odds Ratio	1	4	NA	0.988	NA	0.820	1.189	2
patient_loc - micro:Central	1	5	NA	-0.068	0.112	-0.287	0.151	1
Odds Ratio	1	5	NA	0.934	NA	0.750	1.163	2
patient_loc - Other:Central	1	6	NA	0.048	0.140	-0.226	0.322	1
Odds Ratio	1	6	NA	1.049	NA	0.798	1.380	2
zipinc_qrtl - 48-61K:<48K	1	2	NA	0.036	0.053	-0.067	0.139	1
Odds Ratio	1	2	NA	1.037	NA	0.935	1.149	2
zipinc_qrtl - 61-82K:<48K	1	3	NA	0.038	0.061	-0.081	0.157	1
Odds Ratio	1	3	NA	1.039	NA	0.922	1.170	2
zipinc_qrtl - 82K+:<48K	1	4	NA	-0.007	0.072	-0.148	0.133	1
Odds Ratio	1	4	NA	0.993	NA	0.863	1.142	2

Substance Abuse

if we have two subjects, Al and Bob, who have the same values for region, age, sex, insurance, race, patient_loc, and zipinc_qrtl, but Al has a SUD and Bob does not, then mod1 projects that Al’s odds of having an AIDS defining illness will be 0.726 times Bob’s odds of having an AIDS defining illness.
a comorbid substance use disorders appears to be associated with decreasing odds of having an AIDS defining illness. Note, too, that the effect of subst_abuse=yes on the odds of AIDS has a confidence interval for the odds ratio entirely below 1

other interesting findings

Age

if we have two subjects, Al and Bob who have the same values for subst_abuse, region, sex, insurance, race, patient_loc, and zipinc_qrtl, but Al is age 40 and Bob is age 58, then mod1 projects that Bob’s odds of having an AIDS defining illness will be 0.593 times Al’s odds of having an AIDS defining illness. Bob’s odds are 59.3% as large as Al’s, equivalently.
increasing age appears to be associated with decreasing odds of having an AIDS defining illness. Note, too, that the effect of age on the odds of AIDS has a confidence interval for the odds ratio entirely below 1

region

if we have two subjects, Al and Bob, who have the same values for subst_abuse, age, sex, insurance, race, patient_loc, and zipinc_qrtl, but Al lives on the West Coast and Bob lives in the South Atlantic, then mod1 projects that Al’s odds of having an AIDS defining illness will be 1.292 times Bob’s odds of having an AIDS defining illness. Bob’s odds are 29.2% higher than Al’s, equivalently.

sex

if we have two subjects, Lindsay and Luke, who have the same values for subst_abuse, region, age, insurance, race, patient_loc, and zipinc_qrtl, but Lindsay is female and Luke is male, mod1 projects that Lindsay’s odds of having an AIDS defining illness will be 0.883 times Luke’s odds of having an AIDS defining illness.
female sex appears to be associated with decreasing odds of having an AIDS defining illness. Note, too, that the effect of sex=female on the odds of AIDS has a confidence interval for the odds ratio entirely below 1

Insurance

if we have two subjects, Al and Bob, who have the same values for subst_abuse, region, sex, age, race, patient_loc, and zipinc_qrtl, but Al has Medicare and Bob has Medicaid, then mod1 projects that Bob’s odds of having an AIDS defining illness will be 0.680 times Al’s odds of having an AIDS defining illness.
having medicare as the primary payer vs medicaid as the primary payer appears to be associated with decreasing odds of having an AIDS defining illness. Note, too, that the effect of medicare vs medicaid on the odds of AIDS has a confidence interval for the odds ratio entirely below 1

I think its interesting that the data shows that having medicaid is associated with increased odds of having an AIDS defining illness when compared to all other primary payers (except self pay). However, the odds ratio crosses 1 for all but Medicare vs Medicaid.

race

if we have two subjects, Al and Bob, who have the same values for subst_abuse, age, sex, insurance, region, patient_loc, and zipinc_qrtl, but Al is Hispanic and Bob is Black, then mod1 projects that Al’s’s odds of having an AIDS defining illness will be 1.171 times Bob’s odds of having an AIDS defining illness.
Being Hispanic appears to be associated with increasing odds of having an AIDS defining illness. Note, too, that the effect of Hispanic race on the odds of AIDS has a confidence interval for the odds ratio entirely above 1

patient location

if we have two subjects, Al and Bob, who have the same values for subst_abuse, age, sex, insurance, region, race, and zipinc_qrtl, but Al lives in a fringe county and Bob lives in a Central county, then mod1 projects that Al’s odds of having an AIDS defining illness will be 1.167 times Bob’s odds of having an AIDS defining illness.
other than living in a fringe county versus a central county, there was no substantial separation between the population density of where a patient lives and their odds of having AIDS.

income

the effect of zipinc_qrtl was not meaningful.
- The point estimate of the OR was close to 1 for all levels
- The 95% CI crossed 1 for all levels
Conclusion: the data suggests that median household income based on zip code has a substantial effect on whether PLWH have an AIDS defining illness.
if we have two subjects, Al and Bob, who have the same values for subst_abuse, age, sex, insurance, region, race, and patient_loc, but Al lives in a fringe county and Bob lives in a Central county, then mod1 projects that Al’s odds of having an AIDS defining illness will be 1.167 times Bob’s odds of having an AIDS defining illness.
other than living in a fringe county versus a central county, there was no substantial separation between the population density of where a patient lives and their odds of having AIDS.

6 Conclusion for Analysis 1

My first research question was, “In PLWH in 2018, how does hospitalization due to an AIDS defining illnesses in those with a SUD compare to those without a SUD?” This is an important question because the prevalence of SUD is high in PLWH (estimated 48%) and it has shown to have deleterious impacts on medication adherence, retention to services, time to diagnosis, and care linkage. AIDS defining illnesses are an indication of disease progression, thus it would be valuable to quantify the effect SUD has on disease progression. There have been 3 cohort studies (post HAART era) that have evaluated this relationship, each of which found a higher burden of AIDS defining illnesses in those with SUD. However, these cohort use data from years 2003 and 2004. Thus, these studies were conducted before the introduction of integrase strand inhibitors and single tablet regimens, both of which have greatly impacted adherence and viral load suppression. Furthermore, none of these studies were nationally representative.

My model indicated that after adjusting for demographics and socioeconomic factors, the odds of hospitalization due to an AIDS defining illness in those with SUD was 0.726 95% CI (0.668 to 0.789). Clearly these results do not match the previous findings. However, my study differs from the previous studies in some important ways. The previous studies:

Adjusted for information that I did not have access to such as lab values ( i.e. CD4 count, viral load), opportunistic infection prophylaxis, and sharing needles
Had different definitions of AIDS defining illness (eg Lucas et al required that PCP and candida esophagitis had to be recurrent in order to be considered an AIDS defining opportunistic infection)
Had different definitions of SUD: the other study’s definitions of substance abuse disorder only included cocaine/crack or heroin. My definition of substance use disorder was much broader and included abuse of alcohol, opioids, sedatives, hypnotics, anxiolytics, cocaine, other stimulants, hallucinogens, inhalants, or other psychoactive substances/multiple drug use.
Were not nationally representative: the study by Cook et al (n=1,686) and Anastos et al (n=961) only included women, and the study by Lucas et al. only included people from Maryland (n=1,851). My study was nationally representative of all U.S. hospitalizations in 2018 (n=24,118)

Limitations

This data was collected for the purpose of reimbursement, not research. Thus, conditions may not have been coded if there was no reimbursement associated with them.
I could not adjust for the patients HIV medication regimen, previous history of opportunistic infections, or opportunistic infection prophylaxis
Lack of granularity of the ICD10 codes
I could have adjusted for immunocompromising comorbidities
My final model was very weak (r2=0.07, C statistic=0.65)
The assessment of the model’s accuracy and kappa value were based on an arbitrary cut off of 0.27
I had a lot of trouble determining the Youden Index. A major limitation to my confusion matrix was the arbitrary cutoff point of 0.27. Determining the Youden Index would allow me to find the point where both sensitivity and specificity are maximized.

Strengths of my study:

Nationally representative with a large sample size
Very low percentage of missing values (maximum was 3.5%) and I used multiple imputation to deal with the missing values
Evaluated multiple different models with different combinations of nonlinear terms
Adjusted for demographics and socioeconomic status

My next steps are:

adjust for immunocompromising conditions
include propensity score matching

age, gender, race, total comorbidity number (not sure how to get that though),# of procedures, admission type, insurance, income quartile, hospital bed size, location, hospital teaching status (HCUP-NIS paper matched on https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5514640/)

refine the definition of substance use disorder to stimulant use only (I read that stimulant use is associated with increased HIV viral replication and there are a multitude of studies showing the relationship between increased viral load and stimulant use)
Do a stability analysis where I just do complete cases

What I have learned about statistics/data science:

How to do multiple imputation with mice
This was my first time working with a really big data set in 431/432 and it was interesting to see how it affects the precision of estimates
I liked using the strata option when splitting my sample (first time using it).
How to mentally process and emotionally cope with results that do not match up with what you are expecting to happen. I will use some more researchers degrees of freedom with my exposure and outcome definition (to match previous studies better)
paste(colnames(hiv), collapse = " , “) or paste(colnames(hiv), collapse =” + ") is like one of the most helpful tools ever
You can save a lot of trouble with conflicts by just loading packages that you only expect to use once or twice and just specify them when you need them.
Figures speak a lot more than tables!! Use them WHENEVER possible!!! Especially for effect sizes. Except for the ROC curve (one of the few exceptions to a graph not being that helpful)

7 Analysis 2: Length of Stay (Count Outcome)

7.1 Plans

7.1.1 Count Outcome

My count outcome is los
There are no missing cases on this outcome

hiv %>% select(los) %>% miss_case_table() %>% kable(dig=0)

n_miss_in_case	n_cases	pct_cases
0	24118	100

7.1.2 Planned Predictors

Like analysis 1, I will be using subst_abuse (main predictor), age, race, region, zipinc_qrtl, and insurance
Additionally I will be using ED_record and AIDS_f

7.2 Imputation

I will do simple imputation before splitting the data.

The mice package will do all of the imputation for me. I will just pull out one data set.

set.seed(99)
hiv_mice2 <- mice(hiv, m = 1, printFlag = FALSE)

I will now store the 1st imputed data set inhiv2

hiv2 <- complete(hiv_mice2, 1) %>% tibble() %>% select(key_nis, los, subst_abuse, AIDS_f, age, race,  region,  zipinc_qrtl,  insurance, ED_record)

dim(hiv2)

[1] 24118    10

NOTE: I wanted to use the full list of variables for imputing, but hiv2 only has the variables I require for this analysis

And I do not have any more missing!

n_miss(hiv2)

[1] 0

7.3 Splitting the data (again)

I am splitting the singly imputed hiv2 sample into a training (70% of the data) and testing sample (30% of the data). I am using the function strata to ensure that both data sets have an equal proportion of my main predictors of interest, subst_abuse and AIDS_f

set.seed(1)
hiv_split2 <- rsample::initial_split(hiv2, prop = 0.7,
strata = subst_abuse, AIDS_f)
hiv_train2 <- rsample::training(hiv_split2) 
hiv_test2 <- rsample::testing(hiv_split2)

7.4 Exploratory Analyses

7.4.1 distribution of `los`

The distribution of los is shown below with the following histogram:

ggplot(hiv_train2, aes(x = los)) +
    geom_histogram(fill = "slateblue", col = "white", 
                   binwidth = 2) + 
  labs(title = "Distribution of length of inpatient stay of HIV patients", subtitle = "24081 observations from NIS 2018 data",
x = "Length of stay (days)", y = "Count")

We can see that length of stay is a count variable and follows the count properties that it is:

only positive
does not follow a Normal distribution

Thus, we cannot model this variable with linear regression because linear regression (1) assumes normal distribution and (2) would estimate some subjects as having negative counts. My conclusion from this figure is that in order to analyze this outcome, I must fit a general linear model that will allow for Poisson or negative binomial distribution.

We can also see that the most common length of stay was 1 day, but there are a lot of people with a stay of 0, 2, and 3 days. There are a reasonable number of people with up to about 20 days in the hospital. The maximum is 257 days, which you cannot even see. This sort of skewed data is very common with counts.

7.4.2 Distribution by SUD status

Here we can compare the distribution of los bysubst_abuse

ggstatsplot::ggbetweenstats(
  data = I(hiv_train2), 
  x = subst_abuse,
  y = los,
  ylab = "length of stay (days)",
  xlab = "substance use disorder",
  title = "Distribution of length of stay by SUD status for HIV patients (2018 HCUP-NIS)",
  type = "np")

Important revelations from this figure:

Like in the histogram above, we can see that its important that we are using Poisson regression since we have all positive values
Most of the data for both groups are below 50 days, but the median is still super close to the bottom

according to the figure 5 days for no SUD and 4 days for SUD
- the separation of about 2 days is already indicating that there might be a difference between the two groups (just not in the direction I was expecting)
- Although one day may seem small and possibly not that meaningful, that could be associated with higher hospital costs and is still important.

Some extreme outliers (the super super high ones are a little bit more of a problem in the SUD group)
The groups look pretty well balanced in terms of observations in each group

7.4.3 5.3.1 Hmsic describe

There were 111 possible values
The highest possible values were 180 185 196 204 247
Half the people had 4 or fewer days in the hospital
75% of people had 8 or fewer days in the hospital
95% had 21 or fewer days in the hospital
More than 10% of the data was a count of 1 (a lot of 1’s)

hiv_train2 %$% Hmisc::describe(los)

los 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
   16883        0      111     0.99    7.026    7.162        1        1 
     .25      .50      .75      .90      .95 
       2        4        8       15       21 

lowest :   0   1   2   3   4, highest: 185 196 204 247 294

7.5 Fitting the models

There are two ways to evaluate count outcomes:

Poisson Approach
Negative Binomial Approach

These two regression approaches tend to under count the number of 0’s in the data. Therefore we can augment the Poisson or negative binomial model in two important ways:

Zero Inflation:

adds a logistic regression model to predict whether somebody is a 0 or not and then does a Poisson model to predict the counts
- ZIP: zero inflated Poisson model
- ZINB: zero inflated negative binomial model

Hurdle model

there are two processes:
- first process: whether you clear the hurdle of having a count bigger than 0
- second process: if the count is >0, then this process predicts the count

Because the distributions are different for Poisson regression and negative binomial regression, we can observe different curves/ relationships.

All models were fit using the pscl package

7.5.1 Fiting Poisson Model, `POIS`

We assume the count outcome, los, follows a Poisson distribution with a mean conditional on the predictors subst_abuse, AIDS_f, age, race, region, zipinc_qrtl, insurance, ED_record
The Poisson model uses a logarithm as its link function, so the model is actually predicting log(los)

POIS <- glm(los ~ subst_abuse + AIDS_f + age + race + region + zipinc_qrtl + insurance + ED_record ,
data = hiv_train2, family = "poisson")

7.5.2 Fitting Negative Binomial Model, `NB`

This model is a generalization of the Poisson model. It is different because it loosens the assumption about dispersion (variance can be bigger than the mean).
Here I am using the glm.nb function from the MASS package to fit the negative binomial model.
Like Poisson, it will also predict the logarithm of los.

NB <- MASS::glm.nb(los ~ subst_abuse + AIDS_f + age + race + region + zipinc_qrtl + insurance + ED_record,
              data = hiv_train2)

7.5.2.1 Theta

The estimated dispersion parameter value \(\theta\) is..

theta <- summary(NB)$theta

theta

[1] 1.482321

Since we are comparing this back to 1, the \(\theta\) is 48.232106 % higher than the Poisson model

7.5.3 Fitting Zero Inflated Poisson Model (ZIP)

ZIP involves two processes:

First a logistic regression model will predict excess 0’s.
Then everyone who is not predicted to have a 0 count will have their count predicted with poisson regression.

ZIP <- pscl::zeroinfl(los ~ subst_abuse + AIDS_f + age + race + region + zipinc_qrtl + insurance + ED_record,
                    data = hiv_train2)

Note: zeroinf defaults to a Poisson distribution.

7.5.4 Fitting Zero-Inflated Negative Binomial (`ZINB`) model

As in the ZIP, we assume there are two processes involved:

a logistic regression model is used to predict excess zeros
while a negative binomial model is used to predict the counts

ZINB <- zeroinfl(los ~ subst_abuse + AIDS_f + age + race + region + zipinc_qrtl + insurance + ED_record,
              dist = "negbin", data = hiv_train2)

NOTE: this time I specified that the distribution is negative binomial

7.5.5 Fitting a Hurdle Model / Poisson-Logistic (`hurdlePOIS` )

The interpretation of the hurdle model is that one process governs whether a patient has alos or not, and another process governs how many los are made.

hurdlePOIS <- hurdle(los ~ subst_abuse + AIDS_f + age + race + region + zipinc_qrtl + insurance + ED_record,
              dist = "poisson", zero.dist = "binomial", 
              data = hiv_train2)

7.6 Fitting a Hurdle Model / NB-Logistic (`hurdleNB`)

Like above, I am fitting another hurdle model, but this one uses a negative binomial distribution

hurdleNB <- hurdle(los ~ subst_abuse + AIDS_f + age + race + region + zipinc_qrtl + insurance + ED_record,
              dist = "negbin", zero.dist = "binomial", 
              data = hiv_train2)

7.7 Comparison of rootograms (in sample performance)

A rootogram will be the key tool that I will use to determine whether the model fits well. I will also use a hypothesis testing approach to help me pick between zero inflated and hurdle versions of the Poisson model (as compared to just plain Poisson model) or for negative binomial choosing between one with zero inflation or without.

Interpreting the rootogram:

red line: what the model thinks is going to happen (the square root of the number of people who will fall in the 0,1,2 etc group)
grey bars: the height represents the number of observed counts
- bar below 0: under prediction (model predicts fewer of the values than the data shows)
- bar above 0: over prediction
- We want the bottom of all the bars to be at 0

7.7.1 Rootograms for Poisson, Negative Binomial, ZIP, and ZINB

I am comparing the rootograms for the POIS,NB, ZIP, and ZINB model.

Of note, Although we had observed los out to 248 days, none of the models predicted past 55 days, so I am showing 55 days as the maximum. The full rootograms can be seen in the next subsection

par(mfrow = c(2,2))

countreg::rootogram(POIS, max = 55, 
                    main = "Poisson")
countreg::rootogram(NB, max = 55, 
                    main = "Negative Binomial")
countreg::rootogram(ZIP, max = 55,
                    main = "ZIP")
countreg::rootogram(ZINB, max = 55,
                    main = "ZINB")

par(mfrow = c(1,1))

Without a doubt the models on the right that allow for dispersion (negative binomial) are much better than the models on the left

Model	Rootogram impressions
`POIS`	Many problems. Data appear overdispersed.(under predicts for counts 0-5 days, then over-predicts for counts between 6-13 days, predicts 14 and 15 days pretty well, then under predicts all remaining length of stays. Does not make any prediction after about 25 days)
`NB`	still many problems (over predicts zeros, still under predicts 2,3,4,5 days, then under over predicts up to day 15 but not as bad as poisson, does pretty well with 20-35, does not make any predictions past 55 days). Good things: predicts `los` of 1, 20, and 25 days really well, smaller gaps than the poisson model, and predicts higher counts
`ZIP`	All of the same problems as the poisson model, however it gets the counts for 0’s completely accurate (the point of zero inflation models)
`ZINB`	looks identical to the negative binomial model. The zeros weren’t inflated with `NB` (actually over predicted), so 0’s were over predicted here as well.

7.7.1.1 Complete Rootograms for Poisson, Negative Binomial, ZIP, and ZINB

Here I am just showing the rootograms out the the maximum observed los, 247 days.

par(mfrow = c(2,2))

countreg::rootogram(POIS, max = 247, 
                    main = "Poisson")
countreg::rootogram(NB, max = 247, 
                    main = "Negative Binomial")
countreg::rootogram(ZIP, max = 247,
                    main = "ZIP")
countreg::rootogram(ZINB, max = 247,
                    main = "ZINB")

par(mfrow = c(1,1))

We can really see just how awful these models were at predicting the higher numbers.

7.7.2 Poisson-Based Rootograms - Hurdle vs ZIP

Here I am comparing the two augmentations to the Poisson model which both corrected for the under predictions of 0’s

par(mfrow = c(2,1))

countreg::rootogram(ZIP, 
                    main = "ZIP")
countreg::rootogram(hurdlePOIS, 
                    main = "Poisson-Logistic Hurdle (hurdlePOIS)")

par(mfrow = c(1,1))

These models look identical. Identically Awful. So not much to choose from here

7.7.2.1 hypothesesis testing for Poisson based models

Here I will do a hypothesis test to see if the ZIP or hurdlePOIS models were actually improvements over the POIS model

vuong(POIS, ZIP)

NA or numerical zeros or ones encountered in fitted probabilities
dropping these 1 cases, but proceed with caution
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                  -10.180428 model2 > model1 < 2.22e-16
AIC-corrected         -9.837466 model2 > model1 < 2.22e-16
BIC-corrected         -8.511232 model2 > model1 < 2.22e-16

This hypothesis test is indicating that the ZIP model is actually an improvement over the POIS model

7.7.2.2 Vuong test: Comparing `ZIP` and `hurdlePOIS`

Since we saw that ZIP was an improvement, but looked identical to hurdlePOIS by the rootogram, I will use the vuong test to help make a decision.

vuong(ZIP, hurdlePOIS)

NA or numerical zeros or ones encountered in fitted probabilities
dropping these 1 cases, but proceed with caution
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A p-value
Raw                    1.270333 model1 > model2 0.10198
AIC-corrected          1.270333 model1 > model2 0.10198
BIC-corrected          1.270333 model1 > model2 0.10198

With P=0.1019, it looks like this hypothesis test didn’t see a difference in the predicted probabilities for each count of these non-nested models.

7.7.3 NB-Based Rootograms - Which Looks Best?

Here I am comparing the two augmentations to the Negative Binomial which both had processes to predict whether someone had a zero day length of stay or not.

par(mfrow = c(2,1))

countreg::rootogram(ZINB, 
                    main = "ZINB")
countreg::rootogram(hurdleNB, 
                    main = "NB-Logistic Hurdle")

par(mfrow = c(1,1))

The rootogram for the hurdleNB model looks better than the ZINB model. It actually gets the zeros perfect! However, it does over predict 1’s which the ZINB model did great at getting right.

7.7.3.1 hypothesesis testing for NB based models

Here I will do a hypothesis test to see if the ZINB or hurdleNB models were actually improvements over the NB model

vuong(NB, ZINB)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A p-value
Raw                3.174499e-04 model1 > model2 0.49987
AIC-corrected      1.475819e+05 model1 > model2 < 2e-16
BIC-corrected      7.182859e+05 model1 > model2 < 2e-16

This hypothesis test is indicating that the ZINB model is actually an improvement over the NB model

7.7.3.2 Vuong test: Comparing `ZINB` and `hurdleNB`

Since we saw that hurdleNB possibly looked better than ZINB, I am interested in seeing what the hypothesis test says

vuong(ZINB, hurdleNB)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                   -21.27955 model2 > model1 < 2.22e-16
AIC-corrected         -21.27955 model2 > model1 < 2.22e-16
BIC-corrected         -21.27955 model2 > model1 < 2.22e-16

The hypothesis test indicates that the predicted probabilities were not the same, so the hurdleNB model may be an improvement

In conclusion, the negative binomial models were considerably better than the Poisson models. However, they were unable to predict los values past 55 days and we had a great number of people with much higher los values than 55. The hurdle model was the best of the negative binomial based models in terms of predicting 0’s which makes this my favorite model. However, its unfortunate that it over predicted 1’s.

Additional consideration with the model choice: Although the hudrleNBmodel was much better at predicting 0’s, it didn’t do as well with higher values. In this case, it might actually be more desirable to choose the ZINB model because it is more desirable to predict longer los in terms of the viewpoint of prioritizing healthcare expenditure.

7.8 Fit statistics (in sample)

7.8.1 Store Training Sample Predictions

I am using augment to store the predictions for POIS and NB within hiv_train2.

I included "response" so that I am predicting los, not log(los).

POIS_aug <- augment(POIS, hiv_train2, 
                     type.predict = "response")

NB_aug <- augment(NB, hiv_train2, 
                     type.predict = "response")

We have no augment or other broom functions available for zero-inflated models, so I am storing the ZIP,ZINB, hurdlePOIS, and hurdleNB predictions like this:

ZIP_aug <- hiv_train2 %>%
    mutate(".fitted" = predict(ZIP, type = "response"),
           ".resid" = resid(ZIP, type = "response"))

ZINB_aug <- hiv_train2 %>%
    mutate(".fitted" = predict(ZINB, type = "response"),
           ".resid" = resid(ZINB, type = "response"))

hurdlePOIS_aug <- hiv_train2 %>%
    mutate(".fitted" = predict(hurdlePOIS, type = "response"),
           ".resid" = resid(hurdlePOIS, type = "response"))

hurdleNB_aug <- hiv_train2 %>%
    mutate(".fitted" = predict(hurdleNB, type = "response"),
           ".resid" = resid(hurdleNB, type = "response"))

7.8.2 Summarizing Training Sample Fits

Within hiv_train: POIS_aug, ZINB_aug, hurdlePOIS_aug, and hurdleNB_aug now contain both the actual counts (los) and the predicted counts (in .fitted) from POIS, ZINB, hurdlePOIS, and hurdleNB, respectively.

I am using yardstick to summarize the fit with the statistics rsq, rmse, and mae. I will then compare these values for each of the models.

mets <- metric_set(rsq, rmse, mae)

POIS_summary <- 
  mets(POIS_aug, truth = los, estimate = .fitted) %>%
  mutate(model = "POIS") %>% relocate(model)

NB_summary <- 
  mets(NB_aug, truth = los, estimate = .fitted) %>%
  mutate(model = "NB") %>% relocate(model)

ZIP_summary <- 
  mets(ZIP_aug, truth = los, estimate = .fitted) %>%
  mutate(model = "ZIP") %>% relocate(model)

ZINB_summary <- 
  mets(ZINB_aug, truth = los, estimate = .fitted) %>%
  mutate(model = "ZINB") %>% relocate(model)

hurdlePOIS_summary <- 
  mets(hurdlePOIS_aug, truth = los, estimate = .fitted) %>%
  mutate(model = "hurdlePOIS") %>% relocate(model)

hurdleNB_summary <- 
  mets(hurdleNB_aug, truth = los, estimate = .fitted) %>%
  mutate(model = "hurdleNB") %>% relocate(model)

bind_rows(POIS_summary, NB_summary, ZIP_summary, 
          ZINB_summary, hurdlePOIS_summary, hurdleNB_summary) %>% 
  pivot_wider(names_from = model, values_from = .estimate) %>% 
  select(-.estimator) %>% kable(dig =  4)

.metric	POIS	NB	ZIP	ZINB	hurdlePOIS	hurdleNB
rsq	0.0361	0.0356	0.0361	0.0356	0.0361	0.0357
rmse	9.7706	9.7732	9.7704	9.7732	9.7703	9.7727
mae	5.0849	5.0803	5.0848	5.0803	5.0848	5.0796

Fit Statistics: discrimination and correlation

rsq: up to 3 decimal places all of the models are equally awful (R2=0.036)
- the Poisson based models (POIS, ZIP, hurdlePOIS) are negligibly better than the negative binomial all of which has an rsquared of 0.0361
rmse: all equally awful (9.77 days days)
mae: all equally awful (5.08 days)
- predicting los within a little less than a week in either direction is horrific.
- hurdleNB is better by decimal dust (5.0796 days)

7.8.3 Training Sample Assessment

My assessment is based on the rootogram (measures calibration) and the fit statistics (measures discrimination & correlation)

It is very interesting that even though the rootogram looks way better with the negative binomial based models, the summary statistics are still just as bad (or worse) as the Poisson based models. However, this is just in the training sample.

7.9 Validation: Test Sample Predictions

7.9.1 Predict the `los` counts for each subject in our test sample.

I am using the predict function to predict the los counts in the training sample (hiv_test2) with my 6 models (POIS, NB,ZIP, ZINB, hurdlePOIS, and hurdleNB)

test_1 <- predict(POIS, newdata = hiv_test2,
                  type.predict = "response")
test_2 <- predict(NB, newdata = hiv_test2,
                  type.predict = "response")
test_3 <- predict(ZIP, newdata = hiv_test2,
                  type.predict = "response")
test_4 <- predict(ZINB, newdata = hiv_test2,
                  type.predict = "response")
test_5 <- predict(hurdlePOIS, newdata = hiv_test2,
                  type.predict = "response")
test_6 <- predict(hurdleNB, newdata = hiv_test2,
                  type.predict = "response")

Next I am combining the various predictions into a tibble with the original holdout sample data.

test_res <- bind_cols(hiv_test2, 
              pre_m1 = test_1, pre_m2 = test_2, 
              pre_m3 = test_3, pre_m4 = test_4, 
              pre_m5 = test_5, pre_m6 = test_6)

names(test_res)

 [1] "key_nis"     "los"         "subst_abuse" "AIDS_f"      "age"        
 [6] "race"        "region"      "zipinc_qrtl" "insurance"   "ED_record"  
[11] "pre_m1"      "pre_m2"      "pre_m3"      "pre_m4"      "pre_m5"     
[16] "pre_m6"

7.9.2 Summarize fit in test sample for each model

I am getting the fit statistics (rsq, mse, mae) in the training sample by applying the mets command and then binding all of these fit statistic so that I will be able to make a comparison table.

m1_sum <- mets(test_res, truth = los, estimate = pre_m1) %>%
  mutate(model = "POIS") 
m2_sum <- mets(test_res, truth = los, estimate = pre_m2) %>%
  mutate(model = "NB") 
m3_sum <- mets(test_res, truth = los, estimate = pre_m3) %>%
  mutate(model = "ZIP")
m4_sum <- mets(test_res, truth = los, estimate = pre_m4) %>%
  mutate(model = "ZINB")
m5_sum <- mets(test_res, truth = los, estimate = pre_m5) %>%
  mutate(model = "hurdlePOIS")
m6_sum <- mets(test_res, truth = los, estimate = pre_m6) %>%
  mutate(model = "hurdleNB")


test_sum <- bind_rows(m1_sum, m2_sum, m3_sum, m4_sum,
                      m5_sum, m6_sum) %>%
  pivot_wider(names_from = model, 
              values_from = .estimate)

Now here is a table with all of the fit statistics in the training sample

test_sum %>%
  select(-.estimator) %>% kable(dig = 3)

.metric	POIS	NB	ZIP	ZINB	hurdlePOIS	hurdleNB
rsq	0.040	0.040	0.041	0.042	0.041	0.042
rmse	10.446	10.446	8.966	8.965	8.967	8.965
mae	5.348	5.347	5.014	5.008	5.014	5.008

Assessment of the validated fit statistics

rsq
- Firstly, It appears that we did not overfit the model because the rsq increased by about 0.006. (but then again the original rsq was so freaking low that there really wasn’t hardly any fitting going on)
- hurdleNB and ZINB are minisculely better than the rest (rsq=0.042)
rmse
- The models which took into account excess 0’s improved the rmse by about 1.48
- the best for ZINB and HurdleNB (but that’s looking out to 3 decimal points)
mae:
- the lowest for ZINB and hurdleNB

Conclusion

The validated fit statistics are slightly better for ZINB and hurdleNB than the other 4 models. When deciding between which to choose, it is important to consider healthcare the cost associated with length of stay. The hurdleNB looked slightly better by the rootogram because it predicted 0’s perfectly and the vuong test indicated that it’s predicted probabilities were possibly an improvement over the ZINB predictions. Nevertheless, the hurdleNB model over predicted 1’s, which the ZINB model excelled at. Therefore, even though hurdleNB looks slightly better, from the perspective of placing more weight on the public health impact of los (i.e. being able to more accurately predict higher los values ), the ZINB model may be a better choice. I still chose hurdleNB as my final model though.

7.10 Final model

7.10.1 Regression Ceofficients

Unfortunately I cannot make a tidy table of the regression coefficients with the hurdle model. Thus, I have to use summary. As explained above, the negative binomial based models (and Poisson Based), use a log as its link function, so the following coefficients will have to be interpreted as the log of los

summary(hurdleNB, exponentiate = TRUE, 
        conf.int = TRUE, conf.level = 0.95)


Call:
hurdle(formula = los ~ subst_abuse + AIDS_f + age + race + region + zipinc_qrtl + 
    insurance + ED_record, data = hiv_train2, dist = "negbin", zero.dist = "binomial")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.0570 -0.6606 -0.3654  0.1536 42.7721 

Count model coefficients (truncated negbin with log link):
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        1.7133477  0.0479004  35.769  < 2e-16 ***
subst_abuseyes    -0.1106580  0.0183972  -6.015 1.80e-09 ***
AIDS_fyes          0.5936846  0.0235180  25.244  < 2e-16 ***
age                0.0044841  0.0007756   5.781 7.41e-09 ***
raceWhite         -0.0253442  0.0220174  -1.151 0.249691    
raceHispanic      -0.0793630  0.0271422  -2.924 0.003456 ** 
raceOther          0.1498052  0.0497040   3.014 0.002579 ** 
raceAsian          0.0390066  0.0920889   0.424 0.671876    
raceNativeA        0.2792220  0.1248266   2.237 0.025294 *  
regionNortheast    0.0586923  0.0254488   2.306 0.021095 *  
regionSouth       -0.0152584  0.0270947  -0.563 0.573332    
regionWest        -0.1029889  0.0290843  -3.541 0.000399 ***
regionMidwest     -0.1849104  0.0307075  -6.022 1.73e-09 ***
zipinc_qrtl48-61K  0.0055749  0.0227359   0.245 0.806298    
zipinc_qrtl61-82K  0.0356644  0.0260498   1.369 0.170974    
zipinc_qrtl82K+    0.0257215  0.0303426   0.848 0.396603    
insuranceMedicare -0.1367474  0.0222882  -6.135 8.49e-10 ***
insurancePrivate  -0.1589493  0.0275364  -5.772 7.82e-09 ***
insuranceSelf_pay -0.1467868  0.0373464  -3.930 8.48e-05 ***
insuranceOther     0.0560034  0.0553886   1.011 0.311969    
ED_recordyes      -0.1613901  0.0222716  -7.246 4.28e-13 ***
Log(theta)        -0.1433089  0.0218107  -6.571 5.01e-11 ***
Zero hurdle model coefficients (binomial with logit link):
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        3.618242   0.299889  12.065  < 2e-16 ***
subst_abuseyes     0.051363   0.117207   0.438  0.66122    
AIDS_fyes          0.545901   0.175391   3.112  0.00186 ** 
age                0.014401   0.004837   2.978  0.00291 ** 
raceWhite         -0.025448   0.140676  -0.181  0.85645    
raceHispanic      -0.016695   0.175358  -0.095  0.92415    
raceOther         -0.316055   0.287683  -1.099  0.27193    
raceAsian          1.319614   1.011155   1.305  0.19187    
raceNativeA       -0.495601   0.598334  -0.828  0.40750    
regionNortheast    0.080640   0.167242   0.482  0.62968    
regionSouth        0.600939   0.204855   2.933  0.00335 ** 
regionWest        -0.400104   0.168265  -2.378  0.01742 *  
regionMidwest     -0.161986   0.182560  -0.887  0.37492    
zipinc_qrtl48-61K -0.145994   0.140938  -1.036  0.30026    
zipinc_qrtl61-82K  0.124368   0.175290   0.709  0.47802    
zipinc_qrtl82K+   -0.215456   0.181707  -1.186  0.23573    
insuranceMedicare -0.331031   0.141634  -2.337  0.01943 *  
insurancePrivate   0.017039   0.189766   0.090  0.92845    
insuranceSelf_pay -0.372190   0.229155  -1.624  0.10434    
insuranceOther    -0.761993   0.274917  -2.772  0.00558 ** 
ED_recordyes      -0.297915   0.154370  -1.930  0.05362 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta: count = 0.8665
Number of iterations in BFGS optimization: 28 
Log-likelihood: -4.895e+04 on 43 Df

If Harry and Larry have the same values for all other predictors but only Harry has a substance use disorder, the model predicts Harry to have a value of log(los) that is -0.117 lower than Larry’s log(los).
- this translates to a los that is 0.9 days lower for people with SUD 95% CI )0.86 0.93)
If Harry and Larry have the same values for all other predictors but only Harry has an AIDS defining illness, the model predicts Harry to have a value of log(los) that is 0.52 higher than Larry’s log(los).
- this translates to a los that is 1.7 days higher for PLWH with an AIDS defining illness (95% CI 1.73 to 1.9
In PLWH, the model suggests zipinc_qrtl doesn’t have a meaningful effect on los
PLWH with Medicaid actually appeared to have longer los in comparison to all other insurance groups, except other
- range of log los: -0.158 to -0.137
- this translates to about 1 day less for each insurance in comparison to medicaid (0.9 days)

I also started to make a plot of the coefficients with their point exponentiated los and their associated 95% CI, but then I stopped due to it taking a long time and I was afraid I would make an error. Note I made the los negative for subs_abuse yes, because in comparison to subst_abuse no, the los was shorter

fullpo_tte <- tibble(
    predictor = c("subst_abuseyes", "AIDS_fyes ", "age"),
    estimate = c(-0.9, 1.81, 1),
    conf.low = c(-0.93, 1.73, 1),
    conf.high = c(-0.86, 1.9, 1.01))

ggplot(fullpo_tte, aes(x = predictor, y = estimate)) +
    geom_errorbar(aes(ymax = conf.high, ymin = conf.low), width = 0.5) + 
    geom_label(aes(label = estimate), size = 5) +
    theme_bw() + 
  theme(axis.text.x = element_text(angle = 45,  hjust=1)) +
  labs(title = "Estimated Length of Stay and 95% Confidence Intervals by Covariate",
       subtitle = "HCUP-NIS 2018 Data",
       x = "")

Below is a table of the exponentiated log los. I inserted a negative sign if the log los was negative to indicate (if applicable) how much lower the los was for that predictor in comparison to the reference value.

Coefficient	`los` estimate (days)	P value
subst_abuseyes	-0.9	1.80e-09
AIDS_fyes	1.81	< 2e-16
age	1	7.41e-09
raceWhite	-0.97	0.249691
raceHispanic	-0.92	0.003456
raceOther	1.16	0.002579
raceAsian	1.04	0.671876
raceNativeA	1.32	0.025294
regionNortheast	1.06	0.021095
regionSouth	- 0.98	0.573332
regionWest	-0.9	0.000399
regionMidwest	- 0.83	1.73e-09
zipinc_qrtl48-61K	1.01	0.806298
zipinc_qrtl61-82K	1.04	0.170974
zipinc_qrtl82K+	1.03	0.396603
insuranceMedicare	- 0.87	8.49e-10
insurancePrivate	- 0.85	7.82e-09
insuranceSelf_pay	- 0.86	8.48e-05
insuranceOther	1.06	0.311969
ED_recordyes	0.85	4.28e-13

NOTE: this table would be much better if I had calculated the 95% CI for each coefficient instead of using the P values.

8 Conclusion analysis 2

My research question was: “in PLWH, how does the length of hospital stay for those with SUD compare to those without SUD in 2018?” After working in managed care for PLWH, I could see that people with mental illness/SUD had considerably more hospitalizations, complications, and overall higher utilization that those without mental illness/SUD. Since then, I have been interested in revealing patterns related to hospitalization in PLWH and mental illness/SUD because the knowledge could greatly affect resource allocation. I have not found any other studies that have evaluated this relationship in PLWH. However, a study by Ndanga et al. , that also used HCUP-NIS data (2010-2014), found that among the general population, that the average length of stay was 1 day longer for drug abusers than non-drug abusers (P<0.001; no CI).

In the training sample (n=16,883), after adjusting for patient factors (age, race, region, zipinc_qrtl, insurance, and AIDS_f), my model predicted that PLWH with a SUD have a length of stay that is 0.9 days (95% CI 0.86 to 0.93 days) lower than PLWH without a SUD. However, my model (Negative Binomial Hurdle Model) was not well calibrated as indicated by the rootogram which overpredicted counts of 1 day, slightly under predicted 2-8 days, over-predicted counts from 10-15 days, did pretty well with predictions out to 50 days, but then never predicted anything above 50 days (there were 130 people in the training sample with length of stays >50 day, with a max of 247 days). Furthermore, the model did not have strong discrimination as indicated by the low rsquare (0.04) and high mean absolute error (5 days).

Limitations in my approach:

I didn’t adjust for hospital related factors (eg bed size, academic vs nonacademic). This is in the hospital file (not the core file)
I didn’t do any evaluation for nonlinear terms (a spearman rho squared plot suggested may doing an interaction between region and AIDS_f
I didn’t take into account reason for hospitalization. Many studies that compare length of stays for people with mental illness compare how much longer it is given the same reason for hospitalization.
I do not show graphical representations of the coefficients with their confidence intervals. That would be so much more helpful for evaluating effect sizes.
There are certain variables that have been associated with LOS that I do not have access to:

marital status, employment, history of previous admission (Oid et al.)

Next steps

Include the hospital file to incorporate hospital related characteristics
Use propensity scores to make the groups more similar
Evaluate nonlinearity
Evaluate the effect of different kinds of SUD rather than lumping all into one big category
Finish my plot that shows the point estimate of each coefficient its associated confidence interval.
Do more thorough research on comorbidites which affect length of stay and flag those and then adjust for those (or ideally PS match on those)

Learned about statistics/data science

I was really able to see the difference between calibration (rootogram) and discrimination/correlation (fit statistics) and that just because a model looks MUCH better on the rootogram, that does not mean that it will have better fit statistics (they can even be worse)
I learned about all the different approaches to model count outcomes. I had forgotten the purpose of the hurdle model and I was also too lazy to try it (just go with the ZIP and ZINB), but I was surprised that the hurdle model actually looked better than the zero inflated model for the negative binomial regression
I took for granted the utility of the broom package, specifically the tidy function. I really wished I was able to apply the tidy function to the hurdle model, but instead I did a lot of work by hand that took hours.
I also wished that I could use orm with the hurdle model to be able to graph effect sizes, but that also does not’t work. This project has really shown me how much more insightful it is to graph effect sizes rather than showing them in a table. Thus, if you can get it with a model, make sure you take advantage of that feature and show it!

9 References and Acknowledgments

HCUP-NIS 2018 Data was purchased from: https://www.hcup-us.ahrq.gov/nisoverview.jsp

Hartzler B, Dombrowski JC, Crane HM, et al. Prevalence and Predictors of Substance Use Disorders Among HIV Care Enrollees in the United States. AIDS Behav. 2017;21(4):1138-1148.doi:10.1007/s10461-016-1584-6
Cook JA, Burke-Miller JK, Cohen MH, et al. Crack cocaine, disease progression, and mortality in a multicenter cohort of HIV-1 positive women [published correction appears in AIDS. 2008 Sep 12;22(14):i. Levine, Andrea [corrected to Levine, Alexandra M]]. AIDS. 2008;22(11):1355-1363. doi:10.1097/QAD.0b013e32830507f2
Lucas GM, Gebo KA, Chaisson RE, Moore RD. Longitudinal assessment of the effects of drug and alcohol abuse on HIV-1 treatment outcomes in an urban clinic. AIDS. 2002;16(5):767-774. doi:10.1097/00002030-200203290-00012
Anastos K, Schneider MF, Gange SJ, Minkoff H, Greenblatt RM, Feldman J, Levine A, Delapenha R, Cohen M. The association of race, sociodemographic, and behavioral characteristics with response to highly active antiretroviral therapy in women. J Acquir Immune Defic Syndr 2005;39:537–44
Ndanga M, Srinivasan S. Analysis of Hospitalization Length of Stay and Total Charges for Patients with Drug Abuse Comorbidity. Cureus. 2019 Dec 30;11(12):e6516. doi: 10.7759/cureus.6516. PMID: 32025435; PMCID: PMC6988730.
Khosravizadeh O, Vatankhah S, Bastani P, Kalhor R, Alirezaei S, Doosty F. Factors affecting length of stay in teaching hospitals of a middle-income country. Electron Physician. 2016;8(10):3042-3047. Published 2016 Oct 25. doi:10.19082/3042
Baek H, Cho M, Kim S, Hwang H, Song M, Yoo S. Analysis of length of hospital stay using electronic health records: A statistical and data mining approach. PLoS One. 2018;13(4):e0195901. Published 2018 Apr 13. doi:10.1371/journal.pone.0195901

10 Session Information

xfun::session_info()

R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8

Package version:
  askpass_1.1               assertthat_0.2.1         
  backports_1.1.9           base64enc_0.1-3          
  BayesFactor_0.9.12-4.2    bayestestR_0.9.0         
  BH_1.72.0.3               blob_1.2.1               
  bookdown_0.20             boot_1.3-26              
  broom_0.7.6               BWStest_0.2.2            
  callr_3.4.3               caret_6.0-86             
  cellranger_1.1.0          checkmate_2.0.0          
  class_7.3-18              cli_2.0.2                
  clipr_0.7.0               cluster_2.1.0            
  coda_0.19-3               codetools_0.2-18         
  colorspace_1.4-1          compiler_4.0.4           
  conquer_1.0.2             contfrac_1.1.12          
  correlation_0.6.1         countreg_0.2-1           
  cpp11_0.2.1               crayon_1.3.4             
  crosstalk_1.1.0.1         curl_4.3                 
  data.table_1.13.0         datasets_4.0.4           
  DBI_1.1.0                 dbplyr_1.4.4             
  desc_1.2.0                deSolve_1.28             
  digest_0.6.25             dplyr_1.0.2              
  e1071_1.7-3               effectsize_0.4.4-1       
  ellipsis_0.3.1            elliptic_1.4.0           
  evaluate_0.14             fansi_0.4.1              
  farver_2.0.3              forcats_0.5.0            
  foreach_1.5.0             foreign_0.8-81           
  Formula_1.2-3             fs_1.5.0                 
  furrr_0.1.0               future_1.18.0            
  gdata_2.18.0              generics_0.0.2           
  ggcorrplot_0.1.3          ggdendro_0.1.21          
  ggforce_0.3.2             ggformula_0.9.4          
  ggplot2_3.3.3             ggrepel_0.8.2            
  ggsignif_0.6.1            ggstance_0.3.4           
  ggstatsplot_0.7.2         globals_0.12.5           
  glue_1.4.2                gmodels_2.18.1           
  gmp_0.6-2                 gower_0.2.2              
  graphics_4.0.4            grDevices_4.0.4          
  grid_4.0.4                gridExtra_2.3            
  gtable_0.3.0              gtools_3.8.2             
  haven_2.3.1               highr_0.8                
  Hmisc_4.4-1               hms_0.5.3                
  htmlTable_2.0.1           htmltools_0.5.1.1        
  htmlwidgets_1.5.1         httr_1.4.2               
  hypergeo_1.2.13           insight_0.13.2           
  ipmisc_6.0.0              ipred_0.9-9              
  isoband_0.2.2             iterators_1.0.12         
  janitor_2.0.1             jpeg_0.1-8.1             
  jsonlite_1.7.0            KernSmooth_2.23.18       
  knitr_1.29                kSamples_1.2-9           
  labeling_0.3              labelled_2.6.0           
  lattice_0.20-41           latticeExtra_0.6-29      
  lava_1.6.7                lazyeval_0.2.2           
  leaflet_2.0.3             leaflet.providers_1.9.0  
  lifecycle_0.2.0           listenv_0.8.0            
  lubridate_1.7.9           magrittr_1.5             
  markdown_1.1              MASS_7.3-53              
  Matrix_1.3-2              MatrixModels_0.4-1       
  matrixStats_0.56.0        mc2d_0.1-19              
  memoise_1.1.0             methods_4.0.4            
  mgcv_1.8.33               mice_3.13.0              
  mime_0.9                  minqa_1.2.4              
  mitools_2.4               ModelMetrics_1.2.2.2     
  modelr_0.1.8              mosaic_1.7.0             
  mosaicCore_0.6.0          mosaicData_0.18.0        
  multcomp_1.4-13           multcompView_0.1-8       
  munsell_0.5.0             mvtnorm_1.1-1            
  naniar_0.5.2              nlme_3.1-152             
  nnet_7.3-15               numDeriv_2016.8.1.1      
  openssl_1.4.2             pairwiseComparisons_3.1.3
  paletteer_1.3.0           parallel_4.0.4           
  parameters_0.13.0         patchwork_1.0.1          
  pbapply_1.4-3             performance_0.7.1        
  pillar_1.4.6              pkgbuild_1.1.0           
  pkgconfig_2.0.3           pkgload_1.1.0            
  plyr_1.8.6                PMCMRplus_1.9.0          
  png_0.1-7                 polspline_1.1.19         
  polyclip_1.10-0           praise_1.0.0             
  prettyunits_1.1.1         prismatic_1.0.0          
  pROC_1.17.0.1             processx_3.4.3           
  prodlim_2019.11.13        progress_1.2.2           
  ps_1.3.4                  pscl_1.5.5               
  purrr_0.3.4               quantreg_5.61            
  R6_2.4.1                  raster_3.3.13            
  RColorBrewer_1.1-2        Rcpp_1.0.5               
  RcppArmadillo_0.9.900.2.0 RcppEigen_0.3.3.7.0      
  readr_1.3.1               readxl_1.3.1             
  recipes_0.1.13            rematch_1.0.1            
  rematch2_2.1.2            reprex_0.3.0             
  reshape_0.8.8             reshape2_1.4.4           
  rlang_0.4.10              rmarkdown_2.3            
  rmdformats_0.3.7          Rmpfr_0.8-4              
  rms_6.0-1                 rpart_4.1-15             
  rprojroot_1.3.2           rsample_0.0.7            
  rstudioapi_0.11           rvest_0.3.6              
  sandwich_2.5-1            scales_1.1.1             
  selectr_0.4.2             snakecase_0.11.0         
  sp_1.4.2                  SparseM_1.78             
  splines_4.0.4             SQUAREM_2020.4           
  stats_4.0.4               stats4_4.0.4             
  statsExpressions_1.0.1    stringi_1.4.6            
  stringr_1.4.0             SuppDists_1.1-9.5        
  survey_4.0                survival_3.2-7           
  sys_3.4                   tableone_0.12.0          
  testthat_2.3.2            TH.data_1.0-10           
  tibble_3.0.3              tidyr_1.1.2              
  tidyselect_1.1.0          tidyverse_1.3.0          
  timeDate_3043.102         tinytex_0.25             
  tools_4.0.4               tweenr_1.0.1             
  UpSetR_1.4.0              utf8_1.1.4               
  utils_4.0.4               vctrs_0.3.4              
  viridis_0.5.1             viridisLite_0.3.0        
  visdat_0.5.3              whisker_0.4              
  withr_2.2.0               WRS2_1.1-1               
  xfun_0.16                 xml2_1.3.2               
  yaml_2.2.1                yardstick_0.0.7          
  zeallot_0.1.0             zoo_1.8-8

Prevalence of AIDS defining illness in HIV patients with Substance Abuse