I have loaded 15 unique packages below.
library(haven) # bring in the SAS XPT data file from BRFSS
library(rsample)
library(janitor)
library(magrittr)
library(rms)
library(broom)
library(GGally)
library(naniar)
library(simputation)
library(patchwork)
library(ggrepel)
library(caret) # confusion matrix
library(yardstick)
library(pROC)
library(tidyverse)
theme_set(theme_bw())
Data source: the Behavioral Risk Factor Surveillance System (BRFSS) is an annual survey database collected by the Centers for Disease Control and Prevention. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
Who gathered the data: Centers for Disease Control and Prevention (CDC)
When collected: 2019
Purpose: to provide data for research regarding health-related risk behaviors, chronic health conditions, and use of preventive services.
Nearly two thirds of states use BRFSS data to support health-related legislative efforts.
“By collecting behavioral health risk data at the state and local level, BRFSS has become a powerful tool for targeting and building health promotion activities”
Sampling Strategy:
Study Design: cross-sectional (survey data).
The subjects in BRFSS include non-institutionalized adults (≥ 18 years old) living in all 50 states.
In this study, I will be focusing on the respondents who answered that they had a history of hepatitis C. The states where this particular question was asked include CT, GA, KY, LA, NC, TN, WV.
Here I am loading in the raw data, dat_raw. This gives the nationwide data, which has 418,268 rows and 342 columns.
dat_raw <- read_xpt("LLCP2019.XPT")
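As a quick sanity check (a sketch; the output is not shown here), dim() should confirm the dimensions quoted above.
dim(dat_raw) # expecting 418268 rows and 342 columns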
I created a new data file by:
filtering away all observations except those that completed the hepatitis module and were told they had a history of HCV
selecting only the variables I will need to develop the variables I plan to study
saving the result, which has 569 rows and 11 columns.
hcv_raw <- dat_raw %>%
clean_names() %>% haven::zap_label() %>%
filter(state %in% c(9, 13, 21, 22, 37, 47, 54)) %>%
filter(toldhepc == 1) %>%
select(seqno, dispcode, state, toldhepc, trethepc, prirhepc, sexvar, income2, imprace, alcday5, physhlth)
write_csv(hcv_raw, "hcv_raw.csv")
As the first part of the cleaning step, I will use clean_names() to make the variable names lower case and easier to work with. I will also use zap_label() to remove any labels on the variables.
brfss_hcv_raw <- read_csv("hcv_raw.csv")
Parsed with column specification:
cols(
seqno = col_double(),
dispcode = col_double(),
state = col_double(),
toldhepc = col_double(),
trethepc = col_double(),
prirhepc = col_double(),
sexvar = col_double(),
income2 = col_double(),
imprace = col_double(),
alcday5 = col_double(),
physhlth = col_double()
)
brfss_hcv_raw <- brfss_hcv_raw %>% clean_names() %>% haven::zap_label()
I will only be evaluating subjects who have a history of hepatitis C. Therefore, in this section I have:
filtered to the states that used the hepatitis module (CT, GA, KY, LA, NC, TN, WV)
filtered to respondents who have been told that they have hepatitis C (toldhepc = 1)
dat_hcv_clean <- brfss_hcv_raw %>% filter(state %in% c(9, 13, 21, 22, 37, 47, 54)) %>%
filter(toldhepc == 1)
The data set consists of 569 adults (age ≥18) from CT, GA, KY, LA, NC, TN, WV who have been told they have a history of hepatitis C.
dat_hcv_clean %>%
dim()
[1] 569 11
My planned analyses are:
How well race (white vs non-white) predicts history of HCV treatment after adjusting for physical health, income, and sex
How well physical health predicts the number of days a person drinks in the past month after adjusting for income, sex, and history of HCV treatment.
In this section I will be selecting and creating these variables.
Here I am creating the variables:
treated, from the raw variables trethepc and prirhepc. Values of 1 for trethepc and prirhepc indicate that a person has received hepatitis C treatment before and after the year 2015, respectively. My new variable, treated, combines these two variables to describe whether someone has ever been treated in the past. It has the levels 0 (no prior treatment) and 1 (prior treatment), and I have releveled it to make sure 1 (treated) is the referent category.
income, from the raw variable income2. I made it a factor, assigned meaningful names to the levels, and collapsed the categories. After collapsing, it has 5 levels, each with at least 53 people per level:
0-9K: less than $10,000 per year (n = 79)
10-19K: $10,000 to less than $20,000 (n = 147)
20-34K: $20,000 to less than $35,000 (n = 105)
35-74K: $35,000 to less than $75,000 (n = 84)
75K+: $75,000 or more (n = 53)
I converted the non-response values, 77 and 99, to NA.
female, from the raw variable sexvar (1 = female, 0 = male)
white, from the raw variable imprace (1 = White, 0 = non-White)
dat_hcv_clean <- dat_hcv_clean %>%
mutate(treated = ifelse(trethepc == 1 | prirhepc == 1, "1", "0"),
income_f = fct_recode(factor(income2),
"0-9K" = "1",
"10-14K" = "2",
"15-19K" = "3",
"20-24K" = "4",
"25-34K" = "5",
"35-49K" = "6",
"50-74K"= "7",
"75K+" = "8",
"dontknow"= "77",
"refused" = "99")) %>%
mutate(income =
fct_collapse(income_f,
"10- 19k" = c("10-14K",
"15-19K"),
"20-34K" = c("20-24K", "25-34K"),
"35 - 74K"= c("35-49K", "50-74K"),
NULL = c("dontknow", "refused"))) %>%
mutate(income=fct_relevel(income, "0-9K")) %>%
mutate(female = fct_recode(factor(sexvar),
"1" = "2",
"0" = "1")) %>%
mutate(white =
fct_collapse(factor(imprace),
"1" = c("1"),
"0"= c("2", "3", "4", "5", "6"))) %>%
mutate(white= fct_relevel(white, "1", "0")) %>%
mutate(treated = as.factor(treated),
treated= fct_relevel(treated, "1", "0"))
Below I am checking that:
income has the 5 levels described above, in a logical order
female is a binary variable (1/0)
white is a binary variable (1/0), with 1 appearing first
treated is a binary variable (1/0), with 1 appearing first
all categories have at least 30 subjects
dat_hcv_clean %>% count(income)
# A tibble: 6 x 2
income n
<fct> <int>
1 0-9K 79
2 10- 19k 147
3 20-34K 105
4 35 - 74K 84
5 75K+ 53
6 <NA> 101
dat_hcv_clean %>% count(female)
# A tibble: 2 x 2
female n
<fct> <int>
1 0 325
2 1 244
dat_hcv_clean %>% count(white)
# A tibble: 2 x 2
white n
<fct> <int>
1 1 413
2 0 156
dat_hcv_clean %>% count(treated)
# A tibble: 2 x 2
treated n
<fct> <int>
1 1 410
2 0 159
Everything looks great. I will note that there are 101 missing observations for income.
The raw variable, alcday5, represents the number of days in the past 30 on which a respondent had an alcoholic beverage (e.g. beer, wine, a malt beverage). The responses include:
101-107 = # of days per week (101 = 1 day per week, 107 = 7 days per week)
201-230 = # of days in past 30 days (201 = 1 day in last 30, 230 = 30 days in last 30)
777 = Don't know / Not sure
888 = No drinks in past 30 days
999 = Refused
BLANK = Not asked or missing
I converted this to a single numeric value which represents the number of days in the past month that a person drank alcohol.
dat_hcv_clean <- dat_hcv_clean %>%
mutate(alcdays = as.numeric(alcday5)) %>%
mutate(alcdays = replace(alcdays, alcdays == 888, 0),
alcdays = replace(alcdays, alcdays %in% c(777, 999),NA)) %>%
mutate(alcdays = case_when(alcday5 > 200 & alcday5 < 231 ~ alcday5 - 200,
alcday5 > 100 & alcday5 < 108 ~ round((alcday5 - 100)*(30/7)), TRUE ~ alcdays))
The check below shows us that we have successfully made the conversion
dat_hcv_clean %>% count(alcday5, alcdays)
# A tibble: 26 x 3
alcday5 alcdays n
<dbl> <dbl> <int>
1 101 4 22
2 102 9 15
3 103 13 7
4 104 17 6
5 105 21 6
6 106 26 5
7 107 30 12
8 201 1 29
9 202 2 24
10 203 3 17
# … with 16 more rows
I will further check to ensure that it has an appropriate minimum and maximum (0 to 30)
mosaic::favstats(~alcdays, data= dat_hcv_clean )
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
min Q1 median Q3 max mean sd n missing
0 0 0 3 30 4.124101 8.387511 556 13
Perfect.
The physhlth variable represents the number of days during the past 30 days that a respondent's physical health was not good.
I will replace the values 77 and 99 with NA, and replace 88 with 0.
dat_hcv_clean <- dat_hcv_clean %>%
mutate(physhealth = physhlth,
physhealth = replace(physhealth, physhealth %in% c(77, 99), NA), physhealth = replace(physhealth, physhealth == 88, 0))
The check below includes the key values (77, 88, 99) from physhlth, and we can see that those have successfully been converted in physhealth.
dat_hcv_clean %>% count(physhlth, physhealth) %>% tail()
# A tibble: 6 x 3
physhlth physhealth n
<dbl> <dbl> <int>
1 28 28 4
2 29 29 2
3 30 30 110
4 77 NA 21
5 88 0 196
6 99 NA 2
I will further check to ensure that it has an appropriate minimum and maximum (0 to 30)
mosaic::favstats(~physhealth, data= dat_hcv_clean )
min Q1 median Q3 max mean sd n missing
0 0 4 20 30 10.04579 11.8971 546 23
Looks good.
Below we can see that the ID variable, seqno, begins with the digits 2019000000.
mosaic::favstats(~seqno, data=dat_hcv_clean)
min Q1 median Q3 max mean sd n
2.019e+09 2019001719 2019003441 2019005015 2019008999 2019003526 2167.51 569
missing
0
I have converted seqno to ID and removed the leading 2019000000. ID is a character variable.
dat_hcv_clean <- dat_hcv_clean %>%
mutate(ID = as.character(seqno - 2019000000))
Below we can see that all of the IDs have successfully been changed
mosaic::favstats(~ID, data=dat_hcv_clean)
Warning in FUN(eval(formula[[2]], data, .envir), ...): Auto-converting character
to numeric.
min Q1 median Q3 max mean sd n missing
41 1719 3441 5015 8999 3526.436 2167.51 569 0
Now that I have renamed, releveled, and created the variables I need from the raw variables, I will select only the variables that I will use for my analysis.
I am selecting: ID, treated, alcdays, physhealth, income, white, and female.
dat_hcv_clean <- dat_hcv_clean %>%
select(ID, treated, alcdays, physhealth, income, white, female)
I can see below that only the variables selected above are here, the column types match what I want, and all of the numbers of levels of categories are correct.
dat_hcv_clean %>%
mosaic::inspect()
Warning: `data_frame()` is deprecated as of tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
categorical variables:
name class levels n missing
1 ID character 550 569 0
2 treated factor 2 569 0
3 income factor 5 468 101
4 white factor 2 569 0
5 female factor 2 569 0
distribution
1 1446 (0.4%), 1479 (0.4%) ...
2 1 (72.1%), 0 (27.9%)
3 10- 19k (31.4%), 20-34K (22.4%) ...
4 1 (72.6%), 0 (27.4%)
5 0 (57.1%), 1 (42.9%)
quantitative variables:
name class min Q1 median Q3 max mean sd n missing
...1 alcdays numeric 0 0 0 3 30 4.124101 8.387511 556 13
...2 physhealth numeric 0 0 4 20 30 10.045788 11.897101 546 23
Also note above that ID is our subject identifier; it is a character variable, though (as shown below) it is not quite unique for every row.
Below is the clean tibble, dat_hcv_clean
dat_hcv_clean
# A tibble: 569 x 7
ID treated alcdays physhealth income white female
<chr> <fct> <dbl> <dbl> <fct> <fct> <fct>
1 41 0 9 0 35 - 74K 1 1
2 326 1 4 0 35 - 74K 0 0
3 348 1 0 17 10- 19k 1 0
4 491 1 0 30 10- 19k 1 0
5 534 1 0 0 0-9K 0 0
6 801 1 30 0 35 - 74K 1 0
7 870 1 0 25 <NA> 1 1
8 920 0 4 30 10- 19k 1 0
9 1086 0 3 1 <NA> 1 1
10 1094 1 0 7 10- 19k 1 0
# … with 559 more rows
There are 569 rows and 7 columns in dat_hcv_clean.
ID is the name of the character variable that identifies each row. Note, however, that there are only 550 distinct ID values across the 569 rows (seqno values repeat across states), so ID alone does not uniquely identify every row, as can be seen below.
dat_hcv_clean %>% select(ID) %>%
n_distinct()
[1] 550
saveRDS(dat_hcv_clean, "dat_hcv_clean.Rds")
Variable | Role | Type | Description |
---|---|---|---|
ID | identifier | - | respondent identification number |
treated | outcome & predictor | 2-cat | subject has been treated for HCV (1 = Yes, 0 = No) |
alcdays | outcome | quant | # of days out of the past 30 on which the subject had at least one alcoholic drink. Possible values include 0 to 30. |
physhealth | input | quant | # of days out of the past 30 on which the subject's physical health was not good (e.g. physical illness, injury). Possible values include 0 to 30. |
income | input | 5-cat | annual household income from all sources, in USD (0-9K, 10-19K, 20-34K, 35-74K, 75K+) |
white | input | 2-cat | race and ethnicity, in 2 categories (1 = White, 0 = non-White) |
female | input | 2-cat | sex (1 = female, 0 = male) |
The results below match up with the codebook above.
dat_hcv_clean %>%
Hmisc::describe()
.
7 Variables 569 Observations
--------------------------------------------------------------------------------
ID
n missing distinct
569 0 550
lowest : 1024 1030 1051 1060 1061, highest: 931 943 948 958 995
--------------------------------------------------------------------------------
treated
n missing distinct
569 0 2
Value 1 0
Frequency 410 159
Proportion 0.721 0.279
--------------------------------------------------------------------------------
alcdays
n missing distinct Info Mean Gmd .05 .10
556 13 22 0.75 4.124 6.762 0 0
.25 .50 .75 .90 .95
0 0 3 17 30
lowest : 0 1 2 3 4, highest: 20 21 26 28 30
--------------------------------------------------------------------------------
physhealth
n missing distinct Info Mean Gmd .05 .10
546 23 23 0.945 10.05 12.49 0 0
.25 .50 .75 .90 .95
0 4 20 30 30
lowest : 0 1 2 3 4, highest: 22 25 28 29 30
--------------------------------------------------------------------------------
income
n missing distinct
468 101 5
lowest : 0-9K 10- 19k 20-34K 35 - 74K 75K+
highest: 0-9K 10- 19k 20-34K 35 - 74K 75K+
Value 0-9K 10- 19k 20-34K 35 - 74K 75K+
Frequency 79 147 105 84 53
Proportion 0.169 0.314 0.224 0.179 0.113
--------------------------------------------------------------------------------
white
n missing distinct
569 0 2
Value 1 0
Frequency 413 156
Proportion 0.726 0.274
--------------------------------------------------------------------------------
female
n missing distinct
569 0 2
Value 0 1
Frequency 325 244
Proportion 0.571 0.429
--------------------------------------------------------------------------------
My question is: In people with hepatitis C, how well does physical health predict the number of days a person drinks in the past month after adjusting for income, sex, and history of HCV treatment?
My quantitative outcome is alcdays, from the tibble dat_hcv_clean.
This outcome is particularly relevant to people with a history of hepatitis C because they have a higher risk for liver complications. Awareness of risk factors associated with increased days of alcohol consumption may be helpful for targeting patients in need of extra support.
There are 556 subjects (97.7%) with complete data on this variable
dat_hcv_clean %>% select(alcdays) %>% miss_case_table()
# A tibble: 2 x 3
n_miss_in_case n_cases pct_cases
<int> <int> <dbl>
1 0 556 97.7
2 1 13 2.28
The plot below displays the distribution of the quantitative outcome, alcdays.
ggplot(dat_hcv_clean, aes(x = alcdays)) +
geom_histogram(binwidth=1,
fill = "royalblue",
color = "white") +
labs(title = "Histogram of days in the past 30 days a subject drank alcohol", subtitle = "556 observations from BRFSS 2019 data (excluding 13 missing cases)",
x = "Number of days", y = "Count")
Warning: Removed 13 rows containing non-finite values (stat_bin).
The data are right skewed, with the majority of subjects not drinking any alcohol in the past 30 days, while some subjects drank daily. This variable is discrete: all possible values are integers between 0 and 30.
Given that the data are right skewed, a transformation might be appropriate. Some 'step down' transformations to consider include the log, square root, and inverse.
alcdays meets the standard of at least 10 different ordered observed values.
My planned predictors include:
1.) income: multicategorical (5 levels)
2.) physhealth: continuous
3.) female: binary
4.) treated: binary
Below I am demonstrating that physhealth has at least 10 different ordered observed values.
ggplot(dat_hcv_clean, aes(x = physhealth)) +
geom_histogram(binwidth=1,
fill = "royalblue",
color = "white") +
labs(title = "Days with Poor Physical Health in the Past 30", subtitle = "23 different ordered observed values for physhealth",
x = "Number of days", y = "Count")
Warning: Removed 23 rows containing non-finite values (stat_bin).
Below I am demonstrating that income has 5 different levels (not including NA) and that each level has at least 53 observations.
dat_hcv_clean %>%
tabyl(income)
income n percent valid_percent
0-9K 79 0.13884007 0.1688034
10- 19k 147 0.25834798 0.3141026
20-34K 105 0.18453427 0.2243590
35 - 74K 84 0.14762742 0.1794872
75K+ 53 0.09314587 0.1132479
<NA> 101 0.17750439 NA
With 569 observations, the model can contain at most about 8 [4 + (569 - 100)/100] candidate regression inputs. I have 4 predictors, fewer than 8.
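That rule of thumb is simple arithmetic, sketched here for transparency:
n <- 569
4 + (n - 100) / 100 # = 8.69, so roughly 8 candidate degrees of freedom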
My question is: In people with a history of HCV, how do the odds of receiving HCV treatment for a non-White person compare to the odds of receiving treatment for a White person after adjusting for physical health, income, and sex?
My binary outcome is treated, from the tibble dat_hcv_clean.
As a pharmacist with experience in managed care, access to HCV treatment is especially important to me. HCV treatment has been revolutionized in the past decade: we have gone from incurable to curable in just 8 weeks. However, with treatment costing nearly $200,000, there are some restrictions on access. Explaining the magnitude of the impact of factors associated with treatment is very interesting to me.
There are no missing cases on this outcome.
dat_hcv_clean %>% select(treated) %>% miss_case_table()
# A tibble: 1 x 3
n_miss_in_case n_cases pct_cases
<int> <int> <dbl>
1 0 569 100
Below is a tabulation of treated in dat_hcv_clean, with each of the two possible values (1 = yes, 0 = no).
dat_hcv_clean %>%
tabyl(treated) %>% kable(digits=2)
treated | n | percent |
---|---|---|
1 | 410 | 0.72 |
0 | 159 | 0.28 |
My planned predictors include:
1.) white: binary
2.) income: 5-level multicategorical (explained in the last section)
3.) female: binary
4.) physhealth: continuous (explained in the last section)
The table below indicates that we have missing values for income, physhealth, and alcdays.
miss_var_summary(dat_hcv_clean) %>% kable()
variable | n_miss | pct_miss |
---|---|---|
income | 101 | 17.750439 |
physhealth | 23 | 4.042179 |
alcdays | 13 | 2.284710 |
ID | 0 | 0.000000 |
treated | 0 | 0.000000 |
white | 0 | 0.000000 |
female | 0 | 0.000000 |
I am binding a shadow (with naniar's bind_shadow) to keep track of which values are imputed.
hcv_sh <- bind_shadow(dat_hcv_clean)
I will impute (simple imputation) values for the predictors income and physhealth. For the outcome, alcdays, I will only use complete cases.
set.seed(22)
hcv_sh <- hcv_sh %>%
data.frame() %>%
filter(complete.cases(alcdays)) %>%
impute_pmm(., physhealth ~
alcdays + female + white + treated) %>%
impute_cart(., income ~
alcdays + female + white + treated + physhealth) %>% as_tibble()
Below we can see that I no longer have missingness for any of my variables (income and physhealth have been imputed)
miss_case_table(hcv_sh) %>% kable()
n_miss_in_case | n_cases | pct_cases |
---|---|---|
0 | 556 | 100 |
Below I have plotted the outcome, alcdays (n = 556), to evaluate the Normality of this outcome.
res <- mosaic::favstats(~ alcdays, data = hcv_sh)
bin_w <- 3
p1 <- ggplot(hcv_sh, aes(x = alcdays)) + geom_histogram(binwidth = bin_w,
fill = "salmon",
col = "white") + theme_light() +
stat_function(
fun = function(x) dnorm(x, mean = res$mean,
sd = res$sd) *res$n * bin_w,
col = "darkred", size = 1) +
labs(title = "Histogram with Normal fit",
x = "Days of Alcohol", y = "# of SEQNs")
p2 <- ggplot(hcv_sh, aes(sample = alcdays)) + geom_qq(col = "salmon") + geom_qq_line(col = "black") + theme_light() +
labs(title = "Normal Q-Q plot",
y = "Days of Alcohol")
p3 <- ggplot(hcv_sh, aes(x = "", y = alcdays)) + geom_violin() +
geom_boxplot(width = 0.2, fill = "salmon", outlier.color = "red") +
theme_light() +
coord_flip() +
labs(title = "Boxplot with Violin",
x = "", y = "Days of Alcohol")
p1 + p2 - p3 + plot_layout(ncol = 1, heights = c(3, 1)) + plot_annotation(title = "Distribution of alcdays is highly right skewed", subtitle= "n = 556 complete cases from BRFSS 2019 data")
We can see that there might be right skew, which means that I will need to evaluate transformations that might make a Normal model more plausible. For right skew, I will try taking a step “down” the ladder from power 1 to lower powers (square root, log, and inverse)
Below I have plotted the square root, log, and inverse transformations of the outcome, alcdays.
p2 <- ggplot(hcv_sh, aes(sample = alcdays)) + geom_qq() + geom_qq_line(col = "red") + labs(title = "untransformed alcdays")
p3 <- ggplot(hcv_sh, aes(sample = log(alcdays + 1))) + geom_qq() + geom_qq_line(col = "red") + labs(title = "log(alcdays)")
p4 <- ggplot(hcv_sh, aes(sample = sqrt(alcdays + 1))) + geom_qq() + geom_qq_line(col = "red") + labs(title = "square root of alcdays")
p5 <- ggplot(hcv_sh, aes(sample = 1/(alcdays +1))) + geom_qq() + geom_qq_line(col = "red") + labs(title = "inverse of alcdays")
( p2 + p3) / (p4 + p5) +
plot_annotation(title = "Evaluating transformations that step down the ladder (alcdays is right skewed)", subtitle = "hcv_sh sample (n = 556): log and inverse transformations are the most normally distributed")
The boxCox function suggests that an inverse transformation of alcdays is appropriate.
mod_0 <- lm((alcdays +1) ~ physhealth + income + female + treated, data = hcv_sh)
car::boxCox(mod_0)
The plot suggests lambda near -1, i.e. an inverse transformation.
summary(car::powerTransform(mod_0))
bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 -0.9312 -1 -1.05 -0.8123
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 324.4351 1 < 2.22e-16
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 1692.268 1 < 2.22e-16
I had originally developed this report using the inverse transformation, however, the problem here is not skew. We will see with the residual plots that the problem is actually a floor and a ceiling effect. Thus, I have re-done this section without the Box-Cox suggested transformation.
The scatterplot matrix below shows the relationship between the 4 predictors and the outcome, alcdays.
Collinearity between the predictors is not a major concern here, in part because I have only one quantitative predictor.
hcv_sh %>%
select(physhealth, income, female, treated, alcdays) %>%
ggpairs(., title = "Scatterplot Matrix: 4 predictors & the outcome alcdays (n = 556)",
lower = list(combo = wrap("facethist", bins = 20)))
Model A predicts alcdays using the predictors physhealth, income, female, and treated.
modA_ols uses ols() to fit the model:
d <- datadist(hcv_sh)
options(datadist = "d")
modA_ols <- ols(alcdays ~ physhealth + income + female + treated, data = hcv_sh, x = TRUE, y = TRUE)
modA_lm uses lm() to fit the model:
modA_lm<- lm(alcdays ~ physhealth + income + female + treated, data = hcv_sh)
Below is a tidied table of the estimates and their 90% confidence intervals
tidy(modA_lm, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, std.error, conf.low, conf.high) %>% kable(digits=3)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 5.141 | 1.061 | 3.057 | 7.226 |
physhealth | -0.044 | 0.030 | -0.103 | 0.014 |
income10- 19k | -0.713 | 1.068 | -2.812 | 1.385 |
income20-34K | 4.081 | 1.157 | 1.808 | 6.353 |
income35 - 74K | 1.211 | 1.273 | -1.290 | 3.711 |
income75K+ | 2.126 | 1.476 | -0.774 | 5.026 |
female1 | -3.522 | 0.692 | -4.881 | -2.163 |
treated0 | -0.404 | 0.763 | -1.902 | 1.094 |
Below are the summary statistics: R-square, AIC and BIC
glance(modA_lm) %>%
select(r.squared, AIC, BIC) %>%
kable(digits = 3)
r.squared | AIC | BIC |
---|---|---|
0.104 | 3898.712 | 3937.599 |
Our estimated R-square across n = 40 training samples was 0.1127, but in the resulting tests, the average R-square was only 0.0903.
This suggests an optimism of 0.1127 - 0.0903 = 0.0224.
The new estimate of R-square, corrected for overfitting, is 0.0816. This is probably a better estimate of what our results might look like in new data similar to the data I used in building model A than our initial estimate of 0.1040.
set.seed(35)
validate(modA_ols, method="boot", B=40)
index.orig training test optimism index.corrected n
R-square 0.1040 0.1127 0.0903 0.0224 0.0816 40
MSE 62.9173 61.0827 63.8839 -2.8012 65.7185 40
g 3.0957 3.1500 2.9310 0.2190 2.8767 40
Intercept 0.0000 0.0000 0.3187 -0.3187 0.3187 40
Slope 1.0000 1.0000 0.9341 0.0659 0.9341 40
Here, I will run a set of residual plots for each model in order to evaluate the regression assumptions:
Linearity
Constant variance
Normality
NOTE: this is cross-sectional survey data, so I will treat the observations as independent.
aug1 <- augment(modA_lm, data = hcv_sh)
p1 <- ggplot(aug1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_point(data = aug1 %>%
slice_max(abs(.resid), n = 3),
col = "red", size = 1) +
geom_text_repel(data = aug1 %>%
slice_max(abs(.resid), n = 3),
aes(label = ID), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Residuals vs. Fitted",
x = "Fitted Value of alcdays", y = "Residual")
p2 <- ggplot(aug1, aes(sample = .std.resid)) +
geom_qq() +
geom_qq_line(col = "red") +
labs(title = "Normal Q-Q plot",
y = "Standardized Residual",
x = "Standard Normal Quantiles")
p3 <- ggplot(aug1, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Scale-Location Plot",
x = "Fitted Value of alcdays",
y = "|Std. Residual|^(1/2)") +
geom_point(data = aug1 %>%
slice_max(abs(.resid), n = 3),
col = "red", size = 1) +
geom_text_repel(data = aug1 %>%
slice_max(abs(.resid), n = 3),
aes(label = ID), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed")
p4 <- ggplot(aug1, aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_point(data = aug1 %>% filter(.cooksd >= 0.5),
col = "red", size = 2) +
geom_text_repel(data = aug1 %>% filter(.cooksd >= 0.5),
aes(label = ID), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
labs(title = "Residuals vs. Leverage",
x = "Leverage", y = "Standardized Residual") +
geom_point(data = aug1 %>%
slice_max(abs(.resid), n = 3),
col = "red", size = 1) +
geom_text_repel(data = aug1 %>%
slice_max(abs(.resid), n = 3),
aes(label = ID), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed")
(p1 + p2) / (p3 + p4) +
plot_annotation(title = "Assessing Residuals for Model A with the BRFSS data (N=556)",caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")
The residuals appear to have a pattern (a curve) with respect to the predicted (fitted) values. This indicates non-linearity.
A large share of subjects reported 0 alcdays, and all of those people end up on a diagonal line in the residuals vs. fitted plot. Thus, a linear model will not fit well in this situation. The subjects with the largest absolute residuals were IDs 2147, 1465, and 7121, each of which has a positive residual (i.e. they represent under-predictions by the model).
The loess smooth in the scale-location plot slopes upward rather than staying flat. Thus, the data are heteroscedastic.
We can also see the plot thickening in residual vs. fitted plot, which implies non-constant variance.
The residuals vs. leverage plot shows many standardized residuals above 3. This is not normal, especially with a data set of this size. Thus we have extreme outliers due to unusual residuals (seen in QQ plot and leverage plot)
We also do not see any points with high leverage (3 times the average leverage of a point in the dataset), which is indicated by the dashed vertical line. Therefore, we do not have any extreme outliers due to an unusual combination of predictor values
There are no highly influential points (all points have a Cook's distance < 0.5).
Below is the Spearman rho squared plot to evaluate the predictive punch of each of my variables in model A.
spear_modA <- spearman2(alcdays ~ physhealth + income + female + treated, data = hcv_sh)
plot(spear_modA)
income, a multicategorical variable, is the most attractive candidate for a non-linear term.
This does not suggest that income actually needs a non-linear term or will show significant non-linearity, but it packs the most potential predictive punch, so the degrees of freedom will be well spent.
I will create an interaction term with the second-largest rho-squared predictor, female. Crossing female with the 5-level income variable spends 4 degrees of freedom.
In order to spend 5 additional degrees of freedom beyond my main effects model, I will add a second interaction term that costs an additional 1 df. Based on the Spearman plot, the wisest choice is an interaction between female and physhealth.
Here I am fitting my "augmented model", incorporating interactions between:
female and income = 4 df
female and physhealth = 1 df
I am fitting with both ols and lm:
modB_ols uses ols() to fit the model:
dd <- datadist(hcv_sh)
options(datadist = "dd")
modB_ols <- ols(alcdays ~ physhealth + income + female + treated + income*female + female %ia% physhealth , data = hcv_sh, x = TRUE, y = TRUE)
modB_lm uses lm() to fit the model:
modB_lm<- lm(alcdays ~ physhealth + income + female + treated + income*female + female %ia% physhealth, data = hcv_sh)
Below is a nomogram which describes modB_ols
plot(nomogram(modB_ols),cex.axis=.5,cex.var=.5, tcl=-.15)
Below we can see that income has some value, but treated, physhealth, and female do not provide much value.
plot(summary(modB_ols))
Since model A is a subset of model B, I can compare these models with ANOVA tests.
anova(modA_lm, modB_lm)
Analysis of Variance Table
Model 1: alcdays ~ physhealth + income + female + treated
Model 2: alcdays ~ physhealth + income + female + treated + income * female +
female %ia% physhealth
Res.Df RSS Df Sum of Sq F Pr(>F)
1 548 34982
2 543 34456 5 526.42 1.6592 0.1427
The improvement from model A to model B doesn't appear to meet the standard for a statistically detectable impact, based on this ANOVA comparison.
I have plotted the residual plots for Model B below.
aug1 <- augment(modB_lm, data = hcv_sh)
p1 <- ggplot(aug1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_point(data = aug1 %>%
slice_max(abs(.resid), n = 3),
col = "red", size = 1) +
geom_text_repel(data = aug1 %>%
slice_max(abs(.resid), n = 3),
aes(label = ID), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Residuals vs. Fitted",
x = "Fitted Value of alcdays", y = "Residual")
p2 <- ggplot(aug1, aes(sample = .std.resid)) +
geom_qq() +
geom_qq_line(col = "red") +
labs(title = "Normal Q-Q plot",
y = "Standardized Residual",
x = "Standard Normal Quantiles")
p3 <- ggplot(aug1, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
labs(title = "Scale-Location Plot",
x = "Fitted Value of alcdays",
y = "|Std. Residual|^(1/2)") +
geom_point(data = aug1 %>%
slice_max(abs(.resid), n = 3),
col = "red", size = 1) +
geom_text_repel(data = aug1 %>%
slice_max(abs(.resid), n = 3),
aes(label = ID), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed")
p4 <- ggplot(aug1, aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_point(data = aug1 %>% filter(.cooksd >= 0.5),
col = "red", size = 2) +
geom_text_repel(data = aug1 %>% filter(.cooksd >= 0.5),
aes(label = ID), col = "red") +
geom_smooth(method = "loess", formula = y ~ x, se = F) +
geom_vline(aes(xintercept = 3*mean(.hat)), lty = "dashed") +
labs(title = "Residuals vs. Leverage",
x = "Leverage", y = "Standardized Residual") +
geom_point(data = aug1 %>%
slice_max(abs(.resid), n = 3),
col = "red", size = 1) +
geom_text_repel(data = aug1 %>%
slice_max(abs(.resid), n = 3),
aes(label = ID), col = "red") +
geom_abline(intercept = 0, slope = 0, lty = "dashed")
(p1 + p2) / (p3 + p4) +
plot_annotation(title = "Assessing Residuals for Model B with the BRFSS data (N=556)",caption = "If applicable, Cook's d >= 0.5 shown in red in bottom right plot.")
We can see that there is no real improvement in the residuals vs fitted plots, which isn’t a surprise because none of the nonlinear terms included were helpful.
However, the three main outlying observations have changed. Also, we now have 4 points with meaningful leverage, which we didn't have with model A.
Below I am partitioning the data into a testing and training sample to see how the models would perform in “new” data.
set.seed(12) ## to make the work replicable in the future
hcv_split <- initial_split(hcv_sh, prop = 3/4)
train_hcv <- training(hcv_split)
test_hcv<- testing(hcv_split)
We can verify that we did a 3/4 split for the training and testing data sets.
dim(train_hcv); dim(test_hcv)
[1] 417 14
[1] 139 14
Below I am fitting both model A and model B to the training sample
modA_lm_t<- lm(alcdays ~ physhealth + income + female + treated, data = train_hcv)
modB_lm_t<- lm(alcdays ~ physhealth + income + female + treated + income*female + female %ia% physhealth, data = train_hcv)
Below I am using the training-sample fits of models A and B to generate predictions in the testing sample.
mA_test_aug <- augment(modA_lm_t, newdata = test_hcv)
mB_test_aug <- augment(modB_lm_t, newdata = test_hcv)
We can see below that the mean absolute error is lower (better) in model A than in model B.
testing_mae <- bind_rows(
mae(mA_test_aug, truth = alcdays, estimate = .fitted),
mae(mB_test_aug , truth = alcdays, estimate = .fitted))%>%
mutate(model = c("model A", "model B"))
testing_mae %>% kable(dig = 3)
.metric | .estimator | .estimate | model |
---|---|---|---|
mae | standard | 5.862 | model A |
mae | standard | 5.922 | model B |
We can see below that the rmse is slightly lower (better) in model A than model B.
testing_rmse <- bind_rows(
rmse(mA_test_aug, truth = alcdays, estimate = .fitted),
rmse(mB_test_aug , truth = alcdays, estimate = .fitted))%>%
mutate(model = c("model A", "model B"))
testing_rmse %>% kable(dig = 3)
.metric | .estimator | .estimate | model |
---|---|---|---|
rmse | standard | 9.273 | model A |
rmse | standard | 9.307 | model B |
The r-square in new testing data is better (higher) in model A than it is in model B
testing_r2 <- bind_rows(
rsq(mA_test_aug, truth = alcdays, estimate = .fitted),
rsq(mB_test_aug, truth = alcdays, estimate = .fitted)) %>%
mutate(model = c("model A", "model B"))
testing_r2 %>% kable(dig = 4)
.metric | .estimator | .estimate | model |
---|---|---|---|
rsq | standard | 0.1516 | model A |
rsq | standard | 0.1301 | model B |
I have also produced validated R-square statistics for Model B within ols through the validate function
set.seed(39)
validate(modB_ols, method="boot", B=40)
index.orig training test optimism index.corrected n
R-square 0.1175 0.1390 0.0972 0.0418 0.0758 40
MSE 61.9705 61.6134 63.3966 -1.7832 63.7538 40
g 3.1486 3.4353 2.9453 0.4900 2.6586 40
Intercept 0.0000 0.0000 0.5255 -0.5255 0.5255 40
Slope 1.0000 1.0000 0.8666 0.1334 0.8666 40
Below I have presented a comparison of validated r-squared results (from ols) across the two models:
Model A: 0.0816
Model B: 0.0758
We can see that the validated r-squared results from ols are also better for model A.
I prefer the “main effects” model A based on:
my overall assessment of fit quality: slightly better mae, rmse, and r-square in testing data. It also had a slightly better validated r-squared using ols.
adherence to assumptions as seen in residuals: neither model A nor B adhered well to assumptions, so there wasn’t much of a choice here
adding the terms in the augmented model is not worth the complication nor additional degrees of freedom spent. In general, we prefer simple models.
The model parameters of model A are listed in the table below.
The 90% confidence interval crossed zero for all parameters except female and the income category 20-34K.
tidy(modA_lm, conf.int = TRUE, conf.level = 0.90) %>%
select(term, estimate, conf.low, conf.high) %>% kable(digits=3)
term | estimate | conf.low | conf.high |
---|---|---|---|
(Intercept) | 5.141 | 3.057 | 7.226 |
physhealth | -0.044 | -0.103 | 0.014 |
income10- 19k | -0.713 | -2.812 | 1.385 |
income20-34K | 4.081 | 1.808 | 6.353 |
income35 - 74K | 1.211 | -1.290 | 3.711 |
income75K+ | 2.126 | -0.774 | 5.026 |
female1 | -3.522 | -4.881 | -2.163 |
treated0 | -0.404 | -1.902 | 1.094 |
Below is a summary of the effect sizes. The variable female appears to have the strongest effect. My main predictor, physhealth, does not. The effect of income varies, but the largest contrast is between incomes of 20-34K and 10-19K.
summary(modA_ols, conf.int=0.9) %>% kable(dig=2)
Low | High | Diff. | Effect | S.E. | Lower 0.9 | Upper 0.9 | Type | |
---|---|---|---|---|---|---|---|---|
physhealth | 0 | 20 | 20 | -0.88 | 0.60 | -1.87 | 0.10 | 1 |
income - 0-9K:10- 19k | 2 | 1 | NA | 0.71 | 1.07 | -1.05 | 2.47 | 1 |
income - 20-34K:10- 19k | 2 | 3 | NA | 4.79 | 0.89 | 3.32 | 6.26 | 1 |
income - 35 - 74K:10- 19k | 2 | 4 | NA | 1.92 | 1.04 | 0.22 | 3.63 | 1 |
income - 75K+:10- 19k | 2 | 5 | NA | 2.84 | 1.28 | 0.73 | 4.95 | 1 |
female - 1:0 | 1 | 2 | NA | -3.52 | 0.69 | -4.66 | -2.38 | 1 |
treated - 0:1 | 1 | 2 | NA | -0.40 | 0.76 | -1.66 | 0.85 | 1 |
physhealth interpretation: if we have two subjects, Al and Bob, who are the same sex, have the same income category, and have the same treatment status, but Al's physhealth is 0 and Bob's physhealth is 20, then model A projects that Bob's alcdays will be 0.88 days lower than Al's.
The 90% CI for this physhealth effect size is (-1.87, 0.10), so holding everything else constant, the effect is not statistically significant.
This relationship between physical health and alcohol use is especially interesting to me in this hepatitis C population. According to the study "Predictors of mental and physical health in non-cirrhotic patients with viral hepatitis: a case control study", history of alcohol abuse was a strong predictor of physical health scores. However, with these data and this model, I was unable to show a relationship between the two.
Below is a plot that specifies the effect sizes for all elements of my final model, Model A.
plot(summary(modA_ols))
Below is the validated R-square statistic, which is 0.0816.
set.seed(35)
validate(modA_ols, method="boot", B=40)
index.orig training test optimism index.corrected n
R-square 0.1040 0.1127 0.0903 0.0224 0.0816 40
MSE 62.9173 61.0827 63.8839 -2.8012 65.7185 40
g 3.0957 3.1500 2.9310 0.2190 2.8767 40
Intercept 0.0000 0.0000 0.3187 -0.3187 0.3187 40
Slope 1.0000 1.0000 0.9341 0.0659 0.9341 40
The predicted alcdays for a person who had 1 bad physical health day in the past 30 days, has an annual income of $10-19K, is male, and has been treated for hepatitis C is 4.4 days (90% prediction interval: -8.8 to 17.6 days). Clearly, this prediction is not helpful at all. That is not a surprise: as we can see from the residual plots, there is a floor and a ceiling effect, and a linear regression model is not appropriate for this type of outcome. This poor model choice is demonstrated by our high MAE and negligible R-square. Instead, we should be using a model for count outcomes (e.g. a zero-inflated Poisson).
predict(modA_ols, newdata=tibble(physhealth=1, income = "10- 19k", female=0, treated=1), conf.int =0.90, conf.type = "individual")
$linear.predictors
1
4.383784
$lower
1
-8.836261
$upper
1
17.60383
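As a sketch of the count-model alternative suggested above (purely illustrative; this model is not part of the report, and it assumes the pscl package is installed):
library(pscl) # not among the packages loaded above
modA_zip <- zeroinfl(alcdays ~ physhealth + income + female + treated,
data = hcv_sh, dist = "poisson") # zero-inflated Poisson for the alcdays counts
summary(modA_zip)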
Below is a nomogram for model A.
plot(nomogram(modA_ols))
My question is: In people with a history of HCV, what are the odds of a person of non-white race receiving HCV treatment after adjusting for physical health, income, and sex?
The table below indicates that we have missing values for income, physhealth, and alcdays.
miss_var_summary(dat_hcv_clean) %>% knitr::kable()
variable | n_miss | pct_miss |
---|---|---|
income | 101 | 17.750439 |
physhealth | 23 | 4.042179 |
alcdays | 13 | 2.284710 |
ID | 0 | 0.000000 |
treated | 0 | 0.000000 |
white | 0 | 0.000000 |
female | 0 | 0.000000 |
I will impute values for the predictors income and physhealth.
set.seed(44)
hcv_l <- dat_hcv_clean %>%
data.frame() %>%
impute_pmm(., physhealth ~ female + white + treated) %>%
impute_cart(., income ~ female + white + treated + physhealth) %>% as_tibble()
Below we can see that I no longer have missingness for any of my predictors (income and physhealth have been imputed). It is okay that we still have missingness for alcdays, because it is not one of my predictors here.
miss_case_table(hcv_l) %>% kable()
n_miss_in_case | n_cases | pct_cases |
---|---|---|
0 | 556 | 97.71529 |
1 | 13 | 2.28471 |
Model Y predicts treated using the predictors white, physhealth, income, and female.
modY_lrm uses lrm() to fit the model:
Y <- datadist(hcv_l)
options(datadist = "Y")
modY_lrm <- lrm(treated ~ white + physhealth + income + female, data = hcv_l , x = TRUE, y = TRUE)
modY_glm uses glm() to fit the model:
modY_glm <- glm(treated ~ white + physhealth + income + female,
data = hcv_l,
family = binomial(link = logit))
Below is a table of the coefficients on a log odds scale:
tidy(modY_glm, conf.int=TRUE, conf.level = 0.9) %>%
select(term, estimate, std.error, conf.low, conf.high) %>% kable(digits=3)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | -0.835 | 0.295 | -1.330 | -0.357 |
white0 | -0.643 | 0.234 | -1.036 | -0.266 |
physhealth | 0.004 | 0.008 | -0.010 | 0.018 |
income10- 19k | -0.255 | 0.296 | -0.738 | 0.239 |
income20-34K | -0.019 | 0.317 | -0.538 | 0.507 |
income35 - 74K | -0.361 | 0.345 | -0.930 | 0.208 |
income75K+ | 0.019 | 0.400 | -0.647 | 0.674 |
female1 | 0.360 | 0.194 | 0.040 | 0.680 |
tidy(modY_glm, exponentiate=TRUE, conf.int=TRUE, conf.level = 0.9) %>%
select(term, estimate, std.error, conf.low, conf.high) %>% kable(digits=3)
term | estimate | std.error | conf.low | conf.high |
---|---|---|---|---|
(Intercept) | 0.434 | 0.295 | 0.265 | 0.700 |
white0 | 0.526 | 0.234 | 0.355 | 0.766 |
physhealth | 1.004 | 0.008 | 0.990 | 1.018 |
income10- 19k | 0.775 | 0.296 | 0.478 | 1.270 |
income20-34K | 0.981 | 0.317 | 0.584 | 1.660 |
income35 - 74K | 0.697 | 0.345 | 0.395 | 1.231 |
income75K+ | 1.019 | 0.400 | 0.524 | 1.962 |
female1 | 1.433 | 0.194 | 1.041 | 1.974 |
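As a quick check, these odds ratios are simply the exponentiated log-odds coefficients from the previous table; for example, for white0:
exp(-0.643) # = 0.526, the odds ratio for white0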
Below are key fit summary statistics, like the Nagelkerke R-square and the area under the ROC curve, as they are presented in the lrm output.
Both the R-square (0.038) and the C statistic (0.601) are very low.
modY_lrm
Logistic Regression Model
lrm(formula = treated ~ white + physhealth + income + female,
data = hcv_l, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 569 LR chi2 15.04 R2 0.038 C 0.601
1 410 d.f. 7 g 0.429 Dxy 0.201
0 159 Pr(> chi2) 0.0355 gr 1.536 gamma 0.204
max |deriv| 4e-11 gp 0.083 tau-a 0.081
Brier 0.196
Coef S.E. Wald Z Pr(>|Z|)
Intercept -0.8345 0.2951 -2.83 0.0047
white=0 -0.6428 0.2337 -2.75 0.0060
physhealth 0.0037 0.0085 0.44 0.6623
income=10- 19k -0.2553 0.2963 -0.86 0.3890
income=20-34K -0.0193 0.3169 -0.06 0.9515
income=35 - 74K -0.3605 0.3451 -1.04 0.2961
income=75K+ 0.0186 0.4003 0.05 0.9630
female=1 0.3601 0.1943 1.85 0.0638
Below is the code to augment modY_glm in order to get the predicted values.
hcv_aug <- augment(modY_glm, hcv_l, type.predict = "response")
I have plotted the modY_glm fitted values by observed treatment status below. The fitted values cluster well below 0.5 in both groups, and they tend to be somewhat larger for the subjects who were not actually treated.
ggplot(hcv_aug, aes(x = factor(treated), y = .fitted, col = factor(treated))) + geom_boxplot() +
geom_jitter(width = 0.1) + guides(col = FALSE)
Below is the confusion matrix. Rather than setting the cutoff at 0.5, I set it at 0.35 after evaluating the plot above; almost none of the fitted values reach 0.5, so the cutoff needed to be lower. (One caution worth noting: because treated was releveled with "1" first, glm takes "1" as the reference level and models Pr(treated = 0), so the fitted values estimate the probability of not having been treated. That helps explain why they sit well below 0.5 even though 72% of subjects were treated, and it should be kept in mind when reading the classification results below.)
hcv_aug %$%
confusionMatrix(
data = factor(.fitted >= 0.35),
reference = factor(treated == 1),
positive = "TRUE"
)
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 121 341
TRUE 38 69
Accuracy : 0.3339
95% CI : (0.2952, 0.3743)
No Information Rate : 0.7206
P-Value [Acc > NIR] : 1
Kappa : -0.0447
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.1683
Specificity : 0.7610
Pos Pred Value : 0.6449
Neg Pred Value : 0.2619
Prevalence : 0.7206
Detection Rate : 0.1213
Detection Prevalence : 0.1880
Balanced Accuracy : 0.4646
'Positive' Class : TRUE
Key results of the confusion matrix include: sensitivity 0.168, specificity 0.761, and positive predictive value 0.645.
Below is a nomogram which describes the model, modY_lrm.
plot(nomogram(modY_lrm))
Below is the Spearman rho squared plot to evaluate the predictive punch of each of my variables in model Y.
spear_modY <- spearman2(treated ~ white + physhealth + income + female,
data = hcv_l)
plot(spear_modY)
The Spearman rho-squared plot suggests I should spend my additional degrees of freedom in the following ways:
an interaction between white and female: 1 df
a restricted cubic spline with 4 knots for physhealth: 2 additional df beyond the linear term
I am fitting model Z with both lrm and glm:
modZ_lrm uses lrm() to fit the model:
zz <- datadist(hcv_l)
options(datadist = "zz")
modZ_lrm <- lrm(treated ~ white + rcs(physhealth,4) + income + female + white*female, data = hcv_l , x = TRUE, y = TRUE)
modZ_glm uses glm() to fit the model:
modZ_glm <- glm(treated ~ white + rcs(physhealth,4) + income + female + white*female,
data = hcv_l,
family = binomial(link = logit))
Below is a nomogram describing model Z
plot(nomogram(modZ_lrm))
Below is a plot of the effects in Model Z. We can see that the only predictor which may provide value is female.
plot(summary(modZ_lrm))
Since model Y is a subset of model Z, I can compare these models with ANOVA tests.
anova(modY_glm, modZ_glm)
Analysis of Deviance Table
Model 1: treated ~ white + physhealth + income + female
Model 2: treated ~ white + rcs(physhealth, 4) + income + female + white *
female
Resid. Df Resid. Dev Df Deviance
1 561 659.13
2 558 657.02 3 2.1105
The addition of a restricted cubic spline with 4 knots for physhealth and an interaction between female and white reduces the lack of fit by 2.11 points, at a cost of 3 degrees of freedom. Thus, there is no statistically detectable improvement in prediction.
I have produced a confusion matrix using the same prediction rule that I used for Model Y (.fitted >= 0.35).
hcv_augZ <- augment(modZ_glm, hcv_l, type.predict = "response")
hcv_augZ %$%
confusionMatrix(
data = factor(.fitted >= 0.35),
reference = factor(treated == 1),
positive = "TRUE"
)
Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 114 332
TRUE 45 78
Accuracy : 0.3374
95% CI : (0.2986, 0.3779)
No Information Rate : 0.7206
P-Value [Acc > NIR] : 1
Kappa : -0.0598
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.1902
Specificity : 0.7170
Pos Pred Value : 0.6341
Neg Pred Value : 0.2556
Prevalence : 0.7206
Detection Rate : 0.1371
Detection Prevalence : 0.2162
Balanced Accuracy : 0.4536
'Positive' Class : TRUE
Below is a comparison of the specificity, sensitivity, and PPV for Models Y and Z.
Confusion Matrix Statistic | Model Y value | Model Z value |
---|---|---|
specificity | 0.7610 | 0.7170 |
sensitivity | 0.1683 | 0.1902 |
PPV | 0.6449 | 0.6341 |
We can see that only the sensitivity improves with model Z.
Below are the key fit summary statistics, like the Nagelkerke R-square and the area under the ROC curve, as presented in the lrm output for model Z.
Both the R-square (0.043) and the C statistic (0.611) are very low. These values are negligibly better than model Y's, as seen in the table below.
Table 1: index fit quality measures
Key Fit Statistics | Model Y value | Model Z value |
---|---|---|
R2 | 0.038 | 0.043 |
Brier | 0.196 | 0.196 |
C | 0.601 | 0.611 |
The values for model Z came from:
modZ_lrm
Logistic Regression Model
lrm(formula = treated ~ white + rcs(physhealth, 4) + income +
female + white * female, data = hcv_l, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 569 LR chi2 17.15 R2 0.043 C 0.611
1 410 d.f. 10 g 0.453 Dxy 0.222
0 159 Pr(> chi2) 0.0710 gr 1.574 gamma 0.225
max |deriv| 2e-11 gp 0.089 tau-a 0.089
Brier 0.196
Coef S.E. Wald Z Pr(>|Z|)
Intercept -0.8785 0.3083 -2.85 0.0044
white=0 -0.3819 0.3070 -1.24 0.2136
physhealth 0.0599 0.1174 0.51 0.6102
physhealth' -3.2787 5.5528 -0.59 0.5549
physhealth'' 3.7291 6.2803 0.59 0.5527
income=10- 19k -0.2693 0.2988 -0.90 0.3673
income=20-34K -0.0717 0.3210 -0.22 0.8232
income=35 - 74K -0.3824 0.3469 -1.10 0.2703
income=75K+ 0.0205 0.4022 0.05 0.9593
female=1 0.4861 0.2228 2.18 0.0291
white=0 * female=1 -0.5896 0.4717 -1.25 0.2113
In summary, Model Y is more effective than the augmented model, Model Z, at describing the available set of observations once we correct for optimism.
The fit quality, as judged by the Nagelkerke \(R^2\) (higher = better), Brier score (lower = better), and C statistic (higher = better), is slightly better for Model Y after bootstrap validation (Table 2).
Table 2: correcting for over-optimism through bootstrap validation
Corrected Summary | Model Y value | Model Z value |
---|---|---|
Corrected Nagelkerke \(R^2\) | 0.0107 | -0.0018 |
Corrected Brier Score | 0.2012 | 0.2041 |
Corrected C | 0.5665 | 0.5576 |
Corrected Dxy | 0.1330 | 0.1151 |
Corrected C statistics calculations:
Model Y: 0.5 + (0.1330/2) = 0.5665
Model Z : 0.5 + (0.1151/2) = 0.5576
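As a sketch, those two calculations in R, using the corrected Dxy values from the validate() output shown below:
0.5 + (0.1330 / 2) # corrected C for Model Y = 0.5665
0.5 + (0.1151 / 2) # corrected C for Model Z = 0.55755, which rounds to 0.5576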
All of these results came from:
set.seed(1001)
validate(modY_lrm, B = 50)
index.orig training test optimism index.corrected n
Dxy 0.2004 0.2412 0.1737 0.0674 0.1330 50
R2 0.0376 0.0537 0.0268 0.0269 0.0107 50
Intercept 0.0000 0.0000 -0.2668 0.2668 -0.2668 50
Slope 1.0000 1.0000 0.7120 0.2880 0.7120 50
Emax 0.0000 0.0000 0.1274 0.1274 0.1274 50
D 0.0247 0.0364 0.0170 0.0194 0.0053 50
U -0.0035 -0.0035 0.0030 -0.0065 0.0030 50
Q 0.0282 0.0399 0.0140 0.0259 0.0023 50
B 0.1964 0.1943 0.1991 -0.0049 0.2012 50
g 0.4292 0.5058 0.3520 0.1538 0.2754 50
gp 0.0826 0.0969 0.0688 0.0281 0.0545 50
set.seed(1000)
validate(modZ_lrm, B = 50)
index.orig training test optimism index.corrected n
Dxy 0.2210 0.2831 0.1772 0.1059 0.1151 50
R2 0.0428 0.0724 0.0278 0.0446 -0.0018 50
Intercept 0.0000 0.0000 -0.3567 0.3567 -0.3567 50
Slope 1.0000 1.0000 0.6110 0.3890 0.6110 50
Emax 0.0000 0.0000 0.1816 0.1816 0.1816 50
D 0.0284 0.0499 0.0177 0.0322 -0.0038 50
U -0.0035 -0.0035 0.0078 -0.0114 0.0078 50
Q 0.0319 0.0534 0.0099 0.0435 -0.0116 50
B 0.1957 0.1914 0.1998 -0.0084 0.2041 50
g 0.4535 0.5939 0.3568 0.2372 0.2163 50
gp 0.0893 0.1135 0.0703 0.0432 0.0462 50
I prefer the main effects model, Model Y, based on: its slightly better validated fit statistics (Table 2), the absence of any detectable improvement in fit from the extra terms in Model Z, and a general preference for simpler models.
Below is a listing of the model parameters for model Y fit to the entire data set (after single imputation) in terms of odds ratios, with 90% confidence intervals
tidy(modY_glm, exponentiate=TRUE, conf.int=TRUE, conf.level = 0.9) %>%
select(term, estimate, conf.low, conf.high) %>% filter(term != "(Intercept)") %>% kable(digits=3)
term | estimate | conf.low | conf.high |
---|---|---|---|
white0 | 0.526 | 0.355 | 0.766 |
physhealth | 1.004 | 0.990 | 1.018 |
income10- 19k | 0.775 | 0.478 | 1.270 |
income20-34K | 0.981 | 0.584 | 1.660 |
income35 - 74K | 0.697 | 0.395 | 1.231 |
income75K+ | 1.019 | 0.524 | 1.962 |
female1 | 1.433 | 1.041 | 1.974 |
We can see that the only coefficients that do not cross 1, and thus may provide some predictive value, are white and female. I actually find it encouraging that income was not found to be a useful predictor here: 8 weeks of HCV treatment costs around $200,000, so it is heartening that, by this model and data set, respondents were equally likely to be treated regardless of income. physhealth also was not helpful in predicting treatment status.
Below are the effect sizes for all elements of model Y both numerically and graphically.
kable(summary(modY_lrm, conf.int=0.90), digits=3)
Low | High | Diff. | Effect | S.E. | Lower 0.9 | Upper 0.9 | Type | |
---|---|---|---|---|---|---|---|---|
physhealth | 0 | 20 | 20 | 0.074 | 0.170 | -0.205 | 0.353 | 1 |
Odds Ratio | 0 | 20 | 20 | 1.077 | NA | 0.815 | 1.423 | 2 |
white - 0:1 | 1 | 2 | NA | -0.643 | 0.234 | -1.027 | -0.258 | 1 |
Odds Ratio | 1 | 2 | NA | 0.526 | NA | 0.358 | 0.772 | 2 |
income - 0-9K:10- 19k | 2 | 1 | NA | 0.255 | 0.296 | -0.232 | 0.743 | 1 |
Odds Ratio | 2 | 1 | NA | 1.291 | NA | 0.793 | 2.102 | 2 |
income - 20-34K:10- 19k | 2 | 3 | NA | 0.236 | 0.257 | -0.187 | 0.659 | 1 |
Odds Ratio | 2 | 3 | NA | 1.266 | NA | 0.829 | 1.933 | 2 |
income - 35 - 74K:10- 19k | 2 | 4 | NA | -0.105 | 0.292 | -0.586 | 0.376 | 1 |
Odds Ratio | 2 | 4 | NA | 0.900 | NA | 0.557 | 1.456 | 2 |
income - 75K+:10- 19k | 2 | 5 | NA | 0.274 | 0.357 | -0.313 | 0.861 | 1 |
Odds Ratio | 2 | 5 | NA | 1.315 | NA | 0.731 | 2.366 | 2 |
female - 1:0 | 1 | 2 | NA | 0.360 | 0.194 | 0.041 | 0.680 | 1 |
Odds Ratio | 1 | 2 | NA | 1.433 | NA | 1.041 | 1.973 | 2 |
Some interesting findings based on the effect sizes shown below:
Non-White race is associated with lower odds of treatment compared to White race, which is significant even at the 99% confidence level. Not only is this statistically significant, I think it is also clinically meaningful.
There is no meaningful relationship between income and history of HCV treatment. I think that's awesome considering how expensive HCV treatment is and the barriers to accessing it.
Female patients are more likely to have been treated than male patients (holding all else equal), but this is not significant at the 95% or 99% confidence levels.
Physical health does not add any value to predicting whether or not someone has been treated for HCV.
plot(summary(modY_lrm))
The white interpretation is this: if we have two patients with a history of hepatitis C, Luke and Leroy, who have the same physhealth, income, and sex (female), but Luke is not White and Leroy is, then model Y projects that Luke's odds of having been treated for hepatitis C will be 0.526 times Leroy's odds of having been treated for HCV.
Holding physhealth, female, and income constant, being non-White appears to be associated with decreased odds of having been treated for HCV. Note, too, that the confidence interval for this odds ratio lies entirely below 1 (90% CI 0.358 to 0.772).
Below is the ROC curve for model Y:
roc.modY <-roc(hcv_l$treated ~ predict(modY_glm, type="response"),
ci = TRUE)
Setting levels: control = 1, case = 0
Setting direction: controls < cases
plot(roc.modY)
Model Y is not a great classifier: its ROC curve lies quite close to the diagonal line we would see if we were guessing completely at random (AUC = 0.6002).
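The AUC and its confidence interval can be pulled directly from the pROC object (a sketch; output not shown):
auc(roc.modY) # area under the ROC curve
ci.auc(roc.modY) # confidence interval for the AUC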
The validated Nagelkerke R-square and C statistic for model Y, which are 0.0107 and 0.5665, respectively, are also shown in table 2.
Table 2: correcting for over-optimism through bootstrap validation
Corrected Summary | Model Y value | Model Z value |
---|---|---|
Corrected Nagelkerke \(R^2\) | 0.0107 | -0.0018 |
Corrected Brier Score | 0.2012 | 0.2041 |
Corrected C | 0.5665 | 0.5576 |
Corrected Dxy | 0.1330 | 0.1151 |
Corrected C statistics calculations:
Model Y: 0.5 + (0.1330/2) = 0.5665
Model Z : 0.5 + (0.1151/2) = 0.5576
This information was obtained from:
set.seed(1001)
validate(modY_lrm, B = 50)
index.orig training test optimism index.corrected n
Dxy 0.2004 0.2412 0.1737 0.0674 0.1330 50
R2 0.0376 0.0537 0.0268 0.0269 0.0107 50
Intercept 0.0000 0.0000 -0.2668 0.2668 -0.2668 50
Slope 1.0000 1.0000 0.7120 0.2880 0.7120 50
Emax 0.0000 0.0000 0.1274 0.1274 0.1274 50
D 0.0247 0.0364 0.0170 0.0194 0.0053 50
U -0.0035 -0.0035 0.0030 -0.0065 0.0030 50
Q 0.0282 0.0399 0.0140 0.0259 0.0023 50
B 0.1964 0.1943 0.1991 -0.0049 0.2012 50
g 0.4292 0.5058 0.3520 0.1538 0.2754 50
gp 0.0826 0.0969 0.0688 0.0281 0.0545 50
If Harry is a white male with an annual income between 10-19K and had 3 poor physical health days in the past month, then his estimated probability of history of HCV treatment (Pr(treated = 1), or π) is 25.4%
predict(modY_glm, newdata = data.frame(physhealth = 3, income = "10- 19k", white = "1", female = "0"), type = "response")
1
0.2537514
If Sally is a non-white female with an annual income between 20-34K and had 0 poor physical health days in the past month, then her estimated probability of history of HCV treatment (Pr(treated = 1), or π) is 24.3%
predict(modY_glm, newdata = data.frame(physhealth = 0, income = "20-34K", white = "0", female = "1"), type = "response")
1
0.2429642
Below is a nomogram describing model Y.
plot(nomogram(modY_lrm, fun = plogis, funlabel = "Pr(treated)"))
My two research questions were:
In people with hepatitis C, how well does physical health predict the number of days a person drinks in the past month after adjusting for income, sex, and history of HCV treatment?
In people with a history of HCV, how do the odds of receiving HCV treatment for a non-White person compare to the odds of receiving treatment for a White person after adjusting for physical health, income, and sex?
Answer to question 1
Physical health does not provide any value in predicting the number of days a person drinks in the past month, after adjusting for income, sex, and history of HCV treatment. The 90% confidence interval for the physical health coefficient crosses 0. If the slope were in fact zero, knowing physical health information would be of no additional value in predicting alcohol days over just guessing the mean of alcohol days. We can also see from the Spearman rho-squared plot that the physical health variable has the least predictive punch of all the variables included in the model.
Answer to question 2
In people with a history of HCV, the odds of a person of non-White race receiving HCV treatment are 0.526 (90% CI 0.355, 0.766) times the odds of a person of White race receiving treatment, holding physical health, income, and sex constant.
Discussion of thoughts
Project 1 allowed me to reflect on my substantial room to grow as a statistician, in terms of both my views about statistical significance and how I use my "researcher degrees of freedom". I grasped how difficult it is not to draw conclusions based on whether an effect was statistically significant. Unfortunately, I used the terms "statistically detectable" or "statistically significant" 8 times; I should be focusing on effect sizes. I stated I was thrilled that income didn't have an impact on HCV treatment status, which in reality may only be a product of my sample size and of how I used my researcher degrees of freedom in categorizing. I wish I had known that BRFSS was not the optimal data source for my research questions. Because of the limited sample size, the lack of many relevant predictors, and lots of missingness, I was confronted with forking paths: how to impute, which 'covariates' to include, and how to categorize. However, it's encouraging that as I learn more, I can see how I could improve. For example, major improvements could include using an appropriate analysis for the alcdays outcome (e.g. a zero-inflated Poisson) or multiple imputation with mice rather than aregImpute, which was one of the most confusing parts.
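As a sketch of the multiple-imputation alternative mentioned above (hypothetical; mice was not used in this report, and the variable roles are assumed from the analyses above):
library(mice) # not among the packages loaded above
set.seed(432)
imp <- mice(dat_hcv_clean %>% select(-ID), m = 10, printFlag = FALSE) # impute, leaving the ID column out
fits <- with(imp, glm(treated ~ white + physhealth + income + female,
family = binomial(link = logit))) # refit Model Y in each imputed data set
summary(pool(fits)) # pool the results across imputations
# note: with levels c("1", "0"), glm models Pr(treated = "0") here, as in Model Y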
I am certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security.
1.) dataset: https://www.cdc.gov/brfss/smart/smart_2019.html
2.) codebook: https://www.cdc.gov/brfss/annual_data/2019/pdf/codebook19_llcp-v2-508.HTML
3.) Ashrafi M, Modabbernia A, Dalir M, Taslimi S, Karami M, Ostovaneh MR, Malekzadeh R, Poustchi H. Predictors of mental and physical health in non-cirrhotic patients with viral hepatitis: a case control study. J Psychosom Res. 2012 Sep;73(3):218-24. doi: 10.1016/j.jpsychores.2012.06.006. Epub 2012 Jul 16. PMID: 22850263.
4.) Marcus JL, Hurley LB, Chamberland S, et al. Disparities in Initiation of Direct-Acting Antiviral Agents for Hepatitis C Virus Infection in an Insured Population. Public Health Rep. 2018;133(4):452-460. doi:10.1177/0033354918772059
xfun::session_info()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8
Package version:
abind_1.4-5 askpass_1.1 assertthat_0.2.1 backports_1.1.9
base64enc_0.1-3 BH_1.72.0.3 blob_1.2.1 boot_1.3.26
broom_0.7.0 callr_3.4.3 car_3.0-9 carData_3.0-4
caret_6.0-86 cellranger_1.1.0 checkmate_2.0.0 class_7.3-18
cli_2.0.2 clipr_0.7.0 cluster_2.1.0 codetools_0.2-18
colorspace_1.4-1 compiler_4.0.4 conquer_1.0.2 cpp11_0.2.1
crayon_1.3.4 crosstalk_1.1.0.1 curl_4.3 data.table_1.13.0
DBI_1.1.0 dbplyr_1.4.4 desc_1.2.0 digest_0.6.25
dplyr_1.0.2 e1071_1.7-3 ellipsis_0.3.1 evaluate_0.14
fansi_0.4.1 farver_2.0.3 forcats_0.5.0 foreach_1.5.0
foreign_0.8-81 Formula_1.2-3 fs_1.5.0 furrr_0.1.0
future_1.18.0 generics_0.0.2 GGally_2.0.0 ggdendro_0.1.21
ggforce_0.3.2 ggformula_0.9.4 ggplot2_3.3.2 ggrepel_0.8.2
ggstance_0.3.4 globals_0.12.5 glue_1.4.2 gower_0.2.2
graphics_4.0.4 grDevices_4.0.4 grid_4.0.4 gridExtra_2.3
gtable_0.3.0 haven_2.3.1 highr_0.8 Hmisc_4.4-1
hms_0.5.3 htmlTable_2.0.1 htmltools_0.5.1.1 htmlwidgets_1.5.1
httr_1.4.2 ipred_0.9-9 isoband_0.2.2 iterators_1.0.12
janitor_2.0.1 jpeg_0.1-8.1 jsonlite_1.7.0
[ reached getOption("max.print") -- omitted 122 entries ]