Load some packages and data from the basic vignette
library("qgcomp")
library("ggplot2")
library("splines")
data("metals", package="qgcomp")
# using data from the intro vignette
Xnm <- c(
'arsenic','barium','cadmium','calcium','chromium','copper',
'iron','lead','magnesium','manganese','mercury','selenium','silver',
'sodium','zinc'
)
covars = c('nitrate','nitrite','sulfate','ph', 'total_alkalinity','total_hardness')
cormat = cor(metals[,Xnm])
idx = which(cormat>0.6 & cormat <1.0, arr.ind = TRUE)
newXnm = unique(rownames(idx)) # iron, lead, and cadmium
The qgcomp package utilizes the Cox proportional hazards model as the underlying model for time-to-event analysis. The interpretation of a qgcomp.cox.noboot fit parameter is a conditional (on confounders) hazard ratio for simultaneously increasing all exposures by one quantile. The qc.survfit1 object demonstrates a time-to-event analysis with qgcomp.cox.noboot. The default plot is similar to that of qgcomp.glm.noboot, in that it displays the weights and an overall mixture effect.
# For comparison, first fit a standard Cox model with the individual exposures.
# The non-bootstrapped qgcomp version below estimates a confounder-conditional
# mixture hazard ratio.
survival::coxph(survival::Surv(disease_time, disease_state) ~ iron + lead + cadmium +
arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc +
mage35,
data=metals)
## Call:
## survival::coxph(formula = survival::Surv(disease_time, disease_state) ~
## iron + lead + cadmium + arsenic + magnesium + manganese +
## mercury + selenium + silver + sodium + zinc + mage35,
## data = metals)
##
## coef exp(coef) se(coef) z p
## iron -0.056447 0.945117 0.156178 -0.361 0.7178
## lead 0.440735 1.553849 0.203264 2.168 0.0301
## cadmium 0.023325 1.023599 0.105502 0.221 0.8250
## arsenic -0.003812 0.996195 0.088897 -0.043 0.9658
## magnesium 0.099399 1.104507 0.064730 1.536 0.1246
## manganese -0.014065 0.986033 0.064197 -0.219 0.8266
## mercury -0.060830 0.940983 0.072918 -0.834 0.4042
## selenium -0.231626 0.793243 0.173655 -1.334 0.1823
## silver 0.043169 1.044114 0.070291 0.614 0.5391
## sodium 0.057928 1.059638 0.063883 0.907 0.3645
## zinc 0.057169 1.058835 0.047875 1.194 0.2324
## mage35 -0.458696 0.632107 0.238370 -1.924 0.0543
##
## Likelihood ratio test=23.52 on 12 df, p=0.02364
## n= 452, number of events= 205
qc.survfit1 <- qgcomp.cox.noboot(survival::Surv(disease_time, disease_state) ~ .,expnms=Xnm,
data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4)
qc.survfit1
## Scaled effect size (positive direction, sum of positive coefficients = 0.32)
## barium zinc magnesium chromium silver sodium iron
## 0.3432 0.1946 0.1917 0.1119 0.0924 0.0511 0.0151
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.554)
## selenium copper calcium arsenic manganese cadmium lead mercury
## 0.2705 0.1826 0.1666 0.1085 0.0974 0.0794 0.0483 0.0466
##
## Mixture log(hazard ratio) (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## psi1 -0.23356 0.24535 -0.71444 0.24732 -0.9519 0.3411
Marginal hazard ratios (and bootstrapped quantile g-computation in general) use a slightly different approach to effect estimation that makes them more computationally demanding than other qgcomp functions. To estimate a marginal hazard ratio, the underlying model is fit, and then new outcomes are simulated under the underlying model with a baseline hazard estimator (Efron's); this simulation requires a large sample (controlled by MCsize) for accuracy. The approach is similar to other g-computation approaches to survival analysis, but it uses the exact survival times rather than the discretized survival times that are common in most g-computation analyses. Plotting a qgcomp.cox.boot object (e.g. qc.survfit2) yields a set of estimated survival curves (assuming censoring and late entry at random, conditional on covariates in the model) that characterize conditional survival functions (i.e. treating competing risks as censored) at various levels of joint exposure, including the overall average, which may differ slightly from the observed survival curve but should more or less agree with it.
# bootstrapped version estimates a marginal structural model for the population average effect
#library(survival)
qc.survfit2 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ .,expnms=Xnm,
data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4,
B=5, MCsize=1000, parallel=TRUE, parplan=TRUE)
qc.survfit2
## Mixture log(hazard ratio) (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## psi1 -0.21232 0.26320 -0.72819 0.30355 -0.8067 0.4199
# testing proportional hazards (note that x=TRUE is not needed (and will cause an error if used))
survival::cox.zph(qc.survfit2$fit)
## chisq df p
## arsenic 0.18440 1 0.668
## barium 0.10819 1 0.742
## cadmium 0.05345 1 0.817
## calcium 0.00206 1 0.964
## chromium 1.23974 1 0.266
## copper 0.28518 1 0.593
## iron 3.46739 1 0.063
## lead 0.17575 1 0.675
## magnesium 2.12900 1 0.145
## manganese 0.58720 1 0.444
## mercury 0.00136 1 0.971
## selenium 0.15247 1 0.696
## silver 0.01040 1 0.919
## sodium 0.09352 1 0.760
## zinc 1.51261 1 0.219
## GLOBAL 9.82045 15 0.831
p2 = plot(qc.survfit2, suppressprint = TRUE)
p2 + labs(title="Linear log(hazard ratio), overall and exposure specific")
All bootstrapped functions in qgcomp allow parallelization via the parallel=TRUE parameter (demonstrated with the non-linear fit in qc.survfit3). Only 5 bootstrap iterations are used here, which is not nearly enough for inference, and parallel processing will actually be slower in this case because of the overhead of setting up the parallel processes.
qc.survfit3 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ . + .^2,expnms=Xnm,
data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4,
B=5, MCsize=1000, parallel=TRUE, parplan=TRUE)
qc.survfit3
## Mixture log(hazard ratio) (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## psi1 -0.18005 0.63284 -1.4204 1.0603 -0.2845 0.776
p3 = plot(qc.survfit3, suppressprint = TRUE)
p3 + labs(title="Non-linear log(hazard ratio) overall, linear exposure specific ln-HR")
Technical Note: this mode of usage is designed for simplicity. The implementation relies on the future and future.apply packages. The usage guidelines of the future package dictate that the user should control the future "plan", rather than having it embedded within functions as is done here (via parplan=TRUE). The slightly more advanced usage (which allows nesting within larger parallel schemes such as simulations) is demonstrated below by setting the parplan parameter to FALSE and explicitly specifying a "plan" outside of the qgcomp functions. This moves much of the parallel-processing overhead outside of the actual qgcomp functions. The final code block of this vignette shows how to exit the "plan" and return to standard evaluation via plan(sequential); waiting until the end to do so means that subsequent parallel calls (with parplan=FALSE) reuse the existing workers, and so have lower overhead and run slightly faster.
future::plan(future::multisession) # parallel evaluation
qc.survfit3 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ . + .^2,expnms=Xnm,
data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4,
B=5, MCsize=1000, parallel=TRUE, parplan=FALSE)
qc.survfit3
## Mixture log(hazard ratio) (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## psi1 -0.20352 0.36893 -0.92662 0.51958 -0.5516 0.5812
p3 = plot(qc.survfit3, suppressprint = TRUE)
p3 + labs(title="Non-linear log(hazard ratio) overall, linear exposure specific ln-HR")
Returning to substance: while qgcomp.cox.boot fits a smooth hazard ratio function, the hazard ratios contrasting specific quantiles with a referent quantile can also be obtained, as demonstrated with qc.survfit4. As in qgcomp.glm.boot plots, the conditional model fit and the MSM fit are overlaid as a way to judge how well the MSM approximates the conditional fit (and whether, for example, non-linear terms should be added to or removed from the overall fit via the degree parameter). We note that we know of no statistical test for quantifying the difference between these two curves, so this judgment is left to user discretion; the plots are provided as visual aids for exploratory data analysis.
qc.survfit4 <- qgcomp.cox.boot(Surv(disease_time, disease_state) ~ . + .^2,expnms=Xnm,
data=metals[,c(Xnm, 'disease_time', 'disease_state')], q=4,
B=5, MCsize=1000, parallel=TRUE, parplan=FALSE, degree=2)
## Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 1,2 ; coefficient may be infinite.
qc.survfit4
## Mixture log(hazard ratio) (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## psi1 -2.76486 13.83940 -29.890 24.3599 -0.1998 0.8417
## psi2 0.86128 4.63390 -8.221 9.9436 0.1859 0.8526
# examining the overall hazard ratio as a function of overall exposure
hrs_q = exp(matrix(c(0,0,1,1,2,4,3,9), ncol=2, byrow=TRUE)%*%qc.survfit4$msmfit$coefficients)
colnames(hrs_q) = "Hazard ratio"
print("Hazard ratios by quartiles (min-25%,25-50%, 50-75%, 75%-max)")
## [1] "Hazard ratios by quartiles (min-25%,25-50%, 50-75%, 75%-max)"
## Hazard ratio
## [1,] 1.0000000
## [2,] 0.1490343
## [3,] 0.1243570
## [4,] 0.5809675
p4 = plot(qc.survfit4, suppressprint = TRUE)
p4 + labs(title="Non-linear log(hazard ratio), overall and exposure specific")
Testing proportional hazards is somewhat complicated with respect to interpretation. Consider first a linear fit from qgcomp.cox.noboot. Because the mixture effect in a linear qgcomp fit is the sum of multiple parameters from the underlying model, it is not clear how proportional hazards might best be tested for the mixture. One could examine test statistics for each exposure, but the test may indicate non-proportional hazards for some exposures and not for others.
The "GLOBAL" test from the cox.zph function in the survival package comes closest to what we might want: it gives an overall assessment of non-proportional hazards for all model terms simultaneously (including non-mixture covariates). While this non-specificity may seem undesirable, it is not only the mixture for which proportional hazards matter, so a global test of fit via GLOBAL is both useful and easy to interpret.
# testing proportional hazards (must set x=TRUE in function call)
qc.survfit1ph <- qgcomp.cox.noboot(survival::Surv(disease_time, disease_state) ~ .,expnms=Xnm,
data=metals[,c(Xnm, 'disease_time', 'disease_state', "mage35")], q=4,
x=TRUE)
survival::cox.zph(qc.survfit1ph$fit)
## chisq df p
## arsenic 1.57e-01 1 0.691
## barium 1.28e-01 1 0.721
## cadmium 5.14e-02 1 0.821
## calcium 9.16e-04 1 0.976
## chromium 1.25e+00 1 0.263
## copper 3.42e-01 1 0.559
## iron 3.51e+00 1 0.061
## lead 1.59e-01 1 0.690
## magnesium 2.08e+00 1 0.149
## manganese 5.78e-01 1 0.447
## mercury 4.87e-06 1 0.998
## selenium 1.32e-01 1 0.716
## silver 1.30e-02 1 0.909
## sodium 1.06e-01 1 0.745
## zinc 1.53e+00 1 0.216
## mage35 1.73e-02 1 0.895
## GLOBAL 9.93e+00 16 0.870
For a potentially non-linear/non-additive fit from qgcomp.cox.boot, the issue is slightly more complicated: the algorithm fits both the underlying model and a marginal structural model based on predictions from the underlying model. In order for those predictions to yield valid causal inference, the underlying model must be correct, which implies that proportional hazards hold. The marginal structural model then proceeds assuming the underlying model is correct. There is currently no simple way to allow for non-proportional hazards in the marginal structural model, but they can be accommodated in the conditional model via standard approaches such as time-by-exposure interaction terms (a brief sketch follows the code below). This is a rich field, and further discussion is beyond the scope of this document.
# testing global proportional hazards for model (note that x=TRUE is not needed (and will cause an error if used))
phtest3 = survival::cox.zph(qc.survfit3$fit)
phtest3$table[dim(phtest3$table)[1],, drop=FALSE]
## chisq df p
## GLOBAL 206.5578 120 1.53907e-06
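As an illustration of the standard approach mentioned above (and not a feature of qgcomp itself), a time-by-exposure interaction can be added to the conditional Cox model using the tt() mechanism in survival::coxph. This is a minimal sketch only: it allows the log hazard ratio for a single exposure (iron, chosen arbitrarily) to vary with log-time, and it addresses the conditional model rather than the qgcomp marginal structural model.
# minimal sketch (not part of qgcomp): let the conditional log(HR) for iron vary with log(time)
# (assumes all event/censoring times are positive)
cond_nph <- survival::coxph(
  survival::Surv(disease_time, disease_state) ~ iron + tt(iron) + lead + cadmium,
  data = metals,
  tt = function(x, t, ...) x * log(t) # time-varying coefficient for iron
)
summary(cond_nph)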
Late entry and counting-process style data will currently run and produce results in the qgcomp.cox functions. There has been some testing of this in limited settings, but it remains an experimental feature that may not be valid in all cases, so it is not documented here. When using qgcomp with data subject to late entry, or with counting-process style data, results should be validated through other means wherever possible.
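As a rough illustration only (and under the assumption, per the note above, that qgcomp.cox.noboot accepts counting-process Surv() input), one simple validation check is to supply a dummy entry time of zero, which should reproduce the standard right-censored fit in qc.survfit1.
# experimental: counting-process input with a dummy entry time of 0
# (assumes all disease_time values are strictly positive)
metals_le <- metals[, c(Xnm, "disease_time", "disease_state")]
metals_le$entry_time <- 0
qc.le <- qgcomp.cox.noboot(
  survival::Surv(entry_time, disease_time, disease_state) ~ .,
  expnms = Xnm, data = metals_le, q = 4
)
qc.le # psi1 should match qc.survfit1 above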
Clustering on the individual or group level means that there are individual- or group-level characteristics which induce covariance between observations (e.g. the within-individual variance of an outcome may be much lower than the between-individual variance). For linear models, the error term is assumed to be independent between observations, and clustering breaks this assumption. Ways to relax this assumption include empirical variance estimation and cluster-appropriate robust variance estimation (e.g. through the sandwich package in R). Another way is cluster-based bootstrapping, which resamples clusters rather than individual observations. qgcomp.glm.boot and qgcomp.glm.ee can be leveraged to produce cluster-consistent estimates of standard errors for independent effects of exposure as well as for the effect of the exposure mixture as a whole. This is done using the id parameter of qgcomp.glm.boot/ee (which can only handle a single clustering variable, and so may not be efficient for nested clustering, for example).
Below is a simple example with one simulated exposure. First, the exposure data are 'pre-quantized' so that one can verify the standard errors using other means (this is not intended as a suggested practice). Next, the data are analyzed using a 1-component mixture in qgcomp; again, this is for verification purposes. The qgcomp.glm.noboot result yields a naive standard error of 0.0310 for the mixture effect:
set.seed(2123)
N = 250
t = 4
dat <- data.frame(row.names = 1:(N*t))
dat <- within(dat, {
id = do.call("c", lapply(1:N, function(x) rep(x, t)))
u = do.call("c", lapply(1:N, function(x) rep(runif(1), t)))
x1 = rnorm(N, u)
y = rnorm(N) + u + x1
})
# pre-quantize
expnms = c("x1")
datl = quantize(dat, expnms = expnms)
qgcomp.glm.noboot(y~ x1, data=datl$dat, family=gaussian(), q = NULL)
## Including all model terms as exposures of interest
## Scaled effect size (positive direction, sum of positive coefficients = 0.955)
## x1
## 1
##
## Scaled effect size (negative direction, sum of negative coefficients = 0)
## None
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.463243 0.057934 -0.57679 -0.34969 -7.996 3.553e-15
## psi1 0.955015 0.031020 0.89422 1.01581 30.787 < 2.2e-16
# neither of these ways yields appropriate clustering
#qgcomp.glm.noboot(y~ x1, data=datl$dat, id="id", family=gaussian(), q = NULL)
#qgcomp.glm.boot(y~ x1, data=datl$dat, family=gaussian(), q = NULL, MCsize=1000)
while the qgcomp.glm.boot result (not run, but shown in a comment below with MCsize=5000, B=500) yields a corrected standard error of 0.0398, which is close to the robust standard error estimate of 0.0409 from qgcomp.glm.ee. Both of these methods can provide clustering-appropriate inference, whereas uncorrected estimates from qgcomp.glm.noboot cannot. Here the standard errors from the uncorrected fit are too low, but this will not always be the case. Use of an external package to obtain the same sandwich (robust) standard error estimate of 0.0409 is also shown (commented out) below; this agreement holds here only because the mixture contains a single exposure. Note that the sandwich variance estimator applies to a conditional model parameter, whereas the estimate from qgcomp.glm.ee is from an MSM, so results will differ from standard sandwich estimators when there are covariates or multiple exposures.
# clustering-appropriate inference by specifying the id parameter (bootstrap version, not run)
#qgcomp.glm.boot(y~ x1, data=datl$dat, id="id", family=gaussian(), q = NULL, MCsize=1000, B = 5)
eefit <- qgcomp.glm.ee(y~ x1, data=datl$dat, id="id", family=gaussian(), q = NULL)
## Including all model terms as exposures of interest
##
## Note: using all possible values of exposure as the
## intervention values
## Mixture slope parameters (robust CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.463243 0.078483 -0.61707 -0.30942 -5.9024 4.905e-09
## psi1 0.955015 0.040831 0.87499 1.03504 23.3894 < 2.2e-16
#qgcomp.glm.boot(y~ x1, data=datl$dat, id="id", family=gaussian(), q = NULL, MCsize=1000, B = 500)
# Mixture slope parameters (bootstrap CI):
#
# Estimate Std. Error Lower CI Upper CI t value
# (Intercept) -0.4632 0.0730 -0.606 -0.32 3.3e-10
# psi1 0.9550 0.0398 0.877 1.03 0
# This can be verified using the `sandwich` package
#fitglm = glm(y~x1, data=datl$dat)
#sw.cov = sandwich::vcovCL(fitglm, cluster=~id, type = "HC0")[2,2]
#sqrt(sw.cov)
# [1] 0.0409
Returning to our original example (and adjusting for covariates), note that the default output for a qgcomp.*.noboot object includes the "sum of positive/negative coefficients." These can be interpreted as "partial mixture effects," or effects of the exposures whose coefficients share a particular direction. This is displayed graphically via a plot of the qgcomp "weights," where all exposures that contribute to a given partial effect appear on the same side of the plot (sketched below for the covariate-adjusted fit). Unfortunately, this does not yield valid inference for a true partial effect, because the partial effect is a parameter conditional on the fitted results and thus does not represent the type of a priori hypothesis that is amenable to hypothesis testing and confidence intervals. Another way to think about this is that it is a data-adaptive parameter and so is subject to overfitting issues similar to those that arise when making inference from step-wise variable selection procedures.
(qc.fit.adj <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, covars, 'y')], expnms=Xnm, family=gaussian()))
## Scaled effect size (positive direction, sum of positive coefficients = 0.441)
## calcium barium iron arsenic silver chromium cadmium zinc
## 0.76111 0.06133 0.05854 0.02979 0.02433 0.02084 0.01395 0.01195
## mercury sodium
## 0.01130 0.00688
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.104)
## copper magnesium lead manganese selenium
## 0.44654 0.42124 0.09012 0.03577 0.00631
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.460408 0.134007 -0.72306 -0.19776 -3.4357 0.0006477
## psi1 0.337539 0.089358 0.16240 0.51268 3.7774 0.0001805
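Though not shown in the original output, the weight plot described above can be produced directly from the fitted object; the positively and negatively weighted exposures appear on opposite sides of the plot.
# weight plot for the covariate-adjusted fit
plot(qc.fit.adj)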
Fortunately, there is a way towards estimation of "partial effects": sample splitting, in which the data are first randomly partitioned into a "training" and a "validation" data set. The assessment of whether a coefficient is positive or negative occurs in the "training" set, and effect estimation for the positive/negative partial effects then occurs in the validation data. Basic simulations show that such a procedure can yield valid (often called "honest") hypothesis tests and confidence intervals for partial effects, provided that no separate data exploration in the combined dataset is used to select the models. In the qgcomp package, the partitioning of the data into "training" and "validation" sets is done by the user, which prevents issues that may arise if this is done naively on a dataset that contains clusters (where we should partition based on clusters rather than observations; a sketch is given after the example below) or multiple observations per individual (all observations from an individual should be partitioned together). Here is an example of simple partitioning on a dataset with one observation per individual and no clustering. The downside of sample splitting is a loss of precision, because the final "validation" dataset comprises only a fraction of the original sample size; thus, estimation of partial effects is most appropriate with large sample sizes. We also note that these partial effects are only well defined when all effects are linear and additive, since otherwise whether a variable contributes to the positive or the negative partial effect could depend on the value of that variable. Valid estimation of "partial effects" is therefore limited to settings in which the qgcomp.*.noboot objects are used for inference.
# 40/60% training/validation split
set.seed(123211)
trainidx <- sample(1:nrow(metals), round(nrow(metals)*0.4))
valididx <- setdiff(1:nrow(metals),trainidx)
traindata <- metals[trainidx,]
validdata <- metals[valididx,]
dim(traindata) # 40% of total
## [1] 181 26
dim(validdata) # 60% of total
## [1] 271 26
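For reference (as noted above), when the data contain clusters the split should be made at the cluster level, so that all observations from a cluster end up in the same partition. Here is a minimal sketch using the simulated clustered data from the clustering example above (dat, with cluster indicator id):
# sample clusters (ids), not rows, into the training set
set.seed(123211)
allids <- unique(dat$id)
trainids <- sample(allids, round(length(allids) * 0.4))
traindata_cl <- dat[dat$id %in% trainids, ]
validdata_cl <- dat[!(dat$id %in% trainids), ]
length(unique(traindata_cl$id)) # 40% of clusters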
The qgcomp package then facilitates the analysis of these partitioned data to allow valid estimation and hypothesis testing of partial effects, via the qgcomp.partials function. Note that the variables with "negative effect sizes" differ slightly from those in the overall analysis given by the qc.fit object that represents our first-pass analysis of these data. This is to be expected and is a feature of the approach: different random subsets of the data will be expected to yield different estimated effects. If the true effect is null, the estimated effects will vary from positive to negative around the null, and sample splitting is an important way to distinguish estimates that reflect underlying patterns in the entire dataset from estimates that are simply due to the sampling variation inherent to small and moderate samples. Note that fitting on these smaller datasets can sometimes result in perfect collinearity of exposures, in which case setting bayes=TRUE may be necessary to apply a ridge penalty to the estimates.
splitres <- qgcomp.partials(
fun="qgcomp.glm.noboot", f=y~., q=4,
traindata=traindata[,c(Xnm, covars, "y")],validdata=validdata[,c(Xnm, covars, "y")], expnms=Xnm,
bayes=FALSE,
.fixbreaks = TRUE, .globalbreaks=FALSE
)
splitres
##
## Variables with positive effect sizes in training data: arsenic, barium, cadmium, calcium, chromium, iron, selenium, silver, sodium, zinc
## Variables with negative effect sizes in training data: copper, lead, magnesium, manganese, mercury
## Partial effect sizes estimated in validation data
## Positive direction Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.46347 0.16202 -0.78102 -0.14591 -2.8605 0.004573
## psi1 0.35150 0.10849 0.13887 0.56413 3.2400 0.001351
##
## Negative direction Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.020164 0.097018 -0.210316 0.16999 -0.2078 0.8355
## psi1 0.028852 0.063752 -0.096101 0.15380 0.4526 0.6512
Consistent with our overall results, the overall effect of metals on the outcome y is predominantly positive, driven mainly by calcium. The partial positive effect (psi1 = 0.35) is slightly attenuated relative to the sum of positive coefficients from the original fit (0.44), but is slightly larger than the overall effect from the original fit (psi1 = 0.34). We note that the effect direction of cadmium is negative in the validation data, even though it was selected based on a positive association in the training data. This suggests that its effect is close to the null, so that its direction depends on which subset of the data is used. This feature is what allows valid testing of hypotheses: under a global null in which no exposures have effects, variables will randomly switch effect directions between the training and validation datasets, yielding partial effect estimates close to the null with hypothesis tests that have appropriate type-1 error rates in large samples.
By default (subject to change), quantile cut points ("breaks") are defined within the training data and then applied to the validation data. You may also change this behavior to allow the breaks to be defined using quantiles from the entire dataset, which treats the quantiles as fixed. This is expected to improve stability in small samples and may eventually replace the default behavior, since the quantiles themselves are not generally treated as random variables within quantile g-computation. For this particular dataset (and seed value), this setting has little impact on the results.
splitres_alt <- qgcomp.partials(
fun="qgcomp.glm.noboot", f=y~., q=4,
traindata=traindata[,c(Xnm, covars, "y")],validdata=validdata[,c(Xnm, covars, "y")], expnms=Xnm,
bayes=FALSE,
.fixbreaks = TRUE, .globalbreaks=TRUE
)
splitres_alt
##
## Variables with positive effect sizes in training data: arsenic, barium, cadmium, calcium, chromium, iron, selenium, silver, sodium, zinc
## Variables with negative effect sizes in training data: copper, lead, magnesium, manganese, mercury
## Partial effect sizes estimated in validation data
## Positive direction Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.45550 0.16996 -0.78862 -0.12238 -2.6800 0.007832
## psi1 0.34492 0.11364 0.12219 0.56766 3.0352 0.002648
##
## Negative direction Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.0079779 0.0981166 -0.200283 0.18433 -0.0813 0.9353
## psi1 0.0328941 0.0648437 -0.094197 0.15999 0.5073 0.6124
One careful note: when there are multiple exposures with small positive or negative effects, the partial effects may be biased towards the null in studies with moderate or small sample sizes. This occurs because, in the training set, some exposures with small effects are likely to be misclassified with regard to their effect direction. In some instances, both the positive and negative partial effects can even point in the same direction; this occurs when individual effects are predominantly in one direction but some are small enough to have misclassified directions. As one example: if the overall effect is null, but a positive partial effect is driven strongly by one exposure and a balancing negative partial effect is driven by numerous weaker associations, the partial effect estimates will not sum to the overall effect, because the negative partial effect will experience more downward bias at typical sample sizes. Thus, when the overall effect does not equal the sum of the partial effects, there is likely some bias in at least one of the partial effect estimates. This is not a unique feature of quantile-based g-computation; it is also a concern for other methods that focus on estimation of partial effects, such as weighted quantile sum regression.
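As a rough numerical illustration of this check, using the estimates printed above (informal only, since the overall estimate comes from the full data while the partial effects come from the validation subset):
# informal check: compare the overall psi1 (qc.fit.adj) with the sum of the
# directional partial effects from the split-sample fit (splitres)
overall_psi <- 0.337539
partial_sum <- 0.35150 + 0.028852 # positive + "negative"-direction psi1 estimates
c(overall = overall_psi, sum_of_partials = partial_sum)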
The larger question of how (and whether it is worthwhile) to interpret partial effects is left to the analyst. For large datasets with well-characterized exposures, in which plausible subsets of exposures would be positively or negatively linearly associated with the outcome, the variables that partition into negative/positive partial effects may make some substantive sense. In the more realistic settings that typify exposure mixtures, the partitioning will often result in groups that do not entirely make sense. The "partial effect" yields the effect of increasing all exposures in the subset defined by positive coefficients in the training data, while holding all other exposures and confounders constant. Where this corresponds to real-world exposure patterns (e.g. all exposures in the positive partial effect share a source), it may be interpretable, roughly, as the effect of an action to intervene on the source of those exposures. In most settings, however, the interpretation will not be this clean and should not be expected to map onto potential real-world interventions. We note that this is not a shortcoming of the quantile g-computation method specifically, but rather part of the general messiness of working with exposure mixture data.
A more justifiable approach in terms of mapping effect estimates onto
potential real-world actions would be choosing subsets of exposures
based on prior subject matter knowledge, as we demonstrated above in
example 5. This does not require sample splitting, but it does come at
the expense of having to know more about the exposures and outcome under
analysis. For example, our simulated outcome y
may
represent some outcome that we would expect to increase with so-called
“essential” metals, or those that are necessary (at some small amount)
for normal human functioning, but it may decrease with “potentially
toxic” (non-essential) metals, or those that have no known biologic
function and are more likely to cause harm rather than improve
physiologic processes that lead to improved (larger) values of
y
. Qgcomp can be used to assess effects of these
“sub-mixtures.”
Here are results for the essential metals:
nonessentialXnm <- c(
'arsenic','barium','cadmium','chromium','lead','mercury','silver'
)
essentialXnm <- c(
'sodium','magnesium','calcium','manganese','iron','copper','zinc','selenium'
)
covars = c('nitrate','nitrite','sulfate','ph', 'total_alkalinity','total_hardness')
(qc.fit.essential <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, covars, 'y')], expnms=essentialXnm, family=gaussian()))
## Scaled effect size (positive direction, sum of positive coefficients = 0.357)
## calcium iron zinc selenium
## 0.908914 0.077312 0.013128 0.000646
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.0998)
## copper magnesium sodium manganese
## 0.4872 0.4304 0.0486 0.0338
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.340130 0.111562 -0.55879 -0.12147 -3.0488 0.002435
## psi1 0.257136 0.074431 0.11125 0.40302 3.4547 0.000604
Here are results for the non-essential metals:
(qc.fit.nonessential <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, covars, 'y')], expnms=nonessentialXnm, family=gaussian()))
## Scaled effect size (positive direction, sum of positive coefficients = 0.0751)
## barium arsenic silver mercury cadmium
## 0.2827 0.2494 0.2420 0.1763 0.0496
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.0129)
## lead chromium
## 0.888 0.112
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.055549 0.081509 -0.215304 0.10420 -0.6815 0.4959
## psi1 0.062267 0.052508 -0.040648 0.16518 1.1858 0.2363
As shown from these results, the essential metals and minerals demonstrate an overall positive joint association with the outcome (controlling for non-essentials), whereas the partial effect of the non-essential metals and minerals is close to null. This is similar to the result from the data-adaptive selection of partial effects demonstrated above, but it is directly interpretable in terms of how we might intervene (e.g. increasing consumption of foods that are higher in essential metals and minerals).
For outcomes modeled as 3+ discrete categories, qgcomp joint effect estimates are interpreted on the scale of the ratio of the probability of being in a given index category of the outcome to the probability of being in the referent category.
First, we’ll bring in data and create a multinomial outcome.
data("metals") # from qgcomp package
# create categorical outcome from the existing continuous outcome (usually, one will already exist)
metals$ycat = factor(quantize(metals, "y",q=4)$data$y, levels=c("0", "1", "2", "3"), labels=c("cct", "ccg", "aat", "aag"))
# restrict to smaller dataset for simplicity
smallmetals = metals[,c("ycat", "arsenic", "lead", "cadmium", "mage35")]
Next, fit the model.
### 1: Define mixture and underlying model ####
mixture = c("arsenic", "lead", "cadmium")
f2 = ycat ~ arsenic + lead + cadmium + mage35
rr = qgcomp.multinomial.noboot(
  f2,
  expnms = mixture,
  q = 4,
  data = smallmetals
)
rr2 = qgcomp.multinomial.boot(
  f2,
  expnms = mixture,
  q = 4,
  data = smallmetals,
  B = 2,         # set to higher values (>200) in general usage
  MCsize = 10000 # set to higher values in small samples
)
summary(rr)
## Reference outcome levels:
## cct ccg aat aag
## Weights
## arsenic lead cadmium
## ccg -0.19985072 -0.4065664 -0.3935829
## aat -0.02996557 1.0000000 -0.9700344
## aag -0.36389597 -0.3920415 -0.2440625
##
## Sum of positive coefficients
## ccg aat aag
## 0.000000000 0.006243471 0.000000000
## Sum of negative coefficients
## ccg aat aag
## -0.3255537 -0.1449220 -0.2362489
##
## Mixture slope parameters (Standard CI):
## Estimate Std. Error
## ccg.(Intercept) 0.341521 0.324928
## aat.(Intercept) 0.098838 0.330695
## aag.(Intercept) 0.261681 0.326332
## ccg.psi -0.325554 0.204573
## aat.psi -0.138679 0.203451
## aag.psi -0.236249 0.203446
summary(rr2)
## Reference outcome levels:
## cct ccg aat aag
## Mixture slope parameters (Bootstrap CI):
## Estimate Std. Error
## ccg.(Intercept) 0.294923 0.045838
## aat.(Intercept) -0.009291 0.068102
## aag.(Intercept) 0.193000 0.076732
## ccg.psi1 -0.238836 0.012265
## aat.psi1 -0.058937 0.051478
## aag.psi1 -0.197627 0.197277
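To place these estimates on the ratio scale described earlier, one can exponentiate a psi coefficient. For example, using the ccg.psi estimate printed above (a quick hand calculation, not a qgcomp function):
# factor by which the ratio P(ycat = "ccg") / P(ycat = "cct") changes
# per one-quartile increase in all exposures
exp(-0.325554) # approximately 0.72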
Some convenience functions, such as plot and summary, are available for multinomial fits, and more functionality will be available in the future.
Often, we will want to apply sampling weights to data to make inference to a population that is either fully external to the analytic data or represents a larger population of which the analytic data are a subset. Weighting (e.g. sampling weights or inverse-odds weights) may be used for such inference. This is common in analyses of National Health and Nutrition Examination Survey (NHANES) data, which provide a rich source of data for the analysis of mixtures. Weighting is done using the "weights" parameter in calls to qgcomp methods. Notably, the default in qgcomp.*.noboot interprets these as frequency weights, in which one individual represents multiple known individuals with identical data. Sampling weights, however, are not frequency weights in the strict sense and should be handled differently. Survey-weighting methods can be used in general, but within the qgcomp package the appropriate functions are the qgcomp.*.ee functions, which use robust standard errors that account for the uncertainty due to sampling weights. Bootstrapping is another alternative.
### Simulate sampling weights and fit weighted and unweighted models ####
set.seed(12321)
metals$samplingweights = exp(rnorm(nrow(smallmetals), -0.5, 1))
uw = qgcomp.glm.noboot(y~., expnms = Xnm,q=4, data = metals[,c(Xnm, covars, "y")])
wtd = qgcomp.glm.noboot(y~., expnms = Xnm,q=4, data = metals[,c(Xnm, covars, "y")], weights=metals$samplingweights)
wtd2 = qgcomp.glm.ee(y~., expnms = Xnm,q=4, data = metals[,c(Xnm, covars, "y")], weights=metals$samplingweights)
#wtd3 = qgcomp.glm.boot(y~., expnms = Xnm,q=4, data = metals[,c(Xnm, covars, "y")], weights=metals$samplingweights)
When carrying out data analysis using quantile g-computation, one can address missing data in much the same way as in standard regression analyses. A common approach is complete case analysis. While regression functions in R will automatically carry out complete case analyses when variables take on the value NA (denoting missingness in R), when using quantile g-computation it is encouraged that one explicitly create the complete case dataset and use that dataset throughout. Using base R, this can be accomplished with the complete.cases function.
The reason for this recommendation is that, while the regression analysis will be performed on complete cases (i.e. observations with non-missing values for all variables in the model), the calculation of quantiles for each exposure is done on an exposure-by-exposure basis. This can lead to numerical differences between explicitly using a dataset restricted to complete cases and relying on the automatic removal of observations with NA values during analysis.
Here is an artificial example that demonstrates the differences. Three analyses are run: one with the full data, one with complete case data (complete case analysis #1), and one with data in which arsenic has been randomly set to NA (complete case analysis #2).
There are numeric differences between the two complete case analyses, which can be traced to differences in the "quantized" exposures (other than arsenic) between the two approaches; these can be inspected via the qx data frame that is part of a qgcompfit object.
Xnm <- c(
'arsenic','barium','cadmium','calcium','chromium','copper',
'iron','lead','magnesium','manganese','mercury','selenium','silver',
'sodium','zinc'
)
covars = c('nitrate','nitrite','sulfate','ph', 'total_alkalinity','total_hardness')
asmiss = metals
set.seed(1232)
asmiss$arsenic = ifelse(runif(nrow(metals))>0.7, NA, asmiss$arsenic)
cc = asmiss[complete.cases(asmiss[,c(Xnm, covars, "y")]),] # complete.cases gives a logical index to subset rows
dim(metals)
## [1] 452 28
dim(cc) # rows with complete exposure, covariate, and outcome data
## [1] 320 28
Here we have results from the full data (for comparison purposes)
qc.base <- qgcomp.glm.noboot(y~.,expnms=Xnm, dat=metals[,c(Xnm, covars, 'y')], family=gaussian())
cat("Full data\n")
## Full data
qc.base
## Scaled effect size (positive direction, sum of positive coefficients = 0.441)
## calcium barium iron arsenic silver chromium cadmium zinc
## 0.76111 0.06133 0.05854 0.02979 0.02433 0.02084 0.01395 0.01195
## mercury sodium
## 0.01130 0.00688
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.104)
## copper magnesium lead manganese selenium
## 0.44654 0.42124 0.09012 0.03577 0.00631
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.460408 0.134007 -0.72306 -0.19776 -3.4357 0.0006477
## psi1 0.337539 0.089358 0.16240 0.51268 3.7774 0.0001805
Here we have results from a complete case analysis in which we have set some exposures to be missing and have explicitly excluded data that will be dropped from analysis.
qc.cc <- qgcomp.glm.noboot(y~.,expnms=Xnm, dat=cc[,c(Xnm, covars, 'y')], family=gaussian())
cat("Complete case analyses\n")
## Complete case analyses
## #1 explicitly remove observations with missing values
## Scaled effect size (positive direction, sum of positive coefficients = 0.48)
## calcium iron zinc chromium silver lead barium arsenic
## 0.78603 0.07255 0.04853 0.03315 0.02555 0.01405 0.01285 0.00728
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.117)
## copper magnesium cadmium mercury manganese sodium selenium
## 0.52200 0.23808 0.09082 0.06570 0.04105 0.03921 0.00315
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.509402 0.145471 -0.79452 -0.22428 -3.5017 0.0005315
## psi1 0.362582 0.096696 0.17306 0.55210 3.7497 0.0002119
Finally we have results from a complete case analysis in which we have set some exposures to be missing, but we rely on R’s automated dropping of observations with missing values.
qc.cc2 <- qgcomp.glm.noboot(y~.,expnms=Xnm, dat=asmiss[,c(Xnm, covars, 'y')], family=gaussian())
cat(" #1 rely on R handling of NA values\n")
## #1 rely on R handling of NA values
## Scaled effect size (positive direction, sum of positive coefficients = 0.493)
## calcium iron zinc chromium barium silver lead selenium
## 0.76495 0.07191 0.04852 0.03489 0.02151 0.02119 0.01912 0.01444
## arsenic
## 0.00348
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.12)
## copper magnesium sodium manganese cadmium mercury
## 0.5338 0.2139 0.0965 0.0853 0.0501 0.0204
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.51673 0.15001 -0.81074 -0.22273 -3.4447 0.0006520
## psi1 0.37249 0.10014 0.17621 0.56876 3.7196 0.0002376
Now we can see a reason for the discrepancy between the methods above: when relying on R to drop observations with missing values, the quantiles for each exposure are calculated from all non-missing values of that exposure. In the complete case data, the quantiles are calculated only among observations with complete data on all variables. The latter will generally be preferred, because the quantiles for each exposure are then calculated on the same sample of individuals.
# calculation of arsenic quantiles is identical
all.equal(qc.cc$qx$arsenic_q, qc.cc2$qx$arsenic_q[complete.cases(qc.cc2$qx$arsenic_q)])
## [1] TRUE
# but quantiles of other exposures (e.g. cadmium) differ between the two approaches
all.equal(qc.cc$qx$cadmium_q, qc.cc2$qx$cadmium_q[complete.cases(qc.cc2$qx$arsenic_q)])
## [1] "Mean relative difference: 0.3823529"
A common form of missing data in mixtures is exposure values that are missing because they fall below the limit of detection (LOD). A common approach to such missing data is imputation, either by filling in small numeric values in place of the missing values or through formal multiple imputation from a parametric model. Notably, with quantile g-computation, if the proportion of values below the limit of detection is below 1/q (where q is the number of quantiles), all appropriate missing data approaches will yield the same answer. Thus, if one has 3 exposures, each with 10% of values below the limit of detection, one can impute small values below those limits (e.g. the limit of detection divided by the square root of 2) and proceed with quantile g-computation on the imputed data. This approach leverages the fact that, even if a value below the limit of detection cannot be known with certainty, the quantile category used in quantile g-computation is known with certainty. In cases where more than 1/q of the measurements fall below the LOD, the qgcomp package includes a convenience function, mice.impute.leftcenslognorm, that can be used to impute values below the limit of detection from a left-censored log-normal regression model.
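Here is a minimal sketch of the simple substitution approach just described. The LOD value and the non-detect coding are hypothetical (the metals data as distributed do not flag non-detects), so this is illustrative only:
# hypothetical: suppose non-detects for lead were recorded as NA and the LOD were 0.05
lod_lead <- 0.05
metals_imp <- metals
nd <- is.na(metals_imp$lead) # indicator of (hypothetical) non-detects
metals_imp$lead[nd] <- lod_lead / sqrt(2) # single imputation below the LOD
# proceed with quantile g-computation as usual
qgcomp.glm.noboot(y ~ ., expnms = Xnm, data = metals_imp[, c(Xnm, covars, "y")],
                  family = gaussian(), q = 4)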
Multiple imputation uses multiple datasets with different imputed values for each missing value across datasets. Separate analyses are performed on each of these datasets, and the results are combined using standard rules (a sketch of this pooling step is given below). The function mice.impute.leftcenslognorm can be interfaced with the mice package for efficient programming of multiple-imputation-based analyses with quantile g-computation. Examples cannot be included here without explicitly installing the mice package, but an example can be seen in the help file for mice.impute.leftcenslognorm.
Alexander P. Keil, Jessie P. Buckley, Katie M. O'Brien, Kelly K. Ferguson, Shanshan Zhao, Alexandra J. White. A quantile-based g-computation approach to addressing the effects of exposure mixtures. https://doi.org/10.1289/EHP5838