qgcomp is a package to implement g-computation for analyzing the effects of exposure mixtures. Quantile g-computation yields estimates of the effect of increasing all exposures by one quantile, simultaneously. Thus, it estimates a "mixture effect" useful in the study of exposure mixtures such as air pollution, diet, and water contamination.
Using terminology from methods developed for causal effect estimation, quantile g-computation estimates the parameters of a marginal structural model that characterizes the change in the expected potential outcome given a joint intervention on all exposures, possibly conditional on confounders. Under the assumptions of exchangeability, causal consistency, positivity, no interference, and correct model specification, this model yields a causal effect for an intervention on the mixture as a whole. While these assumptions may not be met exactly, they provide a useful road map for how to interpret the results of a qgcomp fit, and where efforts should be spent in terms of ensuring accurate model specification and selection of exposures that are sufficient to control co-pollutant confounding.
Say we have an outcome $Y$, some exposures $\mathbf{X}$, and possibly some other covariates (e.g. potential confounders) denoted by $\mathbf{Z}$.
The basic model of quantile g-computation is a joint marginal structural model given by

$$\mathbb{E}(Y^{\mathbf{X}_q} \mid \mathbf{Z}, \psi, \eta) = g(\psi_0 + \psi_1 S_q + \boldsymbol{\eta}\mathbf{Z})$$
where $g(\cdot)$ is a link function in a generalized linear model (e.g. the inverse logit function in the case of a logistic model for the probability that $Y = 1$), $\psi_0$ is the model intercept, $\boldsymbol{\eta}$ is a set of model coefficients for the covariates, and $S_q$ is an "index" that represents a joint value of exposures. Quantile g-computation (by default) transforms all exposures $\mathbf{X}$ into $\mathbf{X}_q$, which are "scores" taking on discrete values 0, 1, 2, etc., each representing a categorical "bin" of exposure. By default, there are four bins with evenly spaced quantile cutpoints for each exposure, so $X_q = 0$ means that $X$ was below the observed 25th percentile for that exposure. The index $S_q$ represents all exposures being set to the same value (again, by default, discrete values 0, 1, 2, 3). Thus, the parameter $\psi_1$ quantifies the expected change in the outcome, given a one-quantile increase in all exposures simultaneously, possibly adjusted for $\mathbf{Z}$.
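As an aside, the quantizing step is easy to mimic in base R; the sketch below is an illustration rather than the package's internal code (qgcomp also exports a quantize() function for this purpose), and converts a continuous exposure into q = 4 scores:
# illustration: convert a continuous exposure into quantile scores 0, 1, 2, 3
set.seed(1)
x <- rnorm(100)
xq <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 5)),
          include.lowest = TRUE, labels = FALSE) - 1
table(xq) # roughly 25 observations per score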
There are nuances to this particular model form that are available in the qgcomp package, which will be explored below. There exists one special case of quantile g-computation that leads to fast fitting: linear/additive exposure effects. Here we simulate "pre-quantized" data where the exposures $X_1, X_2, X_3$ can only take on values 0, 1, 2, 3 in equal proportions. The model underlying the outcomes is given by the linear regression:

$$\mathbb{E}(Y \mid \mathbf{X}, \beta) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$

with the true values $\beta_0 = 0$, $\beta_1 = 0.25$, $\beta_2 = -0.1$, $\beta_3 = 0.05$; $X_1$ is strongly positively correlated with $X_2$ ($\rho = 0.95$) and negatively correlated with $X_3$ ($\rho = -0.3$). In this simple setting, the parameter $\psi_1$ will equal the sum of the $\beta$ coefficients (0.2). Here we see that qgcomp estimates a value very close to 0.2 (as sample size increases, the estimate is expected to become increasingly close to 0.2).
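Explicitly, under linearity and additivity the mixture slope is just the coefficient sum:

$$\psi_1 = \beta_1 + \beta_2 + \beta_3 = 0.25 - 0.1 + 0.05 = 0.2$$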
library("qgcomp")
set.seed(543210)
qdat = simdata_quantized(n=5000, outcometype="continuous", cor=c(.95, -0.3), b0=0, coef=c(0.25, -0.1, 0.05), q=4)
head(qdat)
## x1 x2 x3 y
## 1 2 2 3 0.5539253
## 2 2 3 1 -0.1005701
## 3 2 2 3 0.3782267
## 4 1 2 2 -0.4557419
## 5 0 0 1 0.6124436
## 6 1 1 2 0.7569574
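The correlation matrix below can be reproduced with a call like this (a reconstruction; the call itself is not shown in this extract):
cor(qdat[,c("x1", "x2", "x3")])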
## x1 x2 x3
## x1 1.00000 0.95552 -0.30368
## x2 0.95552 1.00000 -0.29840
## x3 -0.30368 -0.29840 1.00000
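The mixture slope output below is the default print of a quantile g-computation fit to these simulated data, presumably from a call along these lines:
qgcomp.glm.noboot(y ~ ., expnms = c("x1", "x2", "x3"), data = qdat, family = gaussian(), q = 4)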
## Scaled effect size (positive direction, sum of positive coefficients = 0.3)
## x1 x3
## 0.775 0.225
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.0966)
## x2
## 1
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.0016583 0.0359609 -0.07214 0.068824 -0.0461 0.9632
## psi1 0.2035810 0.0219677 0.16053 0.246637 9.2673 <2e-16
How to use the qgcomp package

Here we use a running example from the metals dataset from the qgcomp package to demonstrate some features of the package and method.
Namely, the examples below demonstrate use of the package for:

1. Fast estimation of exposure effects under a linear model for quantized exposures for continuous (normal) outcomes
2. Estimating conditional and marginal odds/risk ratios of a mixture effect for binary outcomes
3. Adjusting for non-exposure covariates when estimating effects of the mixture
4. Allowing non-linear and non-homogeneous effects of individual exposures and the mixture as a whole by including product terms
5. Using qgcomp to fit a time-to-event model to estimate conditional and marginal hazard ratios for the exposure mixture
For analogous approaches to estimating exposure mixture effects, illustrative examples can be seen in the gWQS package help files, which implements weighted quantile sum (WQS) regression, and at https://jenfb.github.io/bkmr/overview.html, which describes Bayesian kernel machine regression.
The metals dataset from the qgcomp package comprises a set of simulated well water exposures and two health outcomes (one continuous, one binary/time-to-event). The exposures are transformed to have mean = 0.0 and standard deviation = 1.0. The data are used throughout to demonstrate usage and features of the qgcomp package.
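The rows printed below presumably come from a call like the following (the call itself is not shown in this extract):
data("metals", package = "qgcomp")
head(metals)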
## arsenic barium cadmium calcium chloride chromium
## 1 0.09100165 0.08166362 15.0738845 -0.7746662 -0.15408335 -0.05589104
## 2 0.17018302 -0.03598828 -0.7126486 -0.6857886 -0.19605499 -0.03268488
## 3 0.13336869 0.09934014 0.6441992 -0.1525231 -0.17511844 -0.01161098
## 4 -0.52570747 -0.76616263 -0.8610256 1.4472733 0.02552401 -0.05173287
## 5 0.43420529 0.40629920 0.0570890 0.4103682 -0.24187403 -0.08931824
## 6 0.71832662 0.19559582 -0.6823437 -0.8931696 -0.03919936 -0.07389407
## copper iron lead magnesium manganese mercury
## 1 1.99438050 19.1153352 21.072630908 -0.5109546 2.07630966 -1.20826726
## 2 -0.02490169 -0.2039425 -0.010378362 -0.1030542 -0.36095395 -0.68729723
## 3 0.25700811 -0.1964581 -0.063375935 0.9166969 -0.31075240 0.44852503
## 4 0.75477075 -0.2317787 -0.002847991 2.5482987 -0.23350205 0.20428158
## 5 -0.09919923 -0.1698619 -0.035276281 -0.5109546 0.08825996 1.19283834
## 6 -0.05622285 -0.2129300 -0.118460981 -1.0059145 -0.30219838 0.02875033
## nitrate nitrite ph selenium silver sodium
## 1 1.3649492 -1.0500539 -0.7125482 0.23467592 -0.8648653 -0.41840695
## 2 -0.1478382 0.4645119 0.9443009 0.65827253 -0.8019173 -0.09112969
## 3 -0.3001660 -1.4969868 0.4924330 0.07205576 -0.3600140 -0.11828963
## 4 0.3431814 -0.6992263 -0.4113029 0.23810705 1.3595205 -0.11828963
## 5 0.0431269 -0.5041390 0.3418103 -0.02359910 -1.6078044 -0.40075299
## 6 -0.3986575 0.1166249 1.2455462 -0.61186017 1.3769466 1.83722597
## sulfate total_alkalinity total_hardness zinc mage35 y
## 1 -0.1757544 -1.31353389 -0.85822417 1.0186058 1 -0.6007989
## 2 -0.1161359 -0.12699789 -0.67749970 -0.1509129 0 -0.2022296
## 3 -0.1616806 0.42671890 0.07928399 -0.1542524 0 -1.2164116
## 4 0.8272415 0.99173604 1.99948142 0.1843372 0 0.1826311
## 5 -0.1726845 -0.04789549 0.30518957 -0.1529379 0 1.1760472
## 6 -0.1385631 1.98616621 -1.07283447 -0.1290391 0 -0.4100912
## disease_time disease_state
## 1 6.168764e-07 1
## 2 4.000000e+00 0
## 3 4.000000e+00 0
## 4 4.000000e+00 0
## 5 1.813458e+00 1
## 6 2.373849e+00 1
# we save the names of the mixture variables in the variable "Xnm"
Xnm <- c(
'arsenic','barium','cadmium','calcium','chromium','copper',
'iron','lead','magnesium','manganese','mercury','selenium','silver',
'sodium','zinc'
)
covars = c('nitrate','nitrite','sulfate','ph', 'total_alkalinity','total_hardness')
# Example 1: linear model
# Run the model and save the results "qc.fit"
system.time(qc.fit <- qgcomp.glm.noboot(y~.,dat=metals[,c(Xnm, 'y')], family=gaussian()))
## Including all model terms as exposures of interest
## user system elapsed
## 0.008 0.000 0.008
# user system elapsed
# 0.011 0.002 0.018
# contrasting other methods with computational speed
# WQS regression (v3.0.1 of gWQS package)
#system.time(wqs.fit <- gWQS::gwqs(y~wqs,mix_name=Xnm, data=metals[,c(Xnm, 'y')], family="gaussian"))
# user system elapsed
# 35.775 0.124 36.114
# Bayesian kernel machine regression (note that the number of iterations here would
# need to be >5,000, at minimum, so this underestimates the run time by a factor
# of 50+)
#system.time(bkmr.fit <- kmbayes(y=metals$y, Z=metals[,Xnm], family="gaussian", iter=100))
# user system elapsed
# 81.644 4.194 86.520
First, note that qgcomp can be very fast relative to competing methods (the example times above are from single runs on a laptop). One advantage of quantile g-computation over other methods that estimate "mixture effects" (the effect of changing all exposures at once) is that it is very computationally efficient. Compared with methods such as WQS regression (gWQS package) and Bayesian kernel machine regression (bkmr package), quantile g-computation can provide results many orders of magnitude faster. For example, the example above ran about 3000 times faster for quantile g-computation versus WQS regression, and we estimate the speedup would be several hundred thousand times versus Bayesian kernel machine regression.
The speed relies on an efficient method to fit qgcomp when exposures enter the model additively. When exposures enter via non-linear or non-additive terms (see below for examples), qgcomp will be slower, but often still faster than competing approaches.
Quantile g-computation yields fixed weights in the estimation
procedure, similar to WQS regression. However, note that the weights
from qgcomp.glm.noboot
can be negative or positive. When
all effects are linear and in the same direction (“directional
homogeneity”), quantile g-computation is equivalent to weighted quantile
sum regression in large samples.
The overall mixture effect from quantile g-computation (psi1) is interpreted as the effect on the outcome of increasing every exposure by one quantile, possibly conditional on covariates. Given the overall exposure effect, the weights are considered fixed and so do not have confidence intervals or p-values.
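The weights and mixture slope shown below come from printing the fitted object; the print statement is not shown in this extract, but is presumably just:
qc.fit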
## Scaled effect size (positive direction, sum of positive coefficients = 0.39)
## calcium iron barium silver arsenic mercury sodium chromium
## 0.72216 0.06187 0.05947 0.03508 0.03447 0.02451 0.02162 0.02057
## cadmium zinc
## 0.01328 0.00696
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.124)
## magnesium copper lead manganese selenium
## 0.475999 0.385299 0.074019 0.063828 0.000857
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.356670 0.107878 -0.56811 -0.14523 -3.3062 0.0010238
## psi1 0.266394 0.071025 0.12719 0.40560 3.7507 0.0002001
Now let’s take a brief look under the hood. qgcomp
works
in steps. First, the exposure variables are “quantized” or turned into
score variables based on the total number of quantiles from the
parameter q
. You can access these via the qx
object from the qgcomp
fit object.
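For example (a reconstruction; the call is not shown in this extract):
head(qc.fit$qx)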
## arsenic_q barium_q cadmium_q calcium_q chromium_q copper_q iron_q lead_q
## 1 2 2 3 0 1 3 3 3
## 2 2 2 0 0 2 2 1 2
## 3 2 2 3 1 3 3 1 1
## 4 1 0 0 3 1 3 0 2
## 5 2 3 2 3 0 1 2 2
## 6 3 2 0 0 0 2 1 1
## magnesium_q manganese_q mercury_q selenium_q silver_q sodium_q zinc_q
## 1 1 3 0 2 0 0 3
## 2 2 0 1 3 1 3 0
## 3 3 1 2 2 1 3 0
## 4 3 1 2 2 3 3 3
## 5 1 3 3 1 0 0 0
## 6 0 1 2 0 3 3 2
You can re-fit a linear model using these quantized exposures. This is the “underlying model” of a qgcomp fit.
# regression with quantized data
qc.fit$qx$y = qc.fit$fit$data$y # first bring outcome back into the quantized data
newfit <- lm(y ~ arsenic_q + barium_q + cadmium_q + calcium_q + chromium_q + copper_q +
iron_q + lead_q + magnesium_q + manganese_q + mercury_q + selenium_q +
silver_q + sodium_q + zinc_q, data=qc.fit$qx)
newfit
##
## Call:
## lm(formula = y ~ arsenic_q + barium_q + cadmium_q + calcium_q +
## chromium_q + copper_q + iron_q + lead_q + magnesium_q + manganese_q +
## mercury_q + selenium_q + silver_q + sodium_q + zinc_q, data = qc.fit$qx)
##
## Coefficients:
## (Intercept) arsenic_q barium_q cadmium_q calcium_q chromium_q
## -0.3566699 0.0134440 0.0231929 0.0051773 0.2816176 0.0080231
## copper_q iron_q lead_q magnesium_q manganese_q mercury_q
## -0.0476113 0.0241278 -0.0091464 -0.0588190 -0.0078872 0.0095572
## selenium_q silver_q sodium_q zinc_q
## -0.0001059 0.0136815 0.0084302 0.0027125
Here you can see that, for a GLM in which all quantized exposures enter linearly and additively into the underlying model, the overall effect from qgcomp is simply the sum of the adjusted coefficients from the underlying model.
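The two results below (the sum of the exposure coefficients from the refit model, and the coefficients of the qgcomp fit) were presumably produced by calls like these:
sum(newfit$coefficients[-1]) # sum of exposure coefficients, omitting the intercept
coef(qc.fit)                 # (intercept) and psi1 from the qgcomp fit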
## [1] 0.2663942
## (intercept) psi1
## -0.3566699 0.2663942
This equality is why we can fit qgcomp so efficiently under such a
model, but qgcomp
is a much more general method that can
allow for non-linearity and non-additivity in the underlying model, as
well as non-linearity in the overall model. These extensions are
described in some of the following examples.
Example 2: conditional odds ratio, marginal odds ratio in a logistic model

This example introduces the use of a binary outcome in qgcomp via the qgcomp.glm.noboot function, which yields a conditional odds ratio, or the qgcomp.glm.boot function, which yields a marginal odds ratio or risk/prevalence ratio. These will not equal each other when there are non-exposure covariates (e.g. confounders) included in the model, because the odds ratio is not collapsible (both are still valid). Marginal parameters yield estimates of the population average exposure effect, which is often of more interest due to better interpretability over conditional odds ratios. Further, odds ratios are not generally of interest when risk ratios can be validly estimated, so qgcomp.glm.boot will estimate the risk ratio by default for binary data (set rr=FALSE to allow estimation of odds ratios when using qgcomp.glm.boot).
qc.fit2 <- qgcomp.glm.noboot(disease_state~., expnms=Xnm,
data = metals[,c(Xnm, 'disease_state')], family=binomial(),
q=4)
qcboot.fit2 <- qgcomp.glm.boot(disease_state~., expnms=Xnm,
data = metals[,c(Xnm, 'disease_state')], family=binomial(),
q=4, B=10,# B should be 200-500+ in practice
seed=125, rr=FALSE)
qcboot.fit2b <- qgcomp.glm.boot(disease_state~., expnms=Xnm,
data = metals[,c(Xnm, 'disease_state')], family=binomial(),
q=4, B=10,# B should be 200-500+ in practice
seed=125, rr=TRUE)
Compare a qgcomp.glm.noboot fit:
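The output below is the default print of the conditional (noboot) fit, reconstructed here since the print statement is not shown in this extract:
qc.fit2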
## Scaled effect size (positive direction, sum of positive coefficients = 0.392)
## barium zinc chromium magnesium silver sodium
## 0.3520 0.2002 0.1603 0.1292 0.0937 0.0645
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.696)
## selenium copper arsenic calcium manganese cadmium mercury lead
## 0.2969 0.1627 0.1272 0.1233 0.1033 0.0643 0.0485 0.0430
## iron
## 0.0309
##
## Mixture log(OR) (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## (Intercept) 0.26362 0.51615 -0.74802 1.27526 0.5107 0.6095
## psi1 -0.30416 0.34018 -0.97090 0.36258 -0.8941 0.3713
with a qgcomp.glm.boot fit:
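Again, presumably the default print of the fitted object:
qcboot.fit2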
## Mixture log(OR) (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## (Intercept) 0.26362 0.39038 -0.50151 1.02875 0.6753 0.4995
## psi1 -0.30416 0.28009 -0.85314 0.24481 -1.0859 0.2775
with a qgcomp.glm.boot fit, where the risk/prevalence ratio is estimated, rather than the odds ratio:
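And likewise:
qcboot.fit2b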
## Mixture log(RR) (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI Z value Pr(>|z|)
## (Intercept) -0.56237 0.16106 -0.87804 -0.24670 -3.4917 0.00048
## psi1 -0.16373 0.13839 -0.43497 0.10752 -1.1830 0.23679
In the following code we run a maternal age-adjusted linear model with qgcomp (family = "gaussian"). Further, we plot both the weights and the mixture slope, which yields overall model confidence bounds, representing the bounds that, for each value of the joint exposure, are expected to contain the true regression line over 95% of trials (so-called 95% 'pointwise' bounds for the regression line). The pointwise comparison bounds, denoted by error bars on the plot, represent comparisons of the expected difference in outcomes at each quantile, with reference to a specific quantile (which can be specified by the user, as below). These pointwise bounds are similar to the bounds created in the bkmr package when plotting the overall effect of all exposures. The pointwise bounds can be obtained via the pointwisebound.boot function. To avoid confusion between "pointwise regression" and "pointwise comparison" bounds, the pointwise regression bounds are denoted as the "model confidence band" in the plots, since they yield estimates of the same type of bounds as the predict function in R when applied to linear model fits.
Note that the underlying regression model is fit on the quantile 'score', which takes on integer values 0, 1, …, q-1. For plotting purposes (when plotting regression line results from qgcomp.glm.boot), the quantile score is translated into a quantile (range = [0, 1]). This is not a perfect correspondence, because the quantile g-computation model treats the quantile score as a continuous variable, while each quantile category comprises a range of quantiles. For visualization, we fix the ends of the plot at the midpoints of the first and last quantile categories, so the range of the plot will change slightly if 'q' is changed.
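The score-to-midpoint mapping just described is easy to compute directly; this sketch reproduces the quantile.midpoint values (0.125, 0.375, 0.625, 0.875) that appear in the output further below when q=4:
# map quantile scores 0, ..., q-1 to plotted quantile midpoints
q <- 4
(0:(q - 1) + 0.5)/q
## [1] 0.125 0.375 0.625 0.875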
qc.fit3 <- qgcomp.glm.noboot(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride +
chromium + copper + iron + lead + magnesium + manganese +
mercury + selenium + silver + sodium + zinc,
expnms=Xnm,
metals, family=gaussian(), q=4)
qc.fit3
## Scaled effect size (positive direction, sum of positive coefficients = 0.381)
## calcium barium iron silver arsenic mercury chromium zinc
## 0.74466 0.06636 0.04839 0.03765 0.02823 0.02705 0.02344 0.01103
## sodium cadmium
## 0.00775 0.00543
##
## Scaled effect size (negative direction, sum of negative coefficients = -0.124)
## magnesium copper lead manganese selenium
## 0.49578 0.35446 0.08511 0.06094 0.00372
##
## Mixture slope parameters (delta method CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.348084 0.108037 -0.55983 -0.13634 -3.2219 0.0013688
## psi1 0.256969 0.071459 0.11691 0.39703 3.5960 0.0003601
From the first plot we see weights from
qgcomp.glm.noboot
function, which include both positive and
negative effect directions. When the weights are all on a single side of
the null, these plots are easy to interpret, since the weight
corresponds to the proportion of the overall effect from each exposure.
WQS uses a constraint in the model to force all of the weights to be in
the same direction - unfortunately such constraints lead to biased
effect estimates. The qgcomp
package takes a different
approach and allows that “weights” might go in either direction,
indicating that some exposures may be beneficial, and some harmful, or
there may be sampling variation due to using small or moderate sample
sizes (or, more often, systematic bias such as unmeasured confounding).
The “weights” in qgcomp
correspond to the proportion of the
overall effect when all of the exposures have effects in the same
direction, but otherwise they correspond to the proportion of the effect
in a particular direction, which may be small (or large)
compared to the overall “mixture” effect. NOTE: the left and right sides
of the plot should not be compared with each other because the length of
the bars corresponds to the effect size only relative to other effects
in the same direction. The darkness of the bars corresponds to the
overall effect size - in this case the bars on the right (positive) side
of the plot are darker because the overall “mixture” effect is positive.
Thus, the shading allows one to make informal comparisons across the
left and right sides: a large, darkly shaded bar indicates a larger
independent effect than a large, lightly shaded bar.
qcboot.fit3 <- qgcomp.glm.boot(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride +
chromium + copper + iron + lead + magnesium + manganese +
mercury + selenium + silver + sodium + zinc,
expnms=Xnm,
metals, family=gaussian(), q=4, B=10,# B should be 200-500+ in practice
seed=125)
qcboot.fit3
## Mixture slope parameters (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.342787 0.114983 -0.56815 -0.11742 -2.9812 0.0030331
## psi1 0.256969 0.075029 0.10991 0.40402 3.4249 0.0006736
qcee.fit3 <- qgcomp.glm.ee(y ~ mage35 + arsenic + barium + cadmium + calcium + chloride +
chromium + copper + iron + lead + magnesium + manganese +
mercury + selenium + silver + sodium + zinc,
expnms=Xnm,
metals, family=gaussian(), q=4)
qcee.fit3
## Mixture slope parameters (robust CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.34279 0.10140 -0.54153 -0.14405 -3.3805 0.0007890
## psi1 0.25697 0.06771 0.12426 0.38968 3.7952 0.0001686
We can change the referent category for pointwise comparisons via the
pointwiseref
parameter:
## quantile quantile.midpoint linpred m se.pw ll.pw
## 1 0 0.125 -0.34278746 -0.34278746 0.10145237 -0.5416304
## 2 1 0.375 -0.08581809 -0.08581809 0.03749777 -0.1593124
## 3 2 0.625 0.17115127 0.17115127 0.04143146 0.0899471
## 4 3 0.875 0.42812064 0.42812064 0.10594353 0.2204751
## ul.pw ll.simul ul.simul
## 1 -0.14394447 -0.58955749 -0.097080902
## 2 -0.01232381 -0.17571151 0.004888375
## 3 0.25235544 0.07102589 0.271120767
## 4 0.63576615 0.17123926 0.681549112
plot(qcee.fit3, pointwiseref = 3, flexfit = FALSE, modelbound=TRUE)
plot(qcboot.fit3, pointwiseref = 3, flexfit = FALSE)
Using qgcomp.glm.boot
also allows us to assess linearity
of the total exposure effect (the second plot). Similar output is
available for WQS (gWQS
package), though WQS results will
generally be less interpretable when exposure effects are non-linear
(see below how to do this with qgcomp.glm.boot
and
qgcomp.glm.ee
).
The plot for the qcboot.fit3
object (using g-computation
with bootstrap variance) gives predictions at the joint intervention
levels of exposure. It also displays a smoothed (graphical) fit.
Note that the uncertainty intervals given in the plot are directly
accessible via the pointwisebound
(pointwise comparison
confidence intervals) and modelbound
functions (confidence
interval for the regression line):
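The two blocks of output below are, respectively, the pointwise comparison intervals and the model confidence band for qcboot.fit3; the calls are not shown in this extract, but are presumably:
# pointwise comparisons, with the 3rd quantile category as referent
pointwisebound.boot(qcboot.fit3, pointwiseref = 3)
# confidence band for the regression line
modelbound.boot(qcboot.fit3)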
## quantile quantile.midpoint linpred mean.diff se.diff ll.diff
## 0 0 0.125 -0.34278746 -0.5139387 0.15005846 -0.8080479
## 1 1 0.375 -0.08581809 -0.2569694 0.07502923 -0.4040240
## 2 2 0.625 0.17115127 0.0000000 0.00000000 0.0000000
## 3 3 0.875 0.42812064 0.2569694 0.07502923 0.1099148
## ul.diff ll.linpred ul.linpred
## 0 -0.2198295 -0.6368966 -0.04867828
## 1 -0.1099148 -0.2328727 0.06123650
## 2 0.0000000 0.1711513 0.17115127
## 3 0.4040240 0.2810660 0.57517523
## quantile quantile.midpoint linpred m se.pw ll.pw
## 0 0 0.125 -0.34278746 -0.34278746 0.11498301 -0.56815001
## 1 1 0.375 -0.08581809 -0.08581809 0.04251388 -0.16914376
## 2 2 0.625 0.17115127 0.17115127 0.04065143 0.09147593
## 3 3 0.875 0.42812064 0.42812064 0.11294432 0.20675384
## ul.pw ll.simul ul.simul
## 0 -0.117424905 -0.4622572 -0.161947578
## 1 -0.002492425 -0.1191843 -0.005233921
## 2 0.250826606 0.1386622 0.223888629
## 3 0.649487428 0.3081934 0.566961520
Because qgcomp estimates a joint effect of multiple exposures, we cannot, in general, assess model fit by overlaying predictions from the plots above with the data. Hence, it is useful to explore non-linearity by fitting models that allow for non-linear effects, as in the next example.
qgcomp (specifically the qgcomp.*.boot and qgcomp.*.ee methods) addresses non-linearity in a way similar to standard parametric regression models, which lends itself to leveraging R language features for non-linear parametric models (or, more precisely, parametric models that deviate from a purely additive, linear function on the link function basis via the use of basis function representations of non-linear functions). Here is an example where we use a feature of the R language for fitting models with interaction terms. We use y~. + .^2 as the model formula, which fits a model that includes a two-way product term for every pair of predictors in the model.
Note that both qgcomp.*.boot
(bootstrap) and
qgcomp.*.ee
(estimating equations) use standard methods for
g-computation, whereas the qgcomp.*.noboot
methods use a
fast algorithm that works under the assumption of linearity and
additivity of exposures (as described in the original paper on
quantile-based g-computation). The “standard” method of g-computation
with time-fixed exposures involves first fitting conditional models for
the outcome, making predictions from those models under set exposure
values, and then summarizing the predicted outcome distribution,
possibly by fitting a second (marginal structural) model.
qgcomp.*.boot
follows this three-step process, while
qgcomp.*.ee
leverages estimating equations (sometimes:
M-estimation) to estimate the parameters of the conditional and marginal
structural model simultaneously. qgcomp.*.ee
uses a
sandwich variance estimator, which is similar to GEE (generalized
estimating equation) approaches, and thus, when used correctly, can
yield inference for longitudinal data in the same way that GEE does. The
bootstrapping approach can also do this, but it takes longer. The
extension to longitudinal data is representative of the broader concept
that qgcomp.*.boot
and qgcomp.*.ee
can be used
in a broader number of settings than qgcomp.*.noboot
algorithms, but if one assumes linearity and additivity with no
clustering of observations, and conditional parameters are of interest,
then they are just a slower way to get equivalent results to
qgcomp.*.noboot
.
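To make those three steps concrete, here is a minimal sketch of "standard" g-computation done by hand for the linear example from earlier (an illustration only, not the package's internal code; it assumes qc.fit$qx, with the outcome merged in as above, and Xnm are available):
# 1) fit the conditional outcome model on the quantized exposures
qdat2 <- qc.fit$qx
cond.fit <- lm(y ~ ., data = qdat2)
# 2) predict the outcome with all exposures jointly set to each score 0-3
qnames <- paste0(Xnm, "_q")
msmdat <- do.call(rbind, lapply(0:3, function(s) {
  d <- qdat2
  d[, qnames] <- s
  data.frame(Sq = s, yhat = predict(cond.fit, newdata = d))
}))
# 3) summarize the predictions with a second (marginal structural) model;
#    under linearity/additivity the slope on Sq reproduces psi1
lm(yhat ~ Sq, data = msmdat)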
Below, we demonstrate a non-linear conditional fit (with a linear MSM) using the bootstrap approach. Similar approaches could be used to include interaction terms between exposures, as well as between exposures and covariates. Note this example is purposefully done incorrectly, as explained below.
qcboot.fit4 <- qgcomp(y~. + .^2,
expnms=Xnm,
metals[,c(Xnm, 'y')], family=gaussian(), q=4, B=10, seed=125)
plot(qcboot.fit4)
Note that allowing for a non-linear effect of all exposures (here, via non-additivity) induces an apparent non-linear trend in the overall exposure effect. The smoothed regression line is still well within the confidence bands of the marginal linear model (by default, the overall effect of joint exposure is assumed linear, though this assumption can be relaxed via the 'degree' parameter in qgcomp.glm.boot or qgcomp.glm.ee), as follows:
qcboot.fit5 <- qgcomp(y~. + .^2,
expnms=Xnm,
metals[,c(Xnm, 'y')], family=gaussian(), q=4, degree=2,
B=10, rr=FALSE, seed=125)
plot(qcboot.fit5)
qcee.fit5b <- qgcomp.glm.ee(y~. + .^2,
expnms=Xnm,
metals[,c(Xnm, 'y')], family=gaussian(), q=4, degree=2,
rr=FALSE)
plot(qcee.fit5b)
Note that some features are not available for the qgcomp.*.ee methods, which use estimating equations rather than maximum likelihood methods. Briefly, these methods allow assessment of uncertainty under non-linear (and other) scenarios where the qgcomp.*.noboot functions cannot, since the latter rely on the additivity and linearity assumptions to achieve speed. The qgcomp.*.ee methods will generally be faster than a bootstrapped version, but they are not used extensively here because they are the newest additions to the qgcomp package, and the bootstrapped versions can be made fast (but not accurate) by reducing the number of bootstraps. Where available, the qgcomp.*.ee methods are preferred to the qgcomp.*.boot versions for more stable and faster analyses when bootstrapping would otherwise be necessary.
Once again, we can access numerical estimates of uncertainty (answers
differ between the qgcomp.*.boot
and
qgcomp.*.ee
fits due to the small number of bootstrap
samples):
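The first two blocks below appear to be the model confidence band and the pointwise comparisons for the bootstrapped fit (qcboot.fit5), presumably from calls like those shown here; the third block is the corresponding pointwise comparison output for the estimating-equation fit (qcee.fit5b), whose call is not shown in this extract:
# confidence band for the regression line of the bootstrapped, degree-2 fit
modelbound.boot(qcboot.fit5)
# pointwise comparisons, with the lowest joint-exposure category as referent
pointwisebound.boot(qcboot.fit5, pointwiseref = 1)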
## quantile quantile.midpoint linpred m se.pw ll.pw
## 0 0 0.125 -0.89239044 -0.89239044 0.70335801 -2.2709468
## 1 1 0.375 -0.18559680 -0.18559680 0.09874085 -0.3791253
## 2 2 0.625 0.12180659 0.12180659 0.14624914 -0.1648365
## 3 3 0.875 0.02981974 0.02981974 0.83609757 -1.6089014
## ul.pw ll.simul ul.simul
## 0 0.486165922 -2.08813110 0.18308760
## 1 0.007931712 -0.32577422 -0.02886665
## 2 0.408449640 -0.06371183 0.43431431
## 3 1.668540859 -1.19007238 1.57263047
## quantile quantile.midpoint linpred mean.diff se.diff ll.diff
## 0 0 0.125 -0.89239044 0.0000000 0.0000000 0.0000000
## 1 1 0.375 -0.18559680 0.7067936 0.6165306 -0.5015841
## 2 2 0.625 0.12180659 1.0141970 0.6044460 -0.1704954
## 3 3 0.875 0.02981974 0.9222102 0.3537861 0.2288022
## ul.diff ll.linpred ul.linpred
## 0 0.000000 -0.8923904 -0.8923904
## 1 1.915171 -1.3939746 1.0227810
## 2 2.198889 -1.0628859 1.3064991
## 3 1.615618 -0.6635882 0.7232277
## quantile quantile.midpoint linpred mean.diff se.diff ll.diff
## 1 0 0.125 -0.89239044 0.0000000 0.0000000 0.0000000
## 2 1 0.375 -0.18559680 0.7067936 0.4922129 -0.2579259
## 3 2 0.625 0.12180659 1.0141970 0.9844258 -0.9152420
## 4 3 0.875 0.02981974 0.9222102 1.4766386 -1.9719484
## ul.diff
## 1 0.000000
## 2 1.671513
## 3 2.943636
## 4 3.816369
Ideally, the smooth fit will look very similar to the model prediction regression line.
As the output below shows, setting degree=2 yields a second parameter in the model fit ($\psi_2$). The output of qgcomp now corresponds to estimates of the marginal structural model given by

$$\mathbb{E}(Y^{\mathbf{X}_q}) = g(\psi_0 + \psi_1 S_q + \psi_2 S_q^2)$$
## Mixture slope parameters (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.89239 0.70336 -2.27095 0.48617 -1.2688 0.2055
## psi1 0.90649 0.93820 -0.93235 2.74533 0.9662 0.3347
## psi2 -0.19970 0.32507 -0.83682 0.43743 -0.6143 0.5395
so that $\psi_2$ can be interpreted similarly to quadratic terms that might appear in a generalized linear model: $\psi_2$ estimates the change in the outcome for an additional unit of squared joint exposure, over and above the linear effect given by $\psi_1$. This is one way of formally assessing specific types of non-linearity in the joint exposure-response curve, and there are many other (slightly incorrect but intuitively useful) ways of interpreting parameters for squared terms in regressions (beyond the scope of this document). Intuition from generalized linear models applies directly to the models fit by quantile g-computation.
Example 5: comparing model fits and further exploring non-linearity

Exploring a non-linear fit in settings with multiple exposures is challenging. One way to explore non-linearity, as demonstrated above, is to include all 2-way interaction terms (including quadratic terms, or "self-interactions"). Sometimes this approach is not desired, either because the number of terms in the model can become very large, or because some sort of model selection procedure is required, which risks inducing over-fit (biased estimates and standard errors that are too small). Short of having a set of a priori non-linear terms to include, we find it best to take a default approach (e.g. taking all second order terms) that doesn't rely on statistical significance, or to simply be honest that the search for a non-linear model is exploratory and shouldn't be relied upon for robust inference. Methods such as kernel machine regression may be good alternatives, or supplementary approaches, to exploring non-linearity.
NOTE: qgcomp necessarily fits a regression model with exposures that have a small number of possible values, based on the number of quantiles chosen. By package default, this is q=4
, but it is difficult to fully
examine non-linear fits using only four points, so we recommend
exploring larger values of q
, which will change effect
estimates (i.e. the model coefficient implies a smaller change in
exposures, so the expected change in the outcome will also
decrease).
Here, we examine one strategy for default and exploratory approaches to mixtures that can be implemented in qgcomp, using a smaller subset of exposures (iron, lead, cadmium), which we choose via the correlation matrix. High correlations between exposures may result from a common source, so small subsets of the mixture may be useful for examining hypotheses that relate to interventions on a common environmental source or set of behaviors. Note that we can still adjust for the remaining measured exposures, even though only 3 of our exposures of interest are considered as the mixture of interest. Note that we will require a new R package to help in exploring non-linearity: splines.
Note that qgcomp.glm.boot
must be used in order to produce
the graphics below, as qgcomp.glm.noboot
does not calculate
the necessary quantities.
The underlying conditional model fit can be made extremely flexible, and the graphical representation of this (via the smooth conditional fit) can look extremely flexible. Simply matching the overall (MSM) fit to this line is not a viable strategy for identifying parsimonious models because that would ignore potential for overfit. Thus, caution should be used when judging the accuracy of a fit when comparing the “smooth conditional fit” to the “MSM fit.”
qc.overfit <- qgcomp.glm.boot(y ~ bs(iron) + bs(cadmium) + bs(lead) +
mage35 + bs(arsenic) + bs(magnesium) + bs(manganese) + bs(mercury) +
bs(selenium) + bs(silver) + bs(sodium) + bs(zinc),
expnms=Xnm,
metals, family=gaussian(), q=8, B=10, degree=1)
qc.overfit
## Mixture slope parameters (bootstrap CI):
##
## Estimate Std. Error Lower CI Upper CI t value Pr(>|t|)
## (Intercept) -0.064420 0.151380 -0.36112 0.23228 -0.4255 0.6706
## psi1 0.029869 0.038970 -0.04651 0.10625 0.7665 0.4438
Here, there is little statistical evidence for even a linear trend, which makes the smoothed conditional fit appear to be overfit. The smooth conditional fit can be turned off, as below.
Note that these are included as examples of how to include non-linearities, and are not intended as a demonstration of appropriate model selection. In fact, qc.fit7b is generally a bad idea in small to moderate sample sizes due to large numbers of parameters.
qc.fit7a <- qgcomp.glm.boot(y ~ factor(iron) + lead + cadmium +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc,
expnms=newXnm,
metals, family=gaussian(), q=8, B=20, deg=2)
# underlying fit
summary(qc.fit7a$fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.052981109 0.08062430 -0.6571357 5.114428e-01
## factor(iron)1 0.046571725 0.09466914 0.4919420 6.230096e-01
## factor(iron)2 -0.056984300 0.09581659 -0.5947227 5.523395e-01
## factor(iron)3 0.143813131 0.09558055 1.5046276 1.331489e-01
## factor(iron)4 0.053319057 0.09642069 0.5529835 5.805601e-01
## factor(iron)5 0.303967859 0.09743261 3.1197753 1.930856e-03
## factor(iron)6 0.246259568 0.09734385 2.5297907 1.176673e-02
## factor(iron)7 0.447045591 0.09786201 4.5681217 6.425565e-06
## lead -0.009210341 0.01046668 -0.8799679 3.793648e-01
## cadmium -0.010503041 0.01086440 -0.9667387 3.342143e-01
## mage35 0.081114695 0.07274583 1.1150426 2.654507e-01
## arsenic 0.021755516 0.02605850 0.8348720 4.042502e-01
## magnesium -0.010758356 0.02469893 -0.4355798 6.633587e-01
## manganese 0.004418266 0.02551449 0.1731670 8.626011e-01
## mercury 0.003913896 0.02448078 0.1598763 8.730531e-01
## selenium -0.058085344 0.05714805 -1.0164012 3.100059e-01
## silver 0.020971562 0.02407397 0.8711302 3.841658e-01
## sodium -0.062086322 0.02404447 -2.5821454 1.014626e-02
## zinc 0.017078438 0.02392381 0.7138679 4.756935e-01
qc.fit7b <- qgcomp.glm.boot(y ~ factor(iron)*factor(lead) + cadmium +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc,
expnms=newXnm,
metals, family=gaussian(), q=8, B=10, deg=3)
# underlying fit
#summary(qc.fit7b$fit)$coefficients
plot(qc.fit7b)
qc.fit7c <- qgcomp.glm.boot(y ~ I(iron>4)*I(lead>4) + cadmium +
mage35 + arsenic + magnesium + manganese + mercury +
selenium + silver + sodium + zinc,
expnms=newXnm,
metals, family=gaussian(), q=8, B=10, deg=2)
# underlying fit
summary(qc.fit7c$fit)$coefficients
## Estimate Std. Error t value
## (Intercept) -5.910113e-02 0.05182385 -1.140423351
## I(iron > 4)TRUE 3.649940e-01 0.06448858 5.659824144
## I(lead > 4)TRUE -9.004067e-05 0.06181587 -0.001456595
## cadmium -6.874749e-03 0.01078339 -0.637531252
## mage35 7.613672e-02 0.07255110 1.049422029
## arsenic 2.042370e-02 0.02578001 0.792230124
## magnesium -3.279980e-03 0.02427513 -0.135116878
## manganese 1.055979e-02 0.02477453 0.426235507
## mercury 9.396898e-03 0.02435057 0.385900466
## selenium -4.337729e-02 0.05670006 -0.765030761
## silver 1.807248e-02 0.02391112 0.755819125
## sodium -5.537968e-02 0.02403808 -2.303831424
## zinc 2.349906e-02 0.02385762 0.984970996
## I(iron > 4)TRUE:I(lead > 4)TRUE -1.828835e-01 0.10277790 -1.779405131
## Pr(>|t|)
## (Intercept) 2.547332e-01
## I(iron > 4)TRUE 2.743032e-08
## I(lead > 4)TRUE 9.988385e-01
## cadmium 5.241120e-01
## mage35 2.945626e-01
## arsenic 4.286554e-01
## magnesium 8.925815e-01
## manganese 6.701456e-01
## mercury 6.997578e-01
## selenium 4.446652e-01
## silver 4.501639e-01
## sodium 2.169944e-02
## zinc 3.251821e-01
## I(iron > 4)TRUE:I(lead > 4)TRUE 7.586670e-02
Note one restriction on exploring non-linearity: while we can use
flexible functions such as splines for individual exposures, the overall
fit is limited via the degree
parameter to polynomial
functions (here a quadratic polynomial fits the non-linear model well,
and a cubic polynomial fits the non-linear/non-homogeneous model well -
though this is an informal argument and does not account for the wide
confidence intervals). We note here that only 10 bootstrap iterations
are used to calculate confidence intervals (to increase computational
speed for the example), which is far too low.
Why don't I get weights from the boot or ee functions? (and other questions about the weights/scaled effect sizes)

Users often use the qgcomp.*.boot
or
qgcomp.*.ee
functions because they want to marginalize over
confounders or fit a non-linear joint exposure function. In both cases,
the overall exposure response will no longer correspond to a simple
weighted average of model coefficients, so none of the
qgcomp.*.boot
or qgcomp.*.ee
functions will
calculate weights. In most use cases, the weights would vary according
to which level of joint exposure you’re at, so it is not a
straightforward proposition to calculate them (and you may not wish to
report 4 sets of weights if you use the default q=4
). That
is, the contribution of each exposure to the overall effect will change
across levels of exposure if there is any non-linearity, which makes the
weights not useful as simple inferential tools (and, at best, an
approximation). If you wish to have an approximation, then use a
“noboot” method and report the weights from that along with a caveat
that they are not directly applicable to the results in which
non-linearity/non-additivity/marginalization-over-covariates is
performed (using boot methods). If you fit an otherwise linear model,
you can get weights from a qgcomp.*.noboot
which will be
very close to the weights you might get from a linear model fit via
qgcomp.*.boot
functions, but be explicit that the weights
come from a different model than the inference about joint exposure
effects.
It should be emphasized here that the weights are not a stable or
entirely useful quantity for many research goals. Qgcomp addresses the
mixtures problem of variance inflation by focusing on a parameter that
is less susceptible to variance inflation than independent effects (the
psi parameters, or overall effects of a mixture). The weights are a form
of independent effect and will always be sensitive to this issue,
regardless of the statistical method that is used. Some statistical approaches to improving estimation of independent effects (e.g. setting bayes=TRUE) are accessible in many qgcomp functions, but
these approaches universally introduce bias in exchange for reducing
variance and shouldn’t be used without a good understanding of what
shrinkage and penalization methods actually accomplish. Principled and
rigorous integration of these statistical approaches with qgcomp is in
progress, but that work is inherently more bespoke and likely will not
be available in this R package. The shrinkage and penalization
literature is large and outside the scope of this software and
documentation, so no other guidance is given here. In any case, the
calculated weights are only interpretable as proportional effect sizes
in a setting in which there is linearity, additivity, and
collapsibility, and so the package makes no efforts to try to introduce
weights into other settings in which those assumptions may not be met.
Outside of those narrow settings, the weights would have a dubious
interpretation and the programming underlying the qgcomp package errs on
the side of preventing the reporting of results that are mutually
inconsistent. If you are using the boot
versions of a
qgcomp function in a setting in which you know that the weights are
valid, it is very likely that you do not actually need to be using the
boot
versions of the functions.
Maybe. The inferential object of qgcomp is the set of ψ parameters that correspond to a joint exposure response. As it turns out, with correlated exposures, non-linearity can disguise itself as non-additivity (Belzak and Bauer [2019] Addictive Behaviors). If we were inferring independent effects, this distinction would be crucial, but for joint effects it may turn out that it doesn't matter much whether you model non-linearity in the joint response function through non-additivity or through non-linearity of individual exposures in a given study. Models fit in qgcomp still make the crucial assumption that you are able to model the joint exposure response via parametric models, so that assumption should not be forgotten in an effort to disentangle non-linearity (e.g. quadratic terms of exposures) from non-additivity (e.g. product terms between exposures). The important part to note about parametric modeling is that we have to explicitly tell the model to be non-linear; no adaptation to non-linear settings will happen automatically. Exploring non-linearity is not a trivial endeavor.
No. You can turn off “quantization” by setting q=NULL
or
you can supply your own categorization cutpoints via the “breaks”
argument. It is up to the user to interpret the results if either of
these options is taken. Frequently, q=NULL
is used in
concert with standardizing exposure variables by dividing them by their
interquartile ranges (IQR). The joint exposure response can then be
interpreted as the effect of an IQR change in all exposures. Using IQR/2 (with or without a log transformation beforehand) will yield results that are most (roughly) compatible with the package default (q=4) but without requiring quantization. Quantized
variables have nice properties: they prevent extrapolation and reduce
influence of outliers, but the choice of how to include exposures in the
model should be a deliberate and well-informed one. There are examples
of setting q=NULL
in the help files for qgcomp.glm.boot and
qgcomp.glm.ee, but this approach is available for any of the qgcomp
methods (and is accomplished nearly 100% outside of the package
functions, aside from setting q=NULL
).
Probably not in a scientific manuscript. If you find an idea here that is not published anywhere else and wish to develop it into a full manuscript, feel free! (But probably check with [email protected] to ask if a paper is already in development or is, perhaps, already published.)
The vignettes of the package and the help files of the functions give many, many examples of usage. Additionally, some edge cases and interesting applications are given in github gists available at https://gist.github.com/alexpkeil1. If you come up with an interesting problem that you think could be solved in this package, but currently is not, feel free to submit an issue on the package github page https://github.com/alexpkeil1/qgcomp/issues. Several additions to the package have already come about through that avenue.
Alexander P. Keil, Jessie P. Buckley, Katie M. O'Brien, Kelly K. Ferguson, Shanshan Zhao, Alexandra J. White (2020). A quantile-based g-computation approach to addressing the effects of exposure mixtures. Environmental Health Perspectives 128(4). https://doi.org/10.1289/EHP5838
The development of this package was supported by NIH Grant RO1ES02953101. Invaluable code testing has been performed by Nicole Niehoff, Michiel van den Dries, Emily Werder, Jessie Buckley, Barrett Welch, Che-Jung (Rong) Chang, various github users, and Katie O’Brien. Thank you to Hanna Jardel for contributing the “qgcomp.tobit.noboot” function.