Logistic Regression and Generalized Boosted Modelling in Inverse Probability ofTreatment Weighting: A Simulation and Case Study of Outpatients with CoronaryHeart Disease

Objectives: To compare the ability of logistic regression and generalized boosted modelling (GBM) to estimate treatment effects and balance covariates in inverse probability of treatment weighting (IPTW) and to explore the independent impact of different types of medical insurance on drug costs of outpatients with coronary heart disease based on this method. Methods: This study was used to evaluate the performance of logistic regression and GBM in IPTW under a Monte Carlo study with the simulated design of linear and nonlinear correlations between treatment variable and covariates of different sample sizes (n=500, 2000). The assessment indicators included average standardized absolute mean difference (ASAM), point estimation, bias, relative bias, standard error, mean square error, 95% confidence interval coverage rate, distribution of weights and an empirical study of outpatients with coronary heart disease was carried out after simulation. Results: The simulations show that GBM in propensity score weighting is superior to logistic regression in the lower bias and mean square error and it achieves better covariate balance, especially in nonlinear conditioning models. And in this case study. It’s found that GBM in IPTW has better ability to balance the confounding factors compared with logistic regression. The weighted results show that the drug costs of outpatients with coronary heart disease of Urban Employee Basic Medical Insurance increase by 256.35 Yuan on average compared with those of Urban-Rural Resident Basic Medical Insurance. Conclusion: It may be better to control confounding factors in case of the unknown relationship between the treatment variable and covariates by IPTW with GBM. There is still a certain gap in drug costs among different types of medical insurance for patients with coronary heart disease according to this study, which provides a reasonable scientific basis for the optimal allocation of medical insurance system and health resources in coronary heart disease.


Introduction
Although randomized controlled trials are regarded as the gold standard in research design, they are often not implemented or cannot accurately reflect the effect of interest in real experimental designs or social and health sciences due to ethics, time and cost constraints and other reasons [1,2]. A large quantity of observational data has become easily available with the development of hospital informatization. However, observational studies cannot control the confounding factors between treatment and non-treatment groups by means of random assignment. Therefore, selection bias often occurs when estimating the treatment effect. Traditional methods used to reduce bias in observational studies include regression models, matching and stratification, but these methods may not be suitable for studies with a large number of covariates [3][4][5]. Propensity score methods can transform multiple covariates (multi-dimensional) into propensity scores (one-dimensional), which can solve these problems. The application of propensity score methods includes mainly matching, stratification, weighting and regression adjustment [6][7][8][9]. Propensity score weighting has substantial advantages over matching and is used to analyse treatment effects by retaining all the individuals in the sample, which can significantly improve statistical efficiency. In recent years, propensity score weighting has been increasingly being used in the medical field [10][11][12].
Propensity scores are commonly estimated by logistic regression [13][14][15]. Modelling requires several assumptions related to selecting variables and specifying functional forms [16][17][18]. Therefore, The conditioning model for predicting propensity scores Logistic regression: The process of obtaining propensity scores with logistic regression is very simple and easy to understand and perform, but the model must be transformed into a linear model via an appropriate connection function such as the logit function [31]. However, when the relationship between covariates and transformation variables does not satisfy the linear hypothesis, the values of the propensity scores are often unreliable.

Sci Forschen
Generalized boosted modelling: GBM is a modern nonparametric boosting method based on a regression tree that can fit multiple models with a regression tree as a weak classifier and then use a boosting algorithm to merge the predictions from each model [32,33]. Regression trees can be applied to continuous, normal, ordered and missing independent variables to automatically obtain nonlinear relations and interactions [34]. In contrast to logistic regression, there is no need to set the functional form of the predictive variable. The boosting algorithm can combine many simple models into a more complex model. Compared with a simple regression tree, the boosting algorithm can obtain smooth fitting and good prediction and to a large extent, can avoid over fitting the data [35,36]. Thus, GBM is used to incorporate a large number of measured baseline covariates to fit a conditioning model. The formula for estimating a propensity score is ( ) log ( ( )) log( ( )/(1 ( ))) g X it e X e X e X = = − [37]. Specifically, GBM begins with a regression tree and sets as the initial value, where w is the mean of the treatment variable in the sample. Then, an adjustment function (X) h is found to improve the fitting degree of the model. (X) h can be in any form: here, it is the regression tree obtained by fitting the residuals of the currently estimated log ( ( )) it e X with X. log ( ( )) it e X is updated by by means of continuous iteration, and the estimation of the loglikelihood increases correspondingly with every iteration. λ is a shrinkage coefficient: generally, the smaller λ is, the smoother the fit is. The shrinkage coefficient ranges from 0 to 1. McCaffrey suggested stopping the number of iterations when the average standardized absolute mean difference (ASAM) on measured baseline covariates is minimized [37]. In this study, the shrinkage coefficient is 0.01, and the stopping rule is to minimize the ASAM across covariates.

Simulation design
The simulated structure described by Shenyang Guo and Mark W. Fraser in the book named "Propensity Score Analysis: Statistics Methods and Applications" was modified slightly in this study [1]. First, five baseline covariates ( 5 1 X X − ) were considered 1 X , 3 X and 5 X were continuous variables that followed a normal distribution with a mean vector (3 11 6) and standard deviation vector (0.3 4.0 2.5), and 2 X and 4 X were classified variables of a Bernoulli distribution. A binary treatment variable with two groups of treatment exposure was created, three variables ( 1 X , 2 X , and 5 X ) were used to generate the treatment variable (W ), and the probability of the treatment assignment for each group was defined as Pr( 1 ) 1/ (1 exp( ( ( , ) ))) . Function f changed with the variation of coefficient β in different situations, and the average treatment probability was approximately 0.5. Variables 1 X to 4 X were subjective decisions are often involved in finding the best model. Moreover, when one of these assumptions is wrong, covariate balance may be impossible to achieve by adjusting propensity scores, which may lead to bias in the estimation of the treatment effect [19,20]. In addition, logistic regression cannot address complex relationships such as non linear and interaction effects between covariates [21][22][23]. Generalized boosted modelling (GBM), one of the latest prediction methods in machine learning, is superior to logistic regression in addressing high-dimensional and missing data [1,24,25] and can be used as an alternative nonparametric method to achieve the same purpose but with fewer assumptions and more accurate results. Unfortunately, GBM is not widely used with propensity scores. Therefore, in the absence of hidden bias and different sample sizes, this study intends to construct linear and nonlinear conditioning models to compare the ability of logistic regression and GBM to estimate the treatment effect and balance covariates in propensity score weighting to provide some reference for the processing of confounding factors of various complex data. In Chongqing, Coronary heart disease (CHD) was included in special disease management in 2012. The reimbursement rates of different types of medical insurance are different [26,27], so we wanted to analyse the independent impact of medical insurance between urban employees and urban-rural residents on drug costs of outpatients with CHD. However, because of the large sample size of outpatient data and the uneven baseline, we used this case study as an empirical analysis.

Basic assumptions and the average treatment effect
As proposed originally by Rosenbaum PR, et al. [28], the propensity score refers to the conditional probability that members  is also called a balancing score and X is a vector of observed baseline covariates that may impact the selection of the treatment and outcome variables. Treatment assignments are independent of the observed baseline covariates given the propensity score . Therefore, if the propensity scores of the treatment and non-treatment groups are similar, each member has the same probability of being assigned to the treatment group as in a randomized experiment, even though they have different values of some covariates. If the strongly ignorable treatment assumption (SITA) is achieved, that is, the potential outcomes of the treatment ( 1 Y ) and non-treatment ( 0 Y ) groups are independent of the selection of treatment , then an unbiased estimate of the average treatment effect (ATE) is written as Propensity score weighting aims to assign a weight to every member such that the weights represent the whole. The method of estimating ATE is known as the inverse probability of treatment weighting (IPTW). The weight for the treatment group is ) ( / 1 X e and for the non-treatment group is 1 / (1 ( )) e X − . Instead of creating similar propensity between two groups, IPTW creates a weighted analysis with unequal weights, which is easy to implement. Therefore, if propensity scores are properly estimated, the weights can explain the difference in the observed covariate distribution between the treatment and non-treatment groups, and ATE is equal to the weighted average of the difference between the outcomes of the two groups, that is, [29]. When ) ( i X e is estimated by machine learning, no explicit formula may exist for variance estimation. To show the weights more clearly, Imbens GW [30] proposed an alternative standardized estimator for ATE, which

Journal of Epidemiology and Public Health Reviews
Open Access Journal used to compute a continuous outcome variable (Y ) from the following regression equation: . The true treatment effect (τ ) was equal to 1.5, which was known in advance. The error terms for the treatment ( v ) and outcome ( u ) variables, which represented the amount of unexplained variance after accounting for independent variables, were both generated from normal distributions with mean 0 and variance 1. According to SITA, when 5 X was correlated with the outcome error ( u ) but the treatment error ( v ) and outcome error ( u ) were uncorrelated, no hidden bias existed in specifying the conditioning model. The correlation coefficients for distinct pairs of variables among 1 X , 3 X , and 5 X and u were 0.2 ( X X − ) were considered to reduce potential variability in the simulated data. In this study, two scenarios were considered to estimate ATE using IPTW, and propensity scores were estimated using logistic regression and GBM. The two scenarios were as follows: The relationship between the treatment variable and baseline covariates was linear, that is, the correct conditional model for predicting propensity scores was , and the outcome regression model was

Scenario 2:
The relationship between the treatment variable and baseline covariates was nonlinear. Additionally, there were no interaction effects and quadratic terms, that is, the correct conditional model for predicting propensity scores was , and the outcome regression model was

Statistical analysis
All analyses were conducted in the R software environment (version 3.5.3) and STATA (version 15.0). IPTW with GBM was implemented using the ps function from the twang package, and IPTW with logistic regression was implemented using the dxwts function. All other analyses were conducted in STATA. The following were used as metrics to assess the different conditioning models: Average standardized absolute mean difference: used to test the balance of baseline covariates, a low ASAM indicates that specific values of baseline covariates between two groups are similar; The performance of estimated treatment effects: point estimation ( W α ), bias, relative bias (RB) and standard error of the effect estimation (SE); Mean Square Error (MSE): the variation in the sampling distribution of estimated treatment effects, a smaller value indicates higher model accuracy and less variation; 95% CI coverage rate: the percentage of the 95% confidence interval estimated by the model containing the true treatment effect; Distribution of weights: propensity scores range from 0 to 1. Values close to 0 or 1 entail extreme weights that reduce the accuracy of IPTW.
All indicators were calculated as the average of 1000 results.

Simulation results
The treatment effect and accuracy of the models: When the conditioning model was linear, the RB of logistic regression (N=500:0.20%, N=2000:0.17%) was slightly lower than that of GBM (N=500:0.28%, N=2000:0.27%). The average standard error, MSE and 95% CI coverage rate were similar, and the standard error and MSE decreased with increasing sample size, while the 95% CI coverage rate decreased with increasing sample size. When the conditioning model was nonlinear, the RB of logistic regression (N=500:0.55%, N=2000:0.77%) increased significantly, and many outliers were observed in the distribution of bias, while GBM remained stable (N=2000:0.37%). The average standard error (N=500:0.1401, N=2000:0.0909) and MSE (N=500:0.0428, N=2000:0.0175) of logistic regression increased significantly and were higher than those of GBM under the same sample size. The 95% CI coverage rate (N=500:81.7%, N=2000:82.7%) decreased significantly and was lower than that of GBM. The larger the sample size was, the better the results were (Table 1 and Figure 1).

The balance of baseline covariates:
The distribution of ASAM in the original sample was highly imbalanced between the two groups (scenario 1: average ASAM>0.2; scenario 2: average ASAM>0.4). When the conditioning model was linear, balanced baseline covariates could be achieved. Compared with GBM, the average ASAM of logistic regression was low (N=500:0.048, N=2000:0.025). When the conditioning model was nonlinear, the ability to achieve covariate balance based on GBM was better than that of logistic regression, and the average ASAM of logistic regression (N=500:0.738, N=2000: 0.452) increased significantly, even higher than in the original sample (N=500:0.421, N=2000:0.273). In addition, logistic regression had a large number of high outliers in both scenarios (Table 1 and Figure 2).

Distribution of weights:
When the conditioning model was linear, the distributions of weights by logistic regression and GBM were similar under the same sample size, all of which are centred on 1. When the conditioning model was nonlinear, logistic regression generated a large number of extremely high weights under different sample sizes. The average weights were 22.07 and 5.79, and the average maximum weights were 10126.87 and 7462.96, respectively. By contrast, the weights generated by GBM remained stable. The average weights were 1.45 and 1.56, and the average maximum weights were 16.17 and 44.09, respectively ( Table 2).

Case study
Outpatient data of patients with CHD were selected to examine the performance of IPTW with the different methods for estimating ATE. The outpatient data from a tier-3 hospital in Chongqing collected from January 2016 to December 2018 consisted of demographic information and drug costs of patients. Inclusion criteria: hospital admissions of CHD were identified based on the first diagnosis with International Classification of Diseases 10 th revision (ICD-10) codes of I20-I25. Exclusion criteria: combination of other diseases. The study was intended to analyse the independent impact of different types of medical insurance on drug costs in outpatients with CHD, because the cost is also affected by other demographic factors, so propensity score weighting is used to control other confounding factors. The outcome variables of this study were the drug costs of outpatients with CHD, exposure variables were different types of medical insurance, potential confounding factors included gender, age, category of drug, drug's zero-profit policy doctor's title and type of department and 16575 people were included, of which 15136 were ensured with Urban Employee Basic Medical Insurance (UEBMI) and 1439 with Urban-Rural Resident Basic Medical Insurance (URRBMI).    (P<0.05). After IPTW with logistic regression, the maximum and average ASAM were 0.090 and 0.030, respectively. The average weight was 2.00, and the category of drugs and drug's zero-profit policy were different between the two groups (P<0.05). After IPTW with GBM, the maximum and average ASAM were 0.081and 0.027, respectively. The average weights were 1.95, and no difference in any variables between the two groups was observed (P>0.05) ( Table 3). GBM performed better than logistic regression, so IPTW with GBM was used in the subsequent analysis.

Estimation of ATE:
After weighting, the drug costs of outpatients with UEBMI increased by 256.35 Yuan on average compared with those of outpatients with URRBMI. The difference was statistically significant (Table 4).

Sensitivity analysis:
For sensitivity analysis, we repeated the analyses after excluding patients in the tails of the distribution of propensity scores [38]. And the conclusions were consistent with those of the primary analyses (Table 5).

Discussion and Conclusion
The main objective of IPTW is to achieve covariate balance between treatment and non-treatment groups to obtain an effective estimation of the treatment effect. Most methods use logistic regression to predict propensity scores, but the model assumptions may not be valid. Therefore, the main purpose of this study is to compare the ability of logistic regression and GBM to estimate the treatment effect and covariate balance in propensity score weighting under linear or nonlinear conditioning models. Although Jacqueline M Burgette, et al., Xin M, Fullerton B and other researchers [39][40][41] compared logistic regression with GBM, they did not consider the interaction effect and quadratic relationships among the treatment variable and covariates. Moreover, this study assessed the covariate balance and distribution of weights to make the results more reliable. In addition, the design ensured that no hidden bias existed in estimating the conditioning model, that is, the source of the difference between the two groups could be determined and propensity score weighting could be used to control the bias.
The results showed that both logistic regression with only main effects and GBM usually achieved covariate balance and provided acceptable estimation of treatment effects, regardless of sample size. The distribution of weights was centralized with 1 as the centre, the 95% CI coverage rate was close to 95%, and the accuracies of the models were similar. Abdia Y, et al., Pirracchio R, et al. [42,43] also noted that logistic regression produced lower ASAM and smaller bias than did machine learning under a linear model. However, when the model did not consider the interaction effect and quadratic terms, regardless of sample size, the ASAM values estimated by logistic regression were higher than those of the unweighted values. Moreover, the skewed distribution produced a large number of high outliers and failed to achieve balanced covariates. These results were similar to those observed by Abdia Y, et al., Lee BK, et al. and his colleagues [42,44], where the estimated weights were highly scattered. For example, when N=2000, the range was from 1 to 7462.96. The main reason may be that the degree of overlap on covariates between two groups is not high, and only a few members of any group can replace other members in the other group. Another study showed that if the estimated weights are highly variable, the weights might  lead to poor accuracy of the ATE [45][46][47][48]. The bias in this scenario was approximately 3 times larger than that in the linear scenario, the standard error was approximately 2 times larger, the MSE was approximately 8 times larger, and the 95% CI coverage rate was reduced by 10%. By contrast, the ASAM estimated by GBM remained steady and had a more symmetrical distribution. GBM provided good performance in terms of covariate balance. The weight distribution remained stable (e.g., when N=500, the range was from 1 to 16.17), the 95% CI coverage rate was still greater than 90%, and other indicators were slightly worse. However, the performance of these indicators was better when N=2000, possibly because machine learning itself was more suitable for larger sample sizes. In addition, when the nonlinear relationship between covariates in the conditioning model was excessively complex, the logistic regression model could not be fitted.
These results indicate that it may be important to balance covariates with interaction effects and quadratic terms, especially when the real model itself has interaction effects and quadratic terms. Alam [45,52]. GBM is a strong nonparametric learning algorithm [53]. Many of its features are beneficial to propensity score methods. The regression tree, a basic classification algorithm, is iterated continuously to find the optimal function combination form and to prevent over fitting [54]. A smaller shrinkage coefficient can be used to reduce the variability. By optimizing the number of iterations and piecewise constant model, the variability of propensity score is reduced, and the generated weights are more uniform [44]. Therefore, it may be better to use IPTW with GBM to control confounding factors in the case of an unknown relationship between the treatment variable and covariates.
In the empirical study, GBM was superior to logistic regression in terms of ASAM and average weights in IPTW. Two variables remained imbalanced between the two groups by IPTW with logistic regression, whereas all variables were balanced by IPTW with GBM. The outcome analysis showed that the drug costs of outpatients with CHD with UEBMI were 256.35 Yuan higher on average than those of outpatients with URRBMI. The proportion of medical insurance reimbursement for urban employees is higher than that for urban-rural residents, which leads to higher demand for medical services. The basic purpose of medical insurance is to lighten the economic burden of disease on individuals and reflect fairness [55,56]. Therefore, the gap between different types of medical insurance must be narrowed [57,58]. Medical staff should standardize the behaviour of diagnosis and treatment and rationally used rugs to reduce the burden of patients, especially for patients with chronic diseases who take medicines for a long time. While ensuring the quality of medical care, it is responsible to help patients clarify relevant medical insurance policies and make rational use of health resources. This study focuses on the performance of logistic regression and GBM in propensity score weighting without considering other propensity score methods, and only the interaction effect and quadratic terms are included in this simulation study without higher-order polynomials. These considerations must be addressed in future research.