Proceedings
See what was discussed at the 2024 UK Stata Conference.
Balance and variance inflation checks for completeness-propensity weights
​
Roger Newson
Cancer Prevention Group, School of Cancer & Pharmaceutical Sciences, King's College London
Inverse treatment-propensity weights are a standard method for adjusting for predictors of exposure to a treatment. As a treatment-propensity score is a balancing score, it makes sense to do balance checks on the corresponding treatment-propensity weights. It is also a good idea to do variance-inflation checks, to estimate how much the propensity weights might inflate the variance of an estimated treatment effect, in the pessimistic scenario in which the weights are not really necessary. In Stata, the SSC package somersd can be used for balance checks, and the SSC package haif can be used for variance-inflation checks. It is argued that balance and variance-inflation checks are also necessary in the case of completeness-propensity weights, which are intended to remove imbalance in predictors of completeness between the subsample with complete data and the full sample of subjects with complete or incomplete data. However, the usage of somersd, scsomersd, and haif must be modified, because we are removing imbalance between the complete sample and the full sample, instead of between the treated subsample and the untreated subsample. An example will be presented, from a clinical trial in which the author was involved, in which nearly a quarter of randomized subjects had no final outcome data. A post-hoc sensitivity analysis is presented, using inverse completeness-propensity weights.
Using GitHub for collaborative analysis
​
Chloe Middleton-Dalby, Liane Gillespie-Akar
Adelphi Real World
Recent trends have placed increasing importance on formal quality-control processes for analysis conducted within the pharmaceutical industry and beyond. While a key feature of Stata is reproducibility through do-files and automated reporting, there are limited built-in tools for version control, code review, and collaborative analysis.
Git is a distributed version control system, widely used by software development teams for collaborative programming, change tracking, and enforcement of best practices. Git keeps a record of all changes to a codebase over time, providing the ability to easily revert to a previous state, manage temporary branches, and combine code written by multiple people. Services such as GitHub build on the Git framework, providing tools to conduct code review, host source files, and manage projects.
We present an overview of Git and GitHub and explain how we use it for Stata projects at Adelphi Real World: an organisation specialising in the collection and analysis of real-world healthcare data from physicians, patients, and caregivers. We share an example project to outline the benefits of code review for both data integrity and as a training tool.
We also discuss how, through implementing a software-development-like approach to the creation of ado-files, we can enhance the process of creating new programs in Stata and gain confidence in the robustness and quality of our commands.
My favourite overlooked life savers in Stata​
​
Jan Ditzen
Free University of Bozen – Bolzano
​
Everyone loves a good testing, estimation, or graphical community-contributed package. However, a successful empirical project relies on many small, overlooked, but priceless programs. In this talk I will present three of my personal life savers.
1. adotools: adotools has four main uses. It allows the user to create and maintain a library of adopaths. Paths can be dynamically added to and removed from a running Stata session. When removing an ado-path, all ado-programs located in the folder are cleared from memory. adotools can also reset all user-specified adopaths.
2. psimulate2: ever wanted to run Monte Carlo simulations in parallel? With psimulate2 you can, and there are (almost) no setup costs at all. psimulate2 splits the number of repetitions into equal chunks, spreads them over multiple instances of Stata, and reduces the time to run Monte Carlo simulations. It also allows macros to be returned and can save and append simulation results directly into a .dta file or frame. It can be run on Windows, Unix, and Mac.
3. xtgetpca: Extracting principal components in panel data is common; however, no Stata solution exists. xtgetpca fills this gap. It allows for different types of standardization, removal of fixed effects, and unbalanced panels.
Professional statistical development: What, why, and how
​
Yulia Marchenko
Vice President of Statistics and Data Science, StataCorp
​
In this presentation, I will talk about professional statistical software development in Stata and the challenges of producing and supporting a statistical software package. I will share some of my experience on how to produce high-quality software, including verification, certification, and reproducibility of the results, and on how to write efficient and stable Stata code. I will also discuss some of the aspects of commercial software development such as clear and comprehensive documentation, consistent specifications, concise and transparent output, extensive error checks, and more.
Stata: a short history viewed through epidemiology
Bianca de Stavola
University College London
​In this talk, I will use personal recollections to revisit the challenges many public health researchers have faced since the birth of Stata in 1985. I will discuss how, from the 1990s onwards, the increasing demands for data management and analysis were met by Stata developers and the broader Stata community, particularly Michael Hills. Additionally, I will review how Stata's expansion in scope and capacity with each new version has enhanced our ability to train new generations of medical statisticians and epidemiologists. Finally, I will reflect on current and future challenges.
compmed: A new command for estimating causal mediation effects with non-adherence to treatment allocation
​
Anca Chis Ster, Sabine Landau, Richard Emsley
King's College London
In clinical trials, a standard intention-to-treat analysis will unbiasedly estimate the causal effect of treatment offer, but it ignores the impact of participant non-adherence. To account for this, one can estimate a complier-average causal effect (CACE), the average causal effect of treatment receipt in the principal stratum of participants who would comply with their randomisation allocation. Evaluating how interventions lead to changes in the outcome (the mechanism) is also key for the development of more effective interventions. A mediation analysis aims to decompose a total treatment effect into an indirect effect, one that operates via changing the mediator, and a direct effect. To identify mediation effects with non-adherence, it has been shown that the CACE can be decomposed into a direct effect, the Complier-Average Natural Direct Effect (CANDE), and a mediated effect, the Complier-Average Causal Mediated Effect (CACME). These can be estimated with linear structural equation models (SEMs) with instrumental variables. However, obtaining estimates of the CACME and CANDE in Stata requires (1) correct fitting of the SEM in Stata and (2) correct identification of the pathways that correspond to the CACME and CANDE. To address these challenges, we introduce a new command, compmed, which allows users to perform the relevant SEM fitting for estimating the CACME and CANDE using a single, more intuitive, and user-friendly interface. compmed requires the user to specify only the continuous outcome, continuous mediator, treatment receipt, and randomisation variables. Estimates, standard errors, and 95% confidence intervals are reported for all effects.
Causal Mediation
​
Kristin MacDonald
Executive Director of Statistical Services, StataCorp
​
Causal inference aims to identify and quantify a causal effect. With traditional causal inference methods, we can estimate the overall effect of a treatment on an outcome. When we want to better understand a causal effect, we can use causal mediation analysis to decompose the effect into a direct effect of the treatment on the outcome and an indirect effect through another variable, the mediator. Causal mediation analysis can be performed in many situations: the outcome and mediator variables may be continuous, binary, or count, and the treatment variable may be binary, multivalued, or continuous.
In this talk, I will introduce the framework for causal mediation analysis and demonstrate how to perform this analysis with the mediate command, which was introduced in Stata 18. Examples will include various combinations of outcome, mediator, and treatment types.
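As a rough sketch of what such a call looks like under the Stata 18 syntax as I understand it (hypothetical variable names: y a continuous outcome, m a mediator, t a binary treatment, x1 and x2 covariates), rather than a reproduction of the talk's own examples:

```stata
* Illustrative sketch with hypothetical variable names
mediate (y x1 x2) (m x1 x2) (t)          // continuous outcome and mediator (linear models)
mediate (y x1 x2) (m x1 x2, logit) (t)   // binary mediator modelled with logit
estat proportion                          // postestimation: proportion of the total effect mediated
```

The three parenthesised parts are the outcome model, the mediator model, and the treatment variable, which is the general pattern the command follows across the outcome, mediator, and treatment types listed above.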
Imputation when data cannot be pooled
​
Nicola Orsini, Robert Thiesmeier, Matteo Bottai
Karolinska Institutet
Distributed data networks are increasingly used to study human health across different populations and countries. Analyses are commonly performed at each study site to avoid the transfer of individual data between study sites due to legal and logistical barriers. Despite many benefits, however, a frequent challenge in such networks is the absence of key variables of interest at one or more study sites. Current imputation methods require the availability of individual data from the involved studies to impute missing data. This creates a need for methods that can impute data in one study using only information that can be easily and freely shared within a data network. To address this need, we introduce a new Stata command -mi impute from- designed to impute missing variables in a single study using a linear predictor and the related variance/covariance matrix from an imputation model estimated from one or multiple external studies. In this article, the syntax of -mi impute from- will be presented along with motivating examples from health-related research.
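A heavily hedged sketch of the workflow described above, with hypothetical variable names; the b() and v() option names and the placement of the predictors are assumptions for illustration and should be checked against the command's help file:

```stata
* At the external study site, where x is observed: fit the imputation model
* and share only its coefficients and variance-covariance matrix.
regress x z1 z2
matrix b_ext = e(b)
matrix V_ext = e(V)

* At the local study site, where x is missing: impute from the shared matrices
* alone, without transferring any individual-level data.
* (Option names b() and v() are illustrative assumptions.)
mi set wide
mi register imputed x
mi impute from x z1 z2, add(20) b(b_ext) v(V_ext)
```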
Thirty graphical tips Stata users should know, revisited
Nicholas J. Cox
University of Durham
​
In 2010 I gave a talk at the London meeting presenting thirty graphical tips. The display materials remain accessible on Stata's website, but they are awkward to view, as they are based on a series of .smcl files.
I will recycle the title and some of the tips, and add new ones, covering some of what you, your students, or your research team should know when coding graphics for mainstream tasks. The theme of "thirty" matches this 30th London meeting and, to a good enough approximation, my 33 years as a Stata user.
The talk mixes examples from official and community-contributed commands and details both large and small.
Fancy graphics: Small multiples carpentry
Philippe Van Kerm
University of Luxembourg
​
Using 'small multiples' in data visualization and statistical graphics consists of combining repeated, small-sized diagrams to display variations in data patterns or associations across a series of units. Sometimes the small multiples are mere replications of identical plots, but with different plot elements highlighted. Small displays are typically arranged on a grid, and the overall appearance is, as Tufte puts it, akin to the sequence of frames of a movie when the ordering follows a time dimension. Creating diagrams for use in gridded 'small multiples' is easy with Stata's graph combination commands. The grid pattern can, however, be limiting. The talk will present tips and tricks for building small-multiple diagrams and illustrate some coding strategies for arranging individual frames in the most flexible way, opening up some creative possibilities of data visualization.
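As a minimal, self-contained sketch of the basic carpentry (using the auto dataset shipped with Stata; the talk's own examples are more elaborate):

```stata
* Build one small panel per repair-record group, then assemble them on a grid
sysuse auto, clear
local panels
foreach r in 3 4 5 {
    twoway scatter mpg weight if rep78 == `r',        ///
        title("rep78 = `r'", size(small))             ///
        ylabel(10(10)40) xlabel(2000(1000)5000)       ///
        name(p`r', replace) nodraw
    local panels `panels' p`r'
}
graph combine `panels', rows(1) ycommon xcommon       ///
    title("Mileage against weight, by repair record")
```

The rigid rows/columns layout of graph combine is exactly the limitation the talk addresses, so the tips concern going beyond this default grid.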
Scalable high-dimensional non-parametric density estimation, with Bayesian applications
​
Robert Grant
BayesCamp Ltd
​
Few methods have been proposed for flexible, non-parametric density estimation, and they do not scale well to high-dimensional problems. We describe a new approach based on smoothed trees called the kudzu density (Grant 2022). This fits the little-known density estimation tree (Ram & Gray 2011) to a dataset and convolves the edges with inverse logistic functions, which are in the class of computationally minimal smooth ramps.
New Stata commands provide tree fitting, kudzu tuning, estimates of joint, marginal and cumulative densities, and pseudo-random numbers. Results will be shown for fidelity and computational cost. Preliminary results will also be shown for ensembles of kudzu under bagging and boosting.
Kudzu densities are useful for Bayesian model updating where models have many unknowns and require rapid updating, datasets are large, and posteriors have no guarantee of convexity or unimodality. The input “dataset” is the posterior sample from a previous analysis. This is demonstrated with a large real-life dataset. A new command outputs code to use the kudzu prior in bayesmh evaluators, BUGS, and Stan.
Robust testing for serial correlation in linear panel-data models
​
Sebastian Kripfganz
University of Exeter Business School
​
Serial correlation tests are an essential part of standard model specification toolkits. For static panel models with strictly exogenous regressors, a variety of tests is readily available. However, their underlying assumptions can be very restrictive. For models with predetermined or endogenous regressors, including dynamic panel models, the Arellano and Bond (1991, Review of Economic Studies) test is predominantly used, but it has low power against certain alternatives. While more powerful alternatives exist, they are underused in empirical practice. The recently developed Jochmans (2020, Journal of Applied Econometrics) portmanteau test yields substantial power gains when the time horizon is very short, but it can quickly lose its advantage even for time dimensions that are still widely considered small.
I propose a new test based on a combination of short and longer differences, which overcomes this shortcoming and can be shown to have superior power against a wide range of stationary and nonstationary alternatives. Unlike the Arellano-Bond test, it does not lose power as the process under the alternative approaches a random walk, and unlike the Jochmans portmanteau test, it is robust to large variances of the unit-specific error component. I present a new Stata command that flexibly implements these (and more) tests for serial correlation in linear error-component panel-data models. The command can be run as a postestimation command after a variety of estimators, including generalized method of moments, maximum likelihood, and bias-corrected estimation.
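For orientation, the predominantly used Arellano-Bond test is already available in official Stata as a postestimation test after a dynamic panel fit; a sketch of that familiar baseline, which the new command is positioned against, is:

```stata
* Baseline: the Arellano-Bond serial-correlation test after a dynamic panel fit
webuse abdata, clear          // Arellano-Bond employment data shipped with Stata
xtset id year
xtabond n w k, lags(2) vce(robust)
estat abond                   // tests for AR(1) and AR(2) in the first-differenced errors
```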
Estimating the wage premia of refugee immigrants
​
Kit Baum
Boston College
In this case study, we examine the wage earnings of fully employed previous refugee immigrants in Sweden. Using administrative employer-employee data from 1990 onwards, about 100,000 refugee immigrants who arrived between 1980 and 1996 and were granted asylum are compared to a matched sample of native-born workers using coarsened exact matching. Applying recentered influence function (RIF) quantile regressions to wage earnings for the period 2011–2015, the occupational-task-based Oaxaca–Blinder decomposition approach shows that refugees perform better than natives at the median wage, controlling for individual and firm characteristics. The RIF-quantile approach provides better insights for the analysis of these wage differentials than the standard regression model employed in earlier versions of the study.
The Oaxaca-Blinder decomposition in Stata: an update
​
Ben Jann
University of Bern
​
In 2008, I published the Stata command -oaxaca-, which implements the popular Oaxaca-Blinder (OB) decomposition technique. This technique is used to analyze differences in outcomes between groups, such as the wage gap by gender or race. Over the years, both the functionality of Stata and the literature on decomposition methods have evolved, so an update of the -oaxaca- command is now long overdue. In this talk I will present a revised version of -oaxaca- that uses modern Stata features such as factor-variable notation and supports additional decomposition variants that have been proposed in the literature (e.g., reweighted decompositions or decompositions based on recentered influence functions).
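For readers unfamiliar with the existing command, a minimal call looks like the sketch below (hypothetical wage-gap variables); factor-variable notation is one of the additions described for the revised version, so it is not shown here:

```stata
* Twofold Oaxaca-Blinder decomposition of a gender gap in log wages
ssc install oaxaca
oaxaca lnwage educ exper tenure, by(female) pooled   // pooled model as reference coefficients
```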
Visualisations to evaluate and communicate adverse event data in randomised controlled trials
​
Rachel Phillips
Imperial College London
​
Introduction: Well-designed visualisations are a powerful way to communicate information to a range of audiences. In randomised controlled trials (RCTs), where there is an abundance of complex data on harms (known as adverse events), visualisations can be a highly effective means to summarise harm profiles and identify potential adverse reactions. Trial reporting guidelines such as the CONSORT extension for harms encourage the use of visualisations for exploring harm outcomes, but research has demonstrated that their uptake is extremely low.
Methods: To improve the communication of adverse event data collected in RCTs, we developed recommendations to help trialists decide which visualisations to use to present these data. We developed Stata commands (aedot and aevolcano) to produce two of the visualisations, the volcano and dot plot, to present adverse event data, with the aim of easing implementation and promoting increased uptake.
Results: In this talk, using clinical examples, we will introduce and demonstrate the application of these commands. We will contrast the resulting visual summaries from the volcano and dot plot with traditional non-graphical presentations of adverse event data, with examples from the published literature, with the aim of demonstrating the benefits of graphical displays.
Discussion: Visualisations offer an efficient means to summarise large amounts of adverse event data from RCTs and statistical software eases the implementation of such displays. We hope that development of bespoke Stata commands to create visual summaries of adverse events will increase uptake of visualisations in this area by the applied clinical trial statistician.
Optimising adverse event analysis in clinical trials when dichotomising continuous harm outcomes
​
Victoria Cornelius, Odile Sauzet
Imperial College London
​​
Introduction: The assessment of harm in randomized controlled trials is vital to enable a risk-benefit assessment of the intervention under evaluation. Many trials undertake regular monitoring of continuous outcomes such as laboratory measurements, for example, blood tests. Typical practice in a trial analysis is to dichotomise this type of data into abnormal/normal categories based on reference values. Frequently, the proportions of participants with abnormal results are then compared between treatment arms using a chi-squared or Fisher's exact test, reporting a p-value. Because dichotomisation results in a substantial loss of the information contained in the outcome distribution, it increases the chance of missing an opportunity to detect signals of harm.
Methods: A solution to this problem is to use the outcome distribution in each arm to estimate the between-arm difference in proportions of participants with an abnormal result. This approach was developed by Sauzet et al. [1]; it protects against loss of information and retains statistical power.
Results: In this talk I will introduce the distributional approach and the associated user-written Stata command distdicho [2]. I will compare the original analysis of blood test results from a small-population drug trial in paediatric eczema with the results from the distributional approach and discuss inference from the trial based on these.
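For contrast, the conventional dichotomise-then-test analysis that the distributional approach improves on looks roughly like this (hypothetical variable names and reference limit):

```stata
* Conventional approach: dichotomise at a reference value, then compare arms
generate byte abnormal = labvalue > 40 if !missing(labvalue)
label define abn 0 "normal" 1 "abnormal"
label values abnormal abn
tabulate arm abnormal, row chi2 exact
```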
Implementing treatment-selection rules for multi-arm multi-stage trials using nstage
​
Babak Choodari-Oskooei, Alexandra Blenkinsop & Mahesh KB Parmar
MRC Clinical Trials Unit at UCL
​
Multi-arm multi-stage (MAMS) randomised trial designs offer an efficient and practical framework for addressing multiple research questions. Typically, standard MAMS designs employ pre-specified interim stopping boundaries based on lack of benefit and/or overwhelming efficacy. To facilitate implementation, we have developed the nstage suite of commands, which calculates the required sample sizes and trial timelines for a MAMS design.
In this talk, we introduce the MAMS selection design, which integrates an additional treatment-selection rule to restrict the number of research arms progressing to subsequent stages in the event that all demonstrate a promising treatment effect at interim analyses. The MAMS selection design streamlines the trial process by merging traditionally early-phase treatment selection with the late-phase confirmatory trial. As a result, it gains efficiency over the standard MAMS design by reducing overall trial timelines and required sample sizes. We present an update to the nstagebin Stata command which incorporates this additional layer of adaptivity and calculates the required sample sizes, trial timelines, and overall familywise type I error rate and power for MAMS selection designs.
Finally, we illustrate how a MAMS selection design can be implemented using the nstage suite of commands and outline its advantages, using the ongoing trials in surgery (ROSSINI-2) and maternal health (WHO RED).
References:
1) Choodari-Oskooei, B., et al. (2024), "Multi-arm multi-stage (MAMS) randomised selection designs", BMC Medical Research Methodology (in press). doi:10.1186/s12874-024-02247-w
2) Choodari-Oskooei, B., et al. (2023), The Stata Journal 23(3), 744–798. doi:10.1177/1536867X231196295
3) Choodari-Oskooei, B., et al. (2023), Clinical Trials 20(1), 71–80. doi:10.1177/17407745221136527
Advanced Bayesian survival analysis with merlin and morgana
​
Michael Crowther
Red Door Analytics
​
In this talk I will describe our latest work to bring advanced Bayesian survival analysis tools to Stata. Previously, we introduced the morgana prefix command (bayesmh in disguise), which provides a Bayesian wrapper for survival models fitted with stmerlin (merlin's more user-friendly wrapper designed for working with st data). We have now begun the work to sync morgana with the much more general merlin command, to allow for Bayesian multiple-outcome models. Within survival analysis, multiple outcomes arise when we consider competing risks or the more general setting of multi-state processes. Using an example in breast cancer, I will show how to estimate competing-risks and illness-death multi-state models within a Bayesian framework, incorporating prior information for covariate effects and baseline hazard parameters. Importantly, we have also developed the predict functionality to obtain a wide range of easily interpretable predictions, such as cumulative incidence functions and (restricted) life expectancy, along with their credible intervals.
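A heavily hedged sketch of the single-outcome workflow already described above (hypothetical variable names; the distribution() option reflects my reading of the stmerlin documentation and the bare prefix form of morgana is assumed, so both should be checked):

```stata
* Classical Weibull fit with stmerlin, then a Bayesian fit via the morgana prefix
stset time, failure(died)
stmerlin trt age, distribution(weibull)
morgana: stmerlin trt age, distribution(weibull)
```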
codefinder: optimising Stata for the analysis of large, routinely collected healthcare data​
​
Jonathan Batty, Marlous Hall
University of Leeds
​​
Routinely collected healthcare data (including electronic healthcare records and administrative data) are increasingly available at the whole-population scale and may span decades of data collection. These data may be analysed as part of clinical, pharmacoepidemiologic, and health services research, producing insights that improve future clinical care. However, the analysis of healthcare data on this scale presents a number of unique challenges. These include the storage of diagnosis, medication, and procedure codes using a number of discordant systems (including ICD-9 and ICD-10, SNOMED-CT, Read codes, etc.) and the inherently relational nature of the data (each patient has multiple clinical contacts, during which multiple codes may be recorded). Pre-processing and analysing these data using optimised methods has a number of benefits, including minimisation of computational requirements, analytic time, carbon footprint, and cost.
We will focus on one of the main issues faced by the healthcare data analyst: how to most efficiently collapse multiple, disparate diagnosis codes (stored as strings across a number of variables) into a discrete disease entity, using a pre-defined code list. A number of approaches (including the use of Boolean logic, the inlist function, string functions, and regular expressions) will be sequentially benchmarked in a large, real-world healthcare dataset (n = 192 million hospitalisation episodes during a 12-year period; approximately 1 terabyte of data). The time and space complexity of each approach, in addition to its carbon footprint, will be reported. The most efficient strategy has been implemented in our newly developed Stata command, codefinder, which will be discussed.
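To make the comparison concrete, the competing approaches look roughly like the sketch below for a hypothetical code list (acute myocardial infarction, ICD-10 codes I21/I22) stored across string variables diag1-diag20; each replace line is an alternative way of flagging the same condition, not steps to be combined:

```stata
generate byte ami = 0
forvalues i = 1/20 {
    * (a) inlist() on the truncated three-character code
    replace ami = 1 if inlist(substr(diag`i', 1, 3), "I21", "I22")
    * (b) plain string functions
    replace ami = 1 if strpos(diag`i', "I21") == 1 | strpos(diag`i', "I22") == 1
    * (c) regular expressions
    replace ami = 1 if regexm(diag`i', "^I2[12]")
}
```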
Data-driven decision making using Stata
​
Giovanni Cerulli
CNR-IRCRES
​
This presentation focuses on implementing a model in Stata for making optimal decisions in settings with multiple actions or options, commonly known as multi-action (or multi-arm) settings. In these scenarios, a finite set of decision options is available.
In the initial part of the presentation, I provide a concise overview of the primary approaches for estimating the reward or value function, as well as the optimal policy within the multi-arm framework. I outline the identification assumptions and statistical properties associated with optimal policy learning estimators. Moving on to the second part, I explore the analysis of decision risk. This examination reveals that the optimal choice can be influenced by the decision maker's risk attitude, specifically regarding the trade-off between the reward conditional mean and conditional variance.
The third part of the paper presents a Stata implementation of the model, accompanied by an application to real data.
Pattern matching in Stata: chasing the devil in the details
​
Mael Astruc-Le Souder
University of Bordeaux
​
The vast majority of quantitative statistics are now estimated through computer calculations. A computation script strengthens the reproducibility of these studies but requires care from researchers when writing their code to avoid various mistakes. This presentation introduces a command implementing checks that are otherwise foreign to a dynamically typed language such as Stata, in the context of data analysis. The command uses a new syntax, similar to switch or match expressions, to create a variable based on other variables in place of chains of 'replace' statements with 'if' conditions. More than the syntax, the real interest of this command lies in the two properties it checks for. The first is exhaustiveness: do the stated conditions cover all the possible cases? The second is usefulness: are all the conditions useful, or is there redundancy between branches? I borrow the idea of pattern matching from the Rust programming language and from the earlier OCaml implementation of the algorithm detailed in Maranget (2007) [1]. The command and source code are available on GitHub [2].
​
[1] Maranget, L. Warnings for pattern matching. Journal of Functional Programming. 2007;17(3):387–421. doi:10.1017/S0956796807006223
[2] https://github.com/MaelAstruc/stata_match
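As an illustration of the problem the command targets, the usual replace-with-if chain is easy to get subtly wrong, for instance by leaving missing values or boundary cases uncovered (hypothetical example):

```stata
* Conventional approach: easy to leave gaps (here, a missing age stays "")
generate str10 agegrp = ""
replace agegrp = "child"  if age <  18
replace agegrp = "adult"  if age >= 18 & age < 65
replace agegrp = "senior" if age >= 65 & !missing(age)
```

Exhaustiveness and usefulness checks are intended to catch exactly these kinds of gaps and dead branches at the time the mapping is written.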
Relationships among recent difference-in-differences estimators and how to compute them in Stata​
​
Jeffrey Wooldridge
Michigan State University
​
I will provide an overview of the similarities and differences among popular estimators in the context of staggered interventions with panel data, illustrating how to compute the estimates, as well as interpret them, using built-in and community-contributed Stata commands.
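As a hedged sketch of the kinds of commands likely to be covered (hypothetical panel variables: id, year, outcome y, binary treatment indicator d, and first_treat holding the first treated period, 0 if never treated); exact options should be checked against each command's help file:

```stata
xtset id year

* Official Stata (18+): heterogeneous difference-in-differences, AIPW flavour
xthdidregress aipw (y) (d), group(id)

* Community-contributed Callaway-Sant'Anna estimator
ssc install csdid
csdid y, ivar(id) time(year) gvar(first_treat)
estat event        // event-study style aggregation of the group-time effects
```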