
See what was discussed at the 2023 UK Stata Conference.

Customized Markdown and .docx tables using listtab and docxtab.

Roger Newson
Cancer Prevention Group, School of Cancer & Pharmaceutical Sciences, King's College London

Statisticians make their living producing tables (and plots). We present an update of a general family of methods for making customized tables, called the DCRIL path (decode, characterize, reshape, insert, list), with customized table cells (using the package sdecode), customized column attributes (using the package chardef), customized column labels (using the package xrewide), and/or customized inserted gap-row labels (using the package insingap), and listing these tables to automatically-generated documents. This demonstration uses the package listtab to list Markdown tables for browser-ready HTML documents, which Stata users like to generate, and the package docxtab to list .docx tables for printer-ready .docx documents, which our superiors like us to generate.
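As a language-neutral illustration of the target format that listtab produces for browsers, here is a minimal Python sketch of a pipe-delimited Markdown table (a hypothetical helper, not the listtab implementation):

```python
def markdown_table(headers, rows):
    """Render a list of rows as a pipe-delimited Markdown table."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)

table = markdown_table(["Variable", "Mean", "SD"],
                       [["age", 52.3, 11.7], ["bmi", 27.1, 4.9]])
print(table)
```

Rendered by any Markdown processor, this becomes a browser-ready HTML table.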
Multiply imputing informatively censored time-to-event data.
Ian R. White
Ian R. White and Patrick Royston, MRC Clinical Trials Unit at UCL, London, UK


Time-to-event data, such as overall survival in a cancer clinical trial, are commonly right-censored, and this censoring is commonly assumed to be non-informative. While non-informative censoring is plausible when censoring is due to end of study, it is less plausible when censoring is due to loss to follow-up. Sensitivity analyses for departures from the non-informative censoring assumption can be performed using multiple imputation under the Cox model [1]. These have been implemented in R [2] but are not commonly used. We propose a new implementation in Stata.

Our existing -stsurvimpute- command (on SSC) imputes right-censored data under non-informative censoring, using a flexible parametric survival model fitted by -stpm2-. We extend this to allow a sensitivity parameter gamma, representing the log of the hazard ratio in censored individuals versus comparable uncensored individuals (the informative censoring hazard ratio, ICHR). The sensitivity parameter can vary between individuals, and imputed data can be re-censored at the end-of-study time. Because the -mi- suite does not allow imputed variables to be -stset-, we create an imputed data set in -ice- format and analyse it using -mim-.

In practice, sensitivity analysis computes the treatment effect for a range of scientifically plausible values of gamma. We illustrate the approach using a cancer clinical trial.
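To make the role of gamma concrete, consider the simplest case of a constant baseline hazard (a deliberate simplification; -stsurvimpute- fits a flexible parametric model via -stpm2-). The residual time beyond a censoring time is then exponential, and the ICHR exp(gamma) simply multiplies the hazard of censored individuals. A hypothetical Python sketch:

```python
import math
import random

random.seed(1)

def impute_event_time(c, lam, gamma):
    """Impute an event time beyond censoring time c under a constant
    baseline hazard lam, multiplying the post-censoring hazard by the
    informative censoring hazard ratio exp(gamma)."""
    rate = lam * math.exp(gamma)
    return c + random.expovariate(rate)

# gamma = 0 reproduces non-informative censoring;
# gamma > 0 (ICHR > 1) shortens imputed residual survival.
n = 100_000
resid0 = [impute_event_time(2.0, 0.5, 0.0) - 2.0 for _ in range(n)]
resid1 = [impute_event_time(2.0, 0.5, math.log(2)) - 2.0 for _ in range(n)]
print(round(sum(resid0) / n, 2))  # mean residual time ~ 1/0.5 = 2.0
print(round(sum(resid1) / n, 2))  # mean residual time ~ 1/1.0 = 1.0
```

Repeating the imputation several times and combining analyses across the imputed datasets gives the multiple-imputation inference for a given gamma.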



  • [1] Jackson D, White IR, Seaman S, Evans H, Baisley K, Carpenter J. Relaxing the independent censoring assumption in the Cox proportional hazards model using multiple imputation. Statistics in Medicine. 2014; 33: 4681-94.

  • [2]

Influence Analysis with Panel Data using Stata.
Annalivia Polselli
Institute for Analytics and Data Science (IADS), and Centre for Micro-Social Change (MiSoC), University of Essex

The presence of units with extreme values in the dependent variable and/or independent variables (i.e., vertical outliers, good and bad leverage points) has the potential to severely bias least squares (LS) estimates, that is, regression coefficients and/or standard errors.


Diagnostic plots (such as leverage-versus-squared-residual plots) and measures of overall influence (e.g., Cook's (1979) distance) are usually used to detect such anomalies, but two problems arise from their use. First, the available commands for diagnostic plots are built for cross-sectional data, and some data manipulation is necessary for panel data. Second, Cook-like distances may fail to flag multiple anomalous cases in the data because they do not account for the pair-wise influence of observations (Atkinson, 1993; Chatterjee and Hadi, 1988; Rousseeuw, 1991; Rousseeuw and Van Zomeren, 1990; Lawrance, 1995). I overcome these limitations as follows. First, I formalise statistical measures to quantify the degree of leverage and outlyingness of units in a panel data framework and produce diagnostic plots suitable for panel data. Second, I build on Lawrance's (1995) pair-wise approach by proposing measures of joint and conditional influence suitable for panel data models with fixed effects.


I develop a method to: (i) visually detect anomalous units in a panel dataset and identify their type; and (ii) investigate the effect of these units on LS estimates and on other units' influence. I propose two user-written Stata commands to implement this method. xtlvr2plot produces a leverage-versus-residual plot suitable for panel data, together with a summary table listing the detected anomalous units and their types. xtinfluence calculates the joint and conditional influence and effects of pairs of units and generates network-style plots (the command offers a choice between a scatter plot and a heat plot).
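The classification behind a leverage-versus-residual plot can be sketched for a single cross-section; the measures in the talk generalize this to panels. The cutoffs below are common rules of thumb, not necessarily those used by xtlvr2plot:

```python
def classify_units(x, y, lev_cut=None, res_cut=2.0):
    """Classify points in a simple OLS fit: vertical outlier (large
    standardized residual), leverage point (h_i > 2*p/n by default),
    or bad leverage (both at once)."""
    n, p = len(x), 2
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    a = sum(y) / n - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)
    lev_cut = lev_cut if lev_cut is not None else 2 * p / n
    flags = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx        # leverage
        r = e / (s2 * (1 - h)) ** 0.5             # standardized residual
        high_r, high_h = abs(r) > res_cut, h > lev_cut
        flags.append("bad leverage" if high_r and high_h
                     else "vertical outlier" if high_r
                     else "good leverage" if high_h else "ok")
    return flags

# A made-up sample with one shifted response near the centre of x
print(classify_units(list(range(10)),
                     [0, 1, 2, 3, 20, 5, 6, 7, 8, 9])[4])  # vertical outlier
```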


  • JEL codes: C13, C15, C23.

A suite of programs for the design, development and validation of clinical prediction models.
Joie Ensor
Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK

An ever-increasing number of research questions focus on the development and validation of clinical prediction models to inform individual diagnosis and prognosis in healthcare. These models predict outcome values (e.g., pain intensity) or outcome risks (e.g., 5-year mortality risk) in individuals from a target population (e.g., pregnant women; cancer patients). Development and validation of such models is a complex process, with a myriad of statistical methods, validation measures and reporting options. It is therefore not surprising that there is considerable evidence of poor methodology in such studies.


In this talk I will introduce a suite of software packages with the prefix 'pm'. The pm-suite of packages aims to facilitate the implementation of methodology for building new models, validating existing models, and reporting transparently. All packages are in line with the recommendations of the TRIPOD guidelines, which provide a benchmark for the reporting of prediction models.


I will showcase a selection of packages to aid in each stage of the life cycle of a prediction model, from the initial design (e.g., sample size calculation using pmsampsize and pmvalsampsize), to development and internal validation (e.g., calculating model performance using pmstats), external validation (e.g., flexible calibration plots of performance in new patients using pmcalplot), and model updating (e.g., comparing updating methods using pmupdate).


Through an illustrative example I will demonstrate how these packages allow researchers to perform common prediction modelling tasks quickly and easily while standardising methodology.

Bayesian Model Averaging

Yulia Marchenko

Model uncertainty accompanies many data analyses. Stata's new bma suite, which performs Bayesian model averaging (BMA), helps address this uncertainty in the context of linear regression. Which predictors are important given the observed data? Which models are more plausible? How do predictors relate to each other across different models? BMA can answer these and more questions. BMA uses Bayes' theorem to aggregate results across multiple candidate models, accounting for model uncertainty during inference and prediction in a principled and universal way. In my presentation, I will describe the basics of BMA and demonstrate it with the bma suite. I will also show how BMA can become a useful tool for your regression analysis, Bayesian or not!

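One standard way to approximate the model weights that BMA needs is through the BIC under a uniform model prior; Stata's bma suite uses proper model and parameter priors, so the sketch below (with hypothetical BIC values) only illustrates the aggregation step:

```python
import math

def bma_weights(bics):
    """Approximate posterior model probabilities from BIC values under
    a uniform model prior:  w_k proportional to exp(-BIC_k / 2)."""
    m = min(bics)                       # stabilize the exponentials
    w = [math.exp(-(b - m) / 2) for b in bics]
    s = sum(w)
    return [wi / s for wi in w]

# Hypothetical BICs for three candidate regressions
w = bma_weights([100.0, 102.0, 110.0])
print([round(wi, 3) for wi in w])  # the best-fitting model dominates
```

A model-averaged prediction is then simply the weight-averaged prediction across the candidate models.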
Prioritizing clinically important outcomes using the win ratio.
John Gregson
John Gregson, Tim Collier, João Pedro Ferreira, Department of Medical Statistics, London School of Hygiene and Tropical Medicine


The win ratio is a statistical method for analyzing composite outcomes in clinical trials. Composite outcomes combine two or more distinct 'component' events (e.g., heart attacks, death) and are often analyzed using time-to-first-event methods, which ignore the relative importance of the component events. When using the win ratio, the component events are instead placed into a hierarchy from most to least important, so that more important components can be prioritized over less important ones (e.g., death, followed by myocardial infarction). The method works by first placing patients into pairs. Within each pair, one evaluates the components in order of priority, starting with the most important, until one member of the pair is determined to have the better outcome.


A major advantage of the approach is its flexibility: one can include in the hierarchy outcomes of different types (e.g., time-to-event, continuous, binary, ordinal, and repeat events). This can have major benefits, for example by allowing quality-of-life or symptom scores to be included as part of the outcome. This is particularly helpful in disease areas where recruiting enough patients for a conventional outcomes trial is unfeasible.

The win ratio approach is increasingly popular, but a barrier to more widespread adoption is a lack of good statistical software. The calculation of sample sizes is also complex and usually requires simulation. We present winratiotest, the first package to implement win ratio analyses in Stata. The command is flexible and user-friendly. Included in the package is the first software (we know of) that can calculate the sample size for win-ratio based trials without requiring simulation.
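The pairwise logic is easy to sketch. The following Python fragment (with fully observed, hypothetical outcomes; a real analysis such as winratiotest must also handle censoring, where some pairs remain undecided) computes the win ratio as total wins over total losses:

```python
def win_ratio(treated, control):
    """Compare every treated-control pair on a hierarchy of outcomes.
    Each subject is a tuple ordered from most to least important,
    coded so that larger is better; the first decisive component
    settles the pair."""
    wins = losses = 0
    for t in treated:
        for c in control:
            for tc, cc in zip(t, c):
                if tc > cc:
                    wins += 1
                    break
                if tc < cc:
                    losses += 1
                    break
            # tie on all components: neither a win nor a loss
    return wins / losses

# (survival time, -hospitalizations): death first, then repeat events
treated = [(10, 0), (8, -1), (7, 0)]
control = [(9, 0), (8, -2), (5, -1)]
print(win_ratio(treated, control))  # 6 wins / 3 losses = 2.0
```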

Object-Oriented Programming in Mata

Daniel C. Schneider

Max Planck Institute for Demographic Research, Rostock, Germany.

Object-oriented programming (OOP) is a programming paradigm that is ubiquitous in today's landscape of programming languages. OOP code proceeds by first defining separate entities (classes) and their relationships, and then letting them communicate with one another. Mata, Stata's matrix language, has such OOP capabilities. In comparison to some other object-oriented programming languages, like Java or C++, Mata offers a lighter implementation, striking a nice balance between feature availability and language complexity.


This talk explores OOP features in Mata by describing the code behind -dtms-, a community-contributed package for discrete-time multistate model estimation. Estimation in -dtms- proceeds in several steps, where each step can nest multiple results of the next level, thus building up a tree-like structure of results. The talk explains how this tree-like structure is implemented in Mata using OOP and what the benefits of using OOP for this task are. These include easier code maintenance via a more transparent code structure, shorter coding time, and an easier implementation of efficient calculations.


The talk will first provide simple examples of useful classes, e.g., a class that represents a Stata matrix in Mata, or a class that can grab, hold, and restore Stata e()-results. More complex relationships among classes will then be explored in the context of the tree-like results structure of -dtms-. While the topics covered include such technical-sounding concepts as class composition, self-threading code, inheritance, and polymorphism, an effort will be made to link these concepts to tasks that are relevant to Stata users who have gained, or are interested in gaining, an initial proficiency in Mata.

A Review of Machine Learning Commands in Stata: Performance and Usability Evaluation

Giovanni Cerulli
CNR-IRCRES, National Research Council of Italy, Research Institute on Sustainable Economic Growth

This paper provides a comprehensive survey reviewing machine learning (ML) commands in Stata. I systematically categorize and summarize the available ML commands in Stata and evaluate their performance and usability for different tasks such as classification, regression, clustering, and dimension reduction. I also provide examples of how to use these commands with real-world datasets and compare their performance. This review aims to help researchers and practitioners choose appropriate ML methods and related Stata tools for their specific research questions and datasets, and to improve the efficiency and reproducibility of ML analyses using Stata. I conclude by discussing some limitations and future directions for ML research in Stata.
On the shoulders of giants: Writing wrapper commands in Stata
Sebastian Kripfganz
University of Exeter

For repeated tasks, it is convenient to use commands with simple syntax that carry out more complicated tasks under the hood. These can be data management and visualization tasks or statistical analyses. Many of these tasks are variations or special cases of more versatile approaches. Instead of reinventing the wheel, wrapper commands build on existing capabilities by 'wrapping' around other commands. For example, certain types of graphs might require substantial effort when built from scratch using Stata's -graph twoway- commands, but this process can be automated with a dedicated command. Similarly, many estimators for specific models are special cases of more general estimation techniques, such as maximum likelihood or generalized method of moments estimators. A wrapper command can be used to translate relatively simple syntax into the more complex syntax of Stata's -ml- or -gmm- commands, or even directly into the underlying -optimize()- or -moptimize()- Mata functions.

Many official Stata commands can be regarded as wrapper commands, and often there is a hierarchical wrapper structure with multiple layers. For example, most commands for mixed-effects estimation of particular models are wrappers for the general -meglm- command, which itself wraps around the undocumented -_me_estimate- command, which then calls -gsem-, which in turn initiates the estimation with the -ml- package. The main purpose of the higher-layer wrappers is typically syntax parsing. With every layer, the initially simple syntax is translated into the more general syntax of the lower-layer command, but the user only needs to be concerned with the basic syntax of the top-layer command. Similarly, community-contributed commands often wrap around official or other community-contributed commands. They may even wrap around packages written for other programming environments, such as Python.
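The layering idea is language-independent. As a hypothetical Python analogy (not Stata code), a user-facing wrapper can translate a simple call into the objective-function syntax of a general-purpose minimizer, much as a Stata wrapper translates simple syntax into -ml- or -optimize()- calls:

```python
def minimize(f, x0, step=1e-4, tol=1e-10, max_iter=100_000):
    """Stand-in for a general optimizer such as Mata's -optimize()-:
    crude gradient descent with a central-difference derivative."""
    x = x0
    for _ in range(max_iter):
        grad = (f(x + step) - f(x - step)) / (2 * step)
        if abs(grad) < tol:
            break
        x -= 0.01 * grad
    return x

def ols_slope(xs, ys):
    """Wrapper: a simple user-facing call, translated into a call to
    the general routine by constructing the least-squares objective
    (slope-only regression through the origin)."""
    sse = lambda b: sum((y - b * x) ** 2 for x, y in zip(xs, ys))
    return minimize(sse, x0=0.0)

print(round(ols_slope([1, 2, 3], [2, 4, 6]), 4))  # 2.0
```

The wrapper's job here, as in Stata, is purely translation: the user never sees the objective function or the optimizer's options.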


In this talk, I discuss different types of wrapper commands and focus on practical aspects of their implementation. I illustrate these ideas with two of my own commands. The new -spxtivdfreg- wrapper adds a spatial dimension to the -xtivdfreg- command (Kripfganz and Sarafidis, 2021) for defactored instrumental variable estimation of large panel data models with common factors. The -xtdpdgmmfe- wrapper provides a simplified syntax for the GMM estimation of linear dynamic fixed-effects panel data models with the -xtdpdgmm- command.

gigs package - new egen extensions for international newborn and child growth standards

Simon Parker
Simon Parker (1), Linda Vesel (2), Eric Ohuma (1)
(1) - MARCH Centre, London School of Hygiene and Tropical Medicine, (2) - Ariadne Labs, Harvard T.H. Chan School of Public Health / Brigham and Women's Hospital, Boston, Massachusetts, USA

Children’s growth status is an important measure commonly used as a proxy indicator of advancements in a country’s health, human capital, and economic development. Understanding how and why child growth patterns have changed is necessary for characterising global health inequalities. Sustainable Development Goal 3.2 aims to reduce neonatal mortality to at most 12 deaths per 1,000 live births and under-5 mortality to at most 25 per 1,000 live births (WHO/UNICEF, 2019). However, large gaps remain in achieving these goals: currently 54 and 64 (of 194) countries will miss the targets for child (<5 years) and neonatal (<28 days) mortality, respectively (UN IGME, 2022). As infant mortality is strongly associated with non-optimal growth, accurate growth assessment using prescriptive growth standards is essential to reduce these mortality gaps.


A range of standards can be used to analyse infant growth. In newborns, size-for-gestational-age analysis of different anthropometric measurements is possible using the Newborn Size standards from the International Fetal and Newborn Growth Consortium for the 21st Century (INTERGROWTH-21st) project (Villar et al., 2014). In infants, growth analysis depends on whether the child is born preterm or at term: for term infants, the WHO Child Growth Standards are appropriate (WHO MGRS Group, 2006), whereas the INTERGROWTH-21st standards for postnatal growth apply to preterm infants (Villar et al., 2015). Unfortunately, many researchers apply these standards incorrectly, which can lead to inappropriate interpretations of growth trajectories (Perumal et al., 2022).


As part of the Guidance for International Growth Standards (GIGS) project, we are making a range of these tools available in Stata to provide explicit, evidence-based functions through which these standards can be implemented in research and clinical care. We therefore introduce several egen extensions for converting between anthropometric measurements and centiles/z-scores in WHO and INTERGROWTH-21st standards. We also describe several egen functions which classify newborn size and infant growth according to international growth standards.
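Conversions of this kind are typically based on the LMS method, in which a measurement x is turned into a z-score from age- and sex-specific L (skewness), M (median), and S (coefficient of variation) parameters. The sketch below uses hypothetical parameter values, not values from the WHO or INTERGROWTH-21st tables:

```python
import math

def lms_zscore(x, L, M, S):
    """Convert a measurement x to a z-score with the LMS method used
    by many growth references: z = ((x/M)**L - 1) / (L*S), with the
    log limit z = ln(x/M)/S when L == 0."""
    if L == 0:
        return math.log(x / M) / S
    return ((x / M) ** L - 1) / (L * S)

# Hypothetical L, M, S values for one age/sex stratum
z = lms_zscore(9.2, L=-0.35, M=8.6, S=0.11)
print(round(z, 2))  # 0.61
```

The inverse formula, x = M(1 + L*S*z)**(1/L), recovers a measurement (e.g., a centile curve) from a z-score.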



  • Perumal, N., E. O. Ohuma, A. M. Prentice, P. S. Shah, A. Al Mahmud, S. E. Moore, D. E. Roth. 2022. Implications for quantifying early life growth trajectories of term-born infants using INTERGROWTH-21st newborn size standards at birth in conjunction with World Health Organization child growth standards in the postnatal period. Paediatric and Perinatal Epidemiology (36)6: 839–850.

  • United Nations Inter-agency Group for Child Mortality Estimation (UN IGME). 2023. Levels & Trends in Child Mortality: Report 2022, Estimates developed by the United Nations Inter-agency Group for Child Mortality Estimation, United Nations Children’s Fund, New York.

  • Villar, J., L. C. Ismail, C. G. Victora, E. O. Ohuma, E. Bertino, D. G. Altman, A. Lambert, A. T. Papageorghiou et al. 2014. International standards for newborn weight, length, and head circumference by gestational age and sex: The Newborn Cross-Sectional Study of the INTERGROWTH-21st Project. The Lancet 384(9946): 857–868.

  • Villar, J., F. Giuliani, Z. A. Bhutta, E. Bertino, E. O. Ohuma, L. C. Ismail, F. C. Barros, D. G. Altman, et al. 2015. Postnatal growth standards for preterm infants: The Preterm Postnatal Follow-up Study of the INTERGROWTH-21st Project. The Lancet Global Health 3(11): e681–e691.

  • WHO Multicentre Growth Reference Study Group. 2006. WHO Child Growth Standards based on length/height, weight and age. Acta Paediatrica Suppl. 450: 76-85.

  • WHO/UNICEF. 2019. WHO/UNICEF discussion paper: The extension of the 2025 maternal, infant and young child nutrition targets to 2030. Accessed May 15th, 2023.

Plot suite – fast graphing commands for very large datasets 
Jan Kabatek
Melbourne Institute of Applied Economic and Social Research

This presentation showcases the functionality of the new ‘plot suite’ of graphing commands. The suite excels at visualizing very large datasets, enabling users to produce a variety of highly customizable plots in a fraction of the time required by Stata's native graphing commands.

pystacked and ddml: machine learning for prediction and causal inference in Stata
Mark Schaffer
Achim Ahrens (1) , Christian B. Hansen (2) , Mark E Schaffer (3) , Thomas Wiemann (4)
(1) - ETH Zürich, (2) - University of Chicago, (3) - Heriot-Watt University, (4) - University of Chicago

pystacked implements stacked generalization (Wolpert 1992) for regression and binary classification via Python’s scikit-learn.


Stacking is an ensemble method that combines multiple supervised machine learners — the "base" or "level-0" learners — into a single learner. The currently-supported base learners include regularized regression (lasso, ridge, elastic net), random forest, gradient boosted trees, support vector machines, and feed-forward neural nets (multilayer perceptron). pystacked can also be used to fit a single base learner and thus provides an easy-to-use API for scikit-learn’s machine learning algorithms.


ddml implements algorithms for causal inference aided by supervised machine learning, as proposed in "Double/debiased machine learning for treatment and structural parameters" (Chernozhukov et al., Econometrics Journal 2018). Five different models are supported, allowing for binary or continuous treatment variables and endogeneity in the presence of high-dimensional controls and/or instrumental variables. ddml is compatible with many existing supervised machine learning programs in Stata and, in particular, has integrated support for pystacked, making it straightforward to use machine learner ensemble methods in causal inference applications.
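The core double/debiased idea, partialling out high-dimensional controls with cross-fitted learners and then regressing residuals on residuals, can be sketched with a toy "learner" (binned means stand in for a real ML method; the data-generating process and all numbers are made up for illustration):

```python
import math
import random

random.seed(42)

def cross_fit_residuals(v, x, folds=2):
    """Cross-fitted residuals of v on x: the prediction for each fold
    is fit on the other folds, using a deliberately simple learner
    (means of v within 10 equal-width bins of x on [0, 1))."""
    n = len(v)
    res = [0.0] * n
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        bins = [[] for _ in range(10)]
        for i in train:
            bins[min(int(x[i] * 10), 9)].append(v[i])
        fit = [sum(b) / len(b) if b else 0.0 for b in bins]
        for i in range(n):
            if i % folds == k:
                res[i] = v[i] - fit[min(int(x[i] * 10), 9)]
    return res

# DGP: y = 0.5*d + f(x) + e,  d = g(x) + u   (true effect theta = 0.5)
n = 4000
x = [random.random() for _ in range(n)]
d = [2 * xi ** 2 + random.gauss(0, 1) for xi in x]
y = [0.5 * di + math.sin(6 * xi) + random.gauss(0, 1)
     for di, xi in zip(d, x)]

yres = cross_fit_residuals(y, x)
dres = cross_fit_residuals(d, x)
theta = sum(a * b for a, b in zip(yres, dres)) / sum(b * b for b in dres)
print(round(theta, 2))  # close to the true 0.5
```

In ddml the binned-means step is replaced by lasso, gradient boosting, or a pystacked ensemble, but the residual-on-residual regression is the same.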

Fitting the Skellam distribution in Stata
Vincenzo Verardi
Université libre de Bruxelles

The Skellam distribution is a discrete probability distribution for the difference between two independent Poisson-distributed random variables. It has been used in a variety of contexts, including sports scores and supply-and-demand imbalances in shared transportation. To the best of our knowledge, Stata does not support the Skellam distribution or Skellam regression. In this talk we plan to show how to fit the parameters of a Skellam distribution and a Skellam regression using Mata's optimize() function. The optimization problem is then packaged into a basic Stata command, which we also plan to describe.
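For reference, the Skellam probability mass function can be evaluated by direct convolution of the two Poisson laws, which is a handy check on any fitted model. A Python sketch (the talk instead fits the parameters by maximum likelihood via Mata's optimize()):

```python
import math

def skellam_pmf(k, mu1, mu2, terms=200):
    """P(X1 - X2 = k) for independent X1 ~ Poisson(mu1) and
    X2 ~ Poisson(mu2), by convolution: sum_j P(X1 = j + k) P(X2 = j)."""
    def pois(j, mu):
        return math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))
    return sum(pois(j + k, mu1) * pois(j, mu2)
               for j in range(max(0, -k), terms))

mu1, mu2 = 3.0, 2.0
probs = {k: skellam_pmf(k, mu1, mu2) for k in range(-20, 21)}
print(round(sum(probs.values()), 6))                   # 1.0
print(round(sum(k * p for k, p in probs.items()), 6))  # mean = mu1 - mu2 = 1.0
```

The mean mu1 - mu2 and variance mu1 + mu2 provide quick moment checks on estimated parameters.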
A short report on making Stata secure, and adding metadata, in a new data platform. 
Bjarte Aagnes
Cancer Registry of Norway, Department of Registration

The presentation has two parts. 



Part 1

Securing Stata in a secure environment: data access and logging.

At the CRN we are developing a secure environment for using Stata. We give a short description of this work, covering data access and the logging of data extractions (JDBC + Java plugins) and of Stata commands.



Part 2

Metadata using characteristics.

In the new solution, metadata are automatically attached to Stata dataset characteristics when a user fetches data from the data warehouse. We describe the implementation, along with some small utility programs for working with the metadata, and present examples of use.

Facilities for optimising and designing multi-arm multi-stage (MAMS) randomised controlled trials with binary outcomes.
Babak Choodari-Oskooei

Babak Choodari-Oskooei (1), Daniel J. Bratton (2), Mahesh KB Parmar (1)

1) MRC Clinical Trials Unit at UCL, University College London, UK, 2) Statistics and Data Science Innovation Hub, GlaxoSmithKline, Stevenage, UK

In this talk, we introduce two Stata programs, nstagebin and nstagebinopt, which can be used to facilitate the design of multi-arm multi-stage (MAMS) trials with binary outcomes. MAMS designs are a class of efficient and adaptive randomised clinical trials that have been used successfully in many disease areas, including cancer, TB, maternal health, COVID-19, and surgery. The nstagebinopt program finds a class of efficient “admissible” designs based on an optimality criterion, using a systematic search procedure. The nstagebin program calculates the stagewise sample sizes, trial timelines, and overall operating characteristics of a MAMS design with binary outcomes. Both programs allow the use of Dunnett's correction to account for multiple testing. We also use the ROSSINI 2 MAMS design, an ongoing MAMS trial in surgical wound infection, to illustrate the capabilities of both programs. The new Stata programs facilitate the design of MAMS trials with binary outcomes in which more than one research question can be addressed under one protocol.
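Part of why the multiple-testing correction matters: with K pairwise comparisons against a shared control, the familywise error rate grows quickly. The independence bound below overstates the error of a MAMS design, since Dunnett's correction exploits the positive correlation induced by the shared control arm, but it shows the scale of the problem:

```python
def fwer_independent(alpha, k):
    """Familywise type I error rate for k independent comparisons,
    each tested at per-comparison level alpha: 1 - (1 - alpha)**k."""
    return 1 - (1 - alpha) ** k

# Three research arms against one control, each tested at 5%
print(round(fwer_independent(0.05, 3), 4))  # 0.1426
```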




  • Choodari-Oskooei B, Bratton DJ, Parmar M. 2023. Facilities for optimising and designing multi-arm multi-stage (MAMS) randomised controlled trials with binary outcomes. Stata Journal, In press

How to check a simulation study
Tra My Pham
Ian R White, Tra My Pham, Matteo Quartano, Tim P Morris
MRC Clinical Trials Unit, UCL, London, UK

Simulation studies are a powerful tool in biostatistics, but they can be hard to conduct successfully. Sometimes unexpected results are obtained. We offer advice on how to check a simulation study when this occurs, and how to design and conduct the study to give results that are easier to check. Simulation studies should be designed to include some settings where answers are already known. Code should be written in stages and data generating mechanisms should be checked before simulated data are analysed. Results should be explored carefully, with scatterplots of standard error estimates against point estimates a surprisingly powerful tool. When estimation fails or there are outlying estimates, these should be identified, understood, and dealt with by changing data generating mechanisms or coding realistic hybrid analysis procedures. Finally, we give a series of ideas that have been useful to us in the past for checking unexpected results. Following our advice may help to prevent errors and to improve the quality of published simulation studies. We illustrate the ideas with a simple but realistic simulation study in Stata.
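The comparison of empirical and model-based standard errors mentioned above can be sketched as a minimal simulation study, estimating a normal mean, a setting where the answers are known:

```python
import random
import statistics

random.seed(2023)

def one_rep(n=50, mu=0.0, sigma=1.0):
    """One simulation repetition: estimate the mean of a normal sample,
    returning the point estimate and its model-based standard error."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    est = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return est, se

results = [one_rep() for _ in range(2000)]
ests = [e for e, _ in results]
ses = [s for _, s in results]

bias = statistics.mean(ests)      # should be near 0 in this known setting
emp_se = statistics.stdev(ests)   # empirical SE across repetitions
mod_se = statistics.mean(ses)     # average model-based SE
print(round(bias, 3), round(emp_se, 3), round(mod_se, 3))
```

A large gap between emp_se and mod_se, or a pattern in a scatterplot of the (est, se) pairs, is exactly the kind of signal the paper recommends chasing down.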

Drivers ​of COVID-19 deaths in the United States: A two-stage modeling approach
Kit Baum
Kit Baum (1) , Andrés Garcia-Suaza (2) , Miguel Henry (3) , Jesús Otero (4).
(1) - Boston College, (2) - Universidad del Rosario, (3) - Greylock McKinnon Associates, (4) - Universidad del Rosario

We offer a two-stage (time-series and cross-section) econometric modeling approach to examine the drivers behind the spread of COVID-19 deaths across counties in the United States. Our empirical strategy exploits the availability of two years (January 2020 through January 2022) of daily data on the number of confirmed deaths and cases of COVID-19 in the 3,000 U.S. counties of the 48 contiguous states and the District of Columbia. In the first stage of the analysis, we use daily time-series data on COVID-19 cases and deaths to fit mixed models of deaths against lagged confirmed cases for each county. As the resulting coefficients are county specific, they relax the homogeneity assumption that is implicit when the analysis is performed using geographically aggregated cross-section units. In the second stage of the analysis, we assume that these county estimates are a function of economic and sociodemographic factors that are taken as fixed over the course of the pandemic. Here we employ the novel one-covariate-at-a-time variable-selection algorithm proposed by Chudik et al. (Econometrica, 2018) to guide the choice of regressors. The second stage utilizes the SUR technique in an unusual setting, where the regression equations correspond to time periods in which cross-sectional estimates at the county level are available.

Use of Stata in modeling the determinants of work engagement
Paulina Hojda
Doctoral School of Social Sciences, University of Łódź, Poland

The research goal was to identify the determinants of work engagement. Two primary datasets provided by Eurofound through the European Working Conditions Survey were used. The data were gathered before and during the COVID-19 pandemic, which allowed the pandemic context to be included in the analysis. Additionally, some macroeconomic and other social variables were included, such as GDP per capita, the labour force participation rate, the unemployment rate, the level of social trust, the Doing Business Index, and the European Quality of Government Index. Stata, with its facilities for data cleaning and checking, made it possible to merge all variables from these complex datasets into one set with 115,608 observations and over one hundred variables from 34 European countries. During data preparation, some repetitive blocks of commands were applied. Stata's programmability helped in preparing the model, which used logistic regression. The dichotomous outcome (dependent) variable, engaged or not engaged in work, was modelled. The predictor variables of interest were those related to work, such as working conditions, occupational characteristics, and the level of human capital. The logistic command in Stata produced results as odds ratios, which were interpreted to assess the effect of the chosen predictors on the response variable and, consequently, to accept or reject the research hypotheses. The innovation of the presented analysis lies in its inclusion of macroeconomic and macrosocial variables and in its international and intersectoral scope. The presented logit model fills a research gap in the area of work engagement.

Heterogeneous difference-in-differences estimation
Enrique Pinzón, StataCorp

Treatment effects might differ over time and across groups that are treated at different points in time (treatment cohorts). In Stata 18, we introduced two commands that estimate treatment effects that vary over time and across cohorts. For repeated cross-sectional data, we have hdidregress; for panel data, we have xthdidregress. Both commands let you graph the evolution of treatment effects over time. They also allow you to aggregate treatment effects within cohort and time and to visualize these effects. I will show how both commands work and briefly discuss the theory underlying them.

A robust test for linear and log-linear models against Box-Cox alternatives
David Vincent, David Vincent Econometrics

The purpose of this presentation is to describe a new command xtloglin, which tests the suitability of the linear and log-linear regression models against Box-Cox alternatives. The command uses a GMM-based Lagrange Multiplier test, which is robust to non-normality and heteroskedasticity of the errors and extends the analysis by Savin and Würtz (2005) to panel data regressions after xtreg.


The Box-Cox transformation, first introduced by Box and Cox (1964), is a popular approach for testing the linear and log-linear functional forms, as both are special cases of the transformation. The usual approach is to estimate the Box-Cox model by maximum likelihood, assuming normally distributed homoskedastic errors, and to test the restrictions on the transformation parameter that lead to the linear and log-linear specifications using a Wald or likelihood-ratio test.
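For reference, the transformation itself and its two special cases, with the log limit at lambda = 0 making the family continuous in the transformation parameter:

```python
import math

def boxcox(y, lam):
    """Box-Cox transform: (y**lam - 1)/lam, with the log limit at
    lam = 0. lam = 1 gives y - 1 (the linear model up to a shift);
    lam = 0 gives ln(y) (the log-linear model)."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

print(boxcox(5.0, 1))               # 4.0: linear, up to a shift
print(round(boxcox(5.0, 0), 4))     # 1.6094 = ln(5)
print(round(boxcox(5.0, 1e-8), 4))  # 1.6094: continuous in lam
```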


Despite the popularity of this approach, the estimator of the transformation parameter does not only search for non-linearity; it is also drawn towards transformations that make the errors more normal, with constant variance. This can result in an estimate that favours log-linearity over linearity even though the true model is linear with non-normal or heteroskedastic errors. These issues are resolved by xtloglin, as the GMM estimator is consistent under less restrictive distributional assumptions.




  • Box, G. E., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211-243.

  • Savin, N. E., & Würtz, A. H. (2005). Testing the semiparametric Box–Cox model with the bootstrap. Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, 322-354.

Network regressions in Stata
Jan Ditzen
Free University of Bozen-Bolzano

Network analysis has become critical to the study of the social sciences. While several Stata programs are available for analyzing network structures, programs that execute regression analysis with a network structure are currently lacking. We fill this gap by introducing the nwxtregress command. Building on spatial econometric methods (LeSage and Pace 2009), nwxtregress uses MCMC estimation to produce estimates of endogenous peer effects, as well as own-node (direct) and cross-node (indirect) partial effects, where nodes correspond to cross-sectional units of observation, such as firms, and edges correspond to the relations between nodes. Unlike existing spatial regression commands (for example, spxtregress), nwxtregress is designed to handle unbalanced panels of economic and social networks. Networks can be directed or undirected, with weighted or unweighted edges, and they can be imported in a list format that does not require a shapefile or a Stata spatial weight matrix set by spmatrix. A special focus of the presentation will be the construction of the spatial weight matrix and the integration with Python to improve speed.
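Constructing a weight matrix from an edge list is conceptually simple; the sketch below row-normalizes a (possibly directed, weighted) adjacency matrix, the usual convention in spatial and network regressions. The node names and weights are made up, and this is not the nwxtregress implementation:

```python
def weight_matrix(nodes, edges, directed=True):
    """Build a row-normalized weight matrix W from an edge list
    (a, b, weight), so that each non-empty row of W sums to 1;
    isolated nodes keep a zero row."""
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    W = [[0.0] * n for _ in range(n)]
    for a, b, w in edges:
        W[idx[a]][idx[b]] = w
        if not directed:
            W[idx[b]][idx[a]] = w
    for row in W:
        s = sum(row)
        if s > 0:
            for j in range(n):
                row[j] /= s
    return W

edges = [("A", "B", 1.0), ("A", "C", 3.0), ("B", "C", 1.0)]
W = weight_matrix(["A", "B", "C"], edges)
print(W[0])  # [0.0, 0.25, 0.75]
```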

The joy of sets: Graphical alternatives to Euler and Venn diagrams
Nicholas J. Cox
Nicholas J. Cox (1) , Tim P. Morris (2)
(1) - Durham University, UK, (2) - MRC Clinical Trials Unit, UCL, UK

Given several binary (indicator) variables and intersecting sets, an Euler or Venn diagram may spring to mind, but even with only a few sets the collective pattern becomes hard to draw and harder to think about easily. In genomics and elsewhere, so-called upset plots (specialized bar charts for the purpose) have recently become popular as alternatives. This presentation introduces an implementation -upsetplot-, a complementary implementation -vennbar-, and associated minor extras and utilities. Applications include examination of the structure of missing data and of the co-occurrence of medical symptoms or any other individual binary states. These new commands are compared with previous graphical commands, both official and community-contributed, and both frequently used and seemingly little known.

Secondary themes include: data structures needed to produce and store results; what works better with -graph bar- and what works better with -twoway bar-; and the serendipity of encounters at Stata users' meetings.
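The quantity an upset-style plot displays is just the frequency of each distinct combination of the binary indicators. A minimal sketch, with hypothetical missingness indicators:

```python
from collections import Counter

def intersection_counts(records, names):
    """Tabulate the distinct combinations of binary indicators: the
    bar heights of an upset-style plot."""
    return Counter(tuple(int(bool(r[v])) for v in names)
                   for r in records)

# e.g., missingness indicators for three variables
data = [{"age": 1, "bmi": 0, "sbp": 0},
        {"age": 1, "bmi": 0, "sbp": 0},
        {"age": 1, "bmi": 1, "sbp": 0},
        {"age": 0, "bmi": 0, "sbp": 1}]
counts = intersection_counts(data, ["age", "bmi", "sbp"])
print(counts[(1, 0, 0)])  # 2: 'age missing only' is the modal pattern
```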

geoplot: A new command to draw maps
Ben Jann
Institute of Sociology, University of Bern

geoplot is a new command for drawing maps from shape files and other datasets. Multiple layers of elements such as regions, borders, lakes, roads, labels, and symbols can be freely combined and the look of elements (e.g., color) can be varied depending on the values of variables. Compared to previous solutions in Stata, geoplot provides more user convenience, more functionality, and more flexibility. In this talk I will introduce the basic components of the command and illustrate its use with examples.
