General thoughts

1 Motivation

The purpose of a statistical analysis plan (SAP) - which will often also contain a (partial) data management plan - is to improve the quality of studies. The SAP will often be based on (or an appendix to) a protocol where the overall outline of a study is presented with fewer technical details.

1.1 The research question - what is the estimand?

Writing a detailed SAP forces the researcher/study group to think clearly about what the aim of the study is (which estimand[1] needs to be estimated?), and how to achieve this aim (which estimator can provide the best estimate?). Possibly, the group will come to the conclusion that a reliable estimate cannot be made with the data at hand, in which case they can abandon the study and save themselves (along with reviewers, editors, taxpayers, and other innocent bystanders) a lot of time and resources which can then be spent on something fruitful.

Figure 1: xkcd.com/1838

Speaking from experience, it is not uncommon that a researcher wants an answer to a question that is so vague that several different analyses could be carried out, and all be said to provide a relevant answer to the overall question.

A question like “What is the occurrence of dementia in individuals with chronic kidney disease (CKD)?” can be a good overall question, but there is no unique answer to that because the question is not specific. It could be interpreted in several ways:

  1. Among people with CKD living in Denmark today, how large a proportion also have dementia?
    • This question could be answered using a cross-sectional design.
  2. Among people who lived with CKD in Denmark 10 years ago, how large a proportion have had dementia since?
    • This question could be answered with a cohort design, using appropriate time-to-event methods to take censoring and the competing risk of death into account.
  3. Among people with incident CKD in the period 2010-2025 without prevalent dementia at the time of CKD, how large a proportion have developed dementia since?
    • Again, a cohort design with time-to-event methods could be used to answer this question, but notice that it will be a different cohort compared to the above.

Even these questions are not completely clear. The first question interprets “occurrence” as “prevalence”, while the second and third aim to provide estimates of an incompletely defined “risk”, seeing that risk strictly speaking only makes sense if a time frame is also specified, e.g., 10-year risk.

If the question is not clear before the answer is sought, there is a significant risk of p-hacking[2] or HARKing.[3]

1.2 Definitions and data

Once it is clear what the specific research question is, i.e., what the estimand is, it is also relevant to consider how the population and the individual variables are defined.

Continuing with the example of CKD and dementia, there will be several ways to identify these conditions from registries. Therefore, even if a SAP has been written in great detail - i.e., considerations on how to handle missing data are made, relevant subgroups are specified, estimation methods are described, table shells are ready to be populated, etc. - it is still important to also describe how a population with “incident CKD”, say, can be identified from registries. Likewise, it needs to be specified how “dementia” and any other variable necessary for the analyses should be defined.

Defining populations and variables is data management, not statistical analysis per se. That does not make specification of these aspects less important; it is just to point out that a data management plan is also essential.

1.3 Two documents or one?

The data management plan and the SAP can often be written as one coherent document with no explicit distinction between the two parts. However, it can still be relevant to keep in mind that data management and statistical analyses are in principle separate parts/phases of a study. In international/multicenter studies, it is generally advisable to use a common data model, so that analytic scripts (scripts needed to carry out statistical analyses as specified in the SAP) can be shared, ensuring the same methods are applied at all centers. To facilitate this, each center must provide a data set that complies with certain rules (specific variable names, types, formats, …) specified by the coordinating center.
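
As an illustration, a local data set could be checked against the coordinating center's specification before the shared scripts are run. Below is a minimal sketch in Python with pandas; the variable names and rules are hypothetical, not part of any standard:

    import pandas as pd

    # Hypothetical common data model: required variable names mapped to
    # required pandas dtypes, as specified by the coordinating center.
    CDM_SPEC = {
        "patient_id": "int64",
        "index_date": "datetime64[ns]",
        "exposure": "object",  # e.g., "SGLT2i" or "GLP-1RA"
    }

    def check_cdm(df: pd.DataFrame, spec: dict = CDM_SPEC) -> list:
        """Return a list of deviations from the common data model."""
        problems = []
        for name, dtype in spec.items():
            if name not in df.columns:
                problems.append(f"missing variable: {name}")
            elif str(df[name].dtype) != dtype:
                problems.append(f"{name}: expected {dtype}, got {df[name].dtype}")
        problems += [f"unexpected variable: {c}" for c in df.columns if c not in spec]
        return problems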

However, data management will generally have to differ between centers at some initial level. At the most basic practical level, registries and their variables will have different names and structures. There can also be qualitative differences, e.g., primary vs. secondary care data, granularity of diagnosis/procedure/… codes, or precision of time variables,[4] all of which may require different approaches between centers.

In the following, it will be assumed that the data management plan is incorporated into the SAP.

2 Elements of the SAP

2.1 Log of changes

Near the start of the SAP there should be a log of changes, documenting any changes made after the data management or analyses are started. The log should contain the dates when changes were made, what the changes were, and the reasons for them. This both serves as a reminder within the group of why decisions were made, and, if the statistician is replaced, lets the new statistician get a quick overview of the history of the project (and find reasons for deviations between what is actually done and what was specified in the protocol).[5]

Table 1: Log of changes

  Date       | Change                                                                                                             | Reason
  20/06/2023 | NA (first version)                                                                                                 | NA
  08/12/2023 | HRs estimated by Cox regression replaced by RRs estimated from the Aalen-Johansen estimator                       | Non-proportional hazards
  31/03/2024 | Sensitivity analysis added where CKD is defined by eGFR persistently below 60 mL/min/1.73 m² for at least 90 days | Reviewer 2 was critical of our initial definition of CKD. We maintain our original definition for the primary analyses.

2.2 Background and aims / objectives

A brief description of why the study is important, and what the aims or objectives are.

Consider who the reader is. If this is for a statistician to implement, 2 pages of biology, chemistry and/or anatomy are not helpful. The point is not to copy everything from the protocol (if one exists), but to add the necessary context for the analyses.

2.3 Miscellaneous methods

It is likely that there are several pieces of information that are relevant to write down but are rather matter-of-fact in nature and need no further explanation. These may include (see also the configuration sketch after the list):

  • A list of the registries to be used.
  • Study period and recruitment period: if these are specified exactly once, they only need to be changed once in the document if they are revised while running the study.
  • Time: how long is a month (e.g., 28, 30 or 30.5 days) and/or a year (e.g., 360, 365 or 365.25 days), if these are relevant units of time. Consider if a year should be the same as 12 months if both units are used in the project.
  • Exposure groups: If there is an exposure/intervention/… and a control/comparison/reference… group, consider specifying these. E.g., the exposure group is SGLT2i-users and the control group is GLP-1RA-users.
  • If the independent variable of primary interest is continuous, consider specifying the reference value, e.g., comparing the risk of dementia for different eGFR values, with 60 mL/min/1.73 m² as the reference.
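
To make "specify exactly once" concrete, such parameters can live in a single configuration module that every analytic script imports. A minimal sketch in Python; all values are hypothetical:

    from datetime import date

    # Study parameters, specified exactly once and imported everywhere.
    STUDY_START = date(2010, 1, 1)
    STUDY_END = date(2025, 12, 31)
    DAYS_PER_MONTH = 30      # one "month" in this project
    DAYS_PER_YEAR = 365.25   # one "year"; note: not equal to 12 * DAYS_PER_MONTH
    EXPOSURE_GROUP = "SGLT2i"
    CONTROL_GROUP = "GLP-1RA"
    EGFR_REFERENCE = 60      # reference value, mL/min/1.73 m^2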

2.4 Study population and index date

List in detail how the study population is derived from the raw registry data. This will often require defining an index date from raw data (make sure it is clear what the index date is), and then a series of in- and exclusion criteria to be applied on this index date.[6]

Prepare a figure shell for a flowchart, and make sure the order of the criteria in the figure aligns with the order in which the criteria are listed in the text. In some situations it can be relevant to leave some of the steps used to go from raw data to an index date out of the flowchart; however, all exclusion criteria applied after the index date is defined should be included in the flowchart by default.
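
One way to keep the text and the flowchart aligned is to implement the criteria as a single ordered sequence and record the count after each step, so the flowchart can be populated directly. A minimal pandas sketch; the criteria and variable names are hypothetical:

    import pandas as pd

    def apply_criteria(df, criteria):
        """Apply (label, keep-function) pairs in order; print flowchart counts."""
        print(f"Raw population: n = {len(df)}")
        for label, keep in criteria:
            df = df[keep(df)]
            print(f"After '{label}': n = {len(df)}")
        return df

    # Hypothetical criteria, in the same order as listed in the SAP text.
    criteria = [
        ("index date in study period",
         lambda d: d["index_date"].between("2010-01-01", "2025-12-31")),
        ("age 18 or above on index date", lambda d: d["age"] >= 18),
        ("no prevalent dementia", lambda d: ~d["prevalent_dementia"]),
    ]

    study_pop = apply_criteria(pd.DataFrame({
        "index_date": pd.to_datetime(["2012-05-01", "1999-01-01"]),
        "age": [54, 67],
        "prevalent_dementia": [False, False],
    }), criteria)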

2.5 Variables

Depending on the study design, different types of variables need to be defined. A table of the codes used to define each variable should always be supplied. This table can be a separate file or an appendix within the same document. (See also Section 2.12 below.)

Some variables will likely have different roles, and it might be relevant to have separate paragraphs outlining (as relevant):

  • how different levels of the exposure variable will be defined,
  • how the outcome is defined,
    • if time-to-event data: list competing and censoring events,
  • other covariates to be reported.

Note that rather than listing all variables in the text, it can be advantageous to point to a table shell (typically for Table 1 describing baseline characteristics), and state that these are the covariates to be included.

2.6 Statistical analyses

This is the key element of the statistical part of the SAP.

Some aspects need little explanation. E.g., in most studies, populating the table shell for Table 1 with baseline characteristics does not require any particular explanation. It might be relevant to state explicitly that continuous variables (except calendar time) will be reported by their median and interquartile interval (Q1-Q3); that for dichotomous variables only one level will be reported (e.g., only the number of individuals with heart failure, not those without); and that for multilevel categorical variables all levels will be reported.
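
As an illustration, these reporting conventions can be written as small helper functions so they are applied uniformly. A minimal sketch in Python with pandas (hypothetical, not tied to any particular table-making package):

    import pandas as pd

    def describe_continuous(s: pd.Series) -> str:
        """Median and interquartile interval (Q1-Q3), as specified above."""
        q1, med, q3 = s.quantile([0.25, 0.50, 0.75])
        return f"{med:.1f} ({q1:.1f}-{q3:.1f})"

    def describe_dichotomous(s: pd.Series) -> str:
        """Report only one level, e.g., individuals with heart failure."""
        n = int(s.sum())
        return f"{n} ({100 * n / len(s):.1f}%)"

    print(describe_continuous(pd.Series([58, 63, 71, 66, 80])))  # 66.0 (63.0-71.0)
    print(describe_dichotomous(pd.Series([True, False, True])))  # 2 (66.7%)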

The analyses that need to be done should be specified (see also the sketch after this list). Things to consider:

  • which type of (regression) model, if any, should be used,
  • how should variables be included into the models (e.g., continuous variables as splines, or using some other transformation?),
  • how should assumptions be assessed, and what should be done if they do not hold,
  • which non-parametric methods (Kaplan-Meier, Aalen-Johansen, …) should be used,
  • how should missing data be handled (this can be a section on its own),
  • how will uncertainty be estimated if it cannot simply be extracted from standard output,
  • bias analyses to handle unmeasured confounding (possibly move this to sensitivity analyses),
  • sensitivity- and subgroup analyses should have their own sections, but changes from the main analyses could be mentioned here, e.g., we might not assess assumptions for a sensitivity analysis, if they appear to hold for the primary analysis,
  • which software packages to use (R, SAS, Stata, …).
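
As an illustration of the level of detail that helps the implementer, here is a minimal sketch of a non-parametric Aalen-Johansen analysis like the one mentioned in Table 1, using the lifelines package in Python. The data and variable names are hypothetical; the equivalent could be specified for R, SAS, or Stata:

    import pandas as pd
    from lifelines import AalenJohansenFitter

    # Hypothetical analytic data: follow-up time in years and an event code
    # (0 = censored, 1 = dementia, 2 = death, i.e., the competing risk).
    df = pd.DataFrame({
        "years": [1.2, 3.4, 5.0, 0.8, 4.1, 2.2, 6.3, 0.5],
        "event": [1, 0, 2, 1, 0, 2, 1, 0],
    })

    # Cumulative incidence ("risk") of dementia, with death treated as a
    # competing risk rather than a censoring event.
    ajf = AalenJohansenFitter()
    ajf.fit(df["years"], df["event"], event_of_interest=1)
    print(ajf.cumulative_density_)

Fitting one curve per exposure group and taking the ratio of the cumulative incidences at a fixed time point would give the RR from Table 1.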

To reduce the risk of p-hacking, it is a good idea to pre-specify interim results to evaluate before moving on. E.g., the size of the population and the number of exposed might be worth sharing within the group before looking into baseline characteristics. If the numbers are off, something needs to be changed. Likewise, it can be a good idea to produce and share the table with descriptive characteristics before looking into outcome analyses (if relevant), again to see if something is off before the first version of the main results of the study is known. In this way the risk of rerunning analyses until the desired results are found is reduced.

2.7 Missing data

If missing data are expected, it will generally be a good idea to write something about this in advance: not just how missing data will be handled (which can be outlined under statistical analyses), but also which variables are expected to contain missing data and to what extent.
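
Once data are in hand, the observed missingness can be tabulated and compared with what the SAP anticipated. A minimal pandas sketch with hypothetical data:

    import pandas as pd

    def missingness(df: pd.DataFrame) -> pd.Series:
        """Fraction of missing values per variable, most affected first."""
        return df.isna().mean().sort_values(ascending=False)

    # E.g., smoking status is expected to be partly missing.
    print(missingness(pd.DataFrame({
        "age": [54, 67, 71],
        "smoking": [None, "never", None],
    })))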

2.8 Stratified analyses

Specify if the analyses should be repeated within strata of some variables. Also specify if the results should be compared across strata, i.e., if effect-heterogeneity is assessed in some formal way, e.g., by estimating a parameter for an interaction term, with eyeballing being the implicit alternative.
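
If a formal assessment is chosen, it helps to spell out exactly what "a parameter for an interaction term" means in the chosen model. A minimal sketch using statsmodels in Python, with hypothetical toy data and variable names:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: outcome, exposure, and a stratification variable.
    df = pd.DataFrame({
        "dementia": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
        "exposure": [0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0],
        "sex":      [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1],
    })

    # "exposure * sex" expands to exposure + sex + exposure:sex; the
    # coefficient on exposure:sex is the formal assessment of
    # effect-heterogeneity, rather than eyeballing stratum estimates.
    model = smf.logit("dementia ~ exposure * sex", data=df).fit()
    print(model.summary())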

2.9 Sensitivity analyses including subgroup analyses

Having gotten this far, some arbitrary choices will surely have been made, and the impact of some of these might be relevant to consider in sensitivity analyses. Examples could be:

  • use of secondary diagnoses for outcome identification,
  • length of follow-up,
  • washout- or grace periods,
  • recruitment period,
  • lookback period for variables,
  • in- or exclusion of specific patient groups.

If residual confounding is a concern, restriction to specific subgroups can be used.[7] E.g., if obesity, alcohol consumption, smoking or other variables that are hard to measure are important confounders, an analysis restricted to individuals who are known or presumed to be similar with respect to these variables (have a record compatible with obesity, alcoholism, COPD, etc.) may be important to run.[8]

Splitting and reanalyzing the population on calendar time can be relevant. Perhaps the demographics of the population or the indication of a treatment changes over time, introducing a shift in the confounding pattern. However, calendar time in itself is generally hard to interpret as an effect modifier. At best it is a proxy for some other variable. Therefore calendar time stratification arguably belongs under sensitivity analyses rather than stratified analyses.

2.10 Table shells

Make a set of tables that are empty but otherwise ready to be published. There should be a table shell for each table to be included in the manuscript or supplementary materials of the publication.

2.11 Figure shells

As for table shells, there should be a specification of each figure to be included in the publication. Remember to make a figure shell for the flow chart!

It is generally harder to make shells for figures than for tables, so consider copying/linking to figures from other publications that can be used as templates. Specify axis labels, colors, etc. if they need to be specific.[9]

2.12 Coding table

Provide all (diagnosis/procedure/ATC/SNOMED/NPU/…) codes needed to define the study population, exposure, outcome, and covariates, in a structured manner. The coding table can be a separate file (e.g., a spreadsheet) or part of the SAP. Make it easy to copy/import codes from the coding table to analytic scripts so they do not have to be transcribed manually.
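
A minimal sketch of the import step, assuming the coding table is kept as a CSV export of the spreadsheet with one row per code (the file, column names, and registry data below are hypothetical):

    import pandas as pd

    # Coding table structured as in Table 2; in practice read it from the
    # shared file, e.g., codes = pd.read_csv("coding_table.csv").
    codes = pd.DataFrame({
        "variable": ["CKD", "CKD", "T2DM/diagnoses"],
        "code": ["DN18", "DN19", "DE11"],
    })

    # Pull all CKD codes without transcribing them by hand.
    ckd_codes = tuple(codes.loc[codes["variable"] == "CKD", "code"])

    # Flag registry records whose diagnosis starts with any CKD code,
    # so that, e.g., "DN184" matches "DN18".
    dnpr = pd.DataFrame({"diagnosis": ["DN184", "DE11", "DN199"]})
    dnpr["ckd"] = dnpr["diagnosis"].str.startswith(ckd_codes)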

Columns to include in the coding table could be:

  • explicit variable names[10] (typically not necessary but can be relevant when using a common data model across sites),
  • informal variable names,[11]
  • data source from which the codes are pulled, e.g., the Danish National Patient Registry (DNPR),
  • the actual codes to be used, e.g., DN18,
  • patient and diagnosis type (applicable to the Danish National Patient Registry): in-, out-, or ER-patients; primary or secondary diagnoses,
  • lookback from index date,
  • notes on issues that are relevant for a specific variable but do not warrant a column in themselves, e.g., thresholds for biomarkers defining conditions like CKD (eGFR) or type 2 diabetes mellitus (HbA1c), or that a diagnosis code for mycosis fungoides must be made at a department of dermatology to be included.

Further structure can be added as in Table 2 where variables are sorted by their role (exposure, in-/exclusion, outcomes, etc.).

Table 2: Suggested structure for a coding table

  Variable                      | Data source           | Codes | Patient type | Diagnosis types    | Lookback | Notes
  Exposure                      |                       |       | NA           | NA                 | NA       |
    SGLT2i                      | Prescription registry |       |              |                    |          |
    GLP-1RA                     | Prescription registry |       |              |                    |          | Exclude brand names Saxenda and Wegovy
  In-/exclusion                 |                       |       |              |                    |          |
    T2DM/glucose lowering drugs | Prescription registry |       | NA           | NA                 | 1 year   |
    T2DM/HbA1c                  | Laboratory registry   |       | NA           | NA                 | 3 years  | Any HbA1c > … indicates T2DM
    T2DM/diagnoses              | Patient registry      |       | All          | Primary, secondary | 10 years |
    Recent plague or cholera    | Patient registry      |       | All          | Primary, secondary | 90 days  |
  Outcomes                      |                       |       | Inpatient    | Primary            |          | Only at a department of infectious diseases
    Plague                      | Patient registry      |       |              |                    |          |
    Cholera                     | Patient registry      |       |              |                    |          |
  Comorbidities                 |                       |       |              |                    |          |
  Comedication                  |                       |       |              |                    |          |
  Biomarkers                    |                       |       |              |                    |          |

2.13 References

List of literature referenced in the SAP.

3 Pitfalls to avoid

  1. Avoid repetitions.
    • Whenever possible, specify things exactly once; then there is exactly one place to change when revising. This means the statistician/statistical programmer does not accidentally read the recruitment period in the one place you overlooked when revising.
  2. Do not repeat codes from the coding table in the text; this serves no purpose when they are already specified in the coding table (see also point 1).
  3. Avoid circular reasoning/definitions. It can be necessary to refer to a later section of the SAP, but beware that this later section does not point back so that, e.g., the index date is defined as the date of the index date.

4 Simple tip for specifying the order of inclusion-/exclusion criteria

Draw timelines of hypothetical patients’ records in registries. How might the timing of records of various conditions, relative to each other and to the recruitment period, affect who gets in- and excluded? Who would you want to include, and who not? How can you order the in- and exclusion criteria to achieve this? Of particular importance is the point in the ordering of your inclusion-/exclusion criteria at which the index date is defined. Once the index date is set, it is generally of minor importance how subsequent criteria are ordered, from the point of view of the data manager.

Consider this description of a study population from a hypothetical protocol:

We will include individuals diagnosed with herpes zoster between 2000 and 2020 using the Danish National Patient Registry. All patients are required to have been diagnosed at a department of dermatology to be included. Patients will be included at the time of their first diagnosis.

Beyond having a diagnosis of herpes zoster, the protocol outlines three key elements to consider when settling on the order of the inclusion-/exclusion criteria:

  • calendar time of diagnosis,
  • department of dermatology, and
  • first observation per patient.

These three criteria can be ordered in 3! = 6 different ways. The ordering above follows the order in which they are mentioned in the protocol text, but that might not be optimal.

To get a feeling for whom you want to include and how to achieve that, you might consider different hypothetical patients with a diagnosis code for herpes zoster, as in Figure 2.

Figure 2: Hypothetical herpes zoster diagnosis patterns

If we had started by extracting all diagnosis codes from the Danish National Patient Registry and then applied the three criteria in the order given above, we would

  1. ignore the first diagnosis of patients 1, 4 and 6 (they are before the study period - patient 6 is excluded),
  2. ignore the diagnoses given at non-dermatology departments (this removes patient 5), and
  3. include all remaining patients at their first (remaining) diagnosis at a department of dermatology.

So, for this ordering of inclusion-/exclusion criteria, all patients who had a diagnosis at a department of dermatology at some point during the study period would be included, regardless of their other diagnoses. That might be reasonable for a descriptive study, e.g.,

  • What characterizes patients with herpes zoster at departments of dermatology in Denmark?

but perhaps less so in a cohort study e.g.,

  • What is the prognosis after incident herpes zoster?

where you only want to include incident cases, so at least patient 4 should also be excluded. Depending on the positive predictive value of diagnosis codes for herpes zoster at non-dermatology departments, patients 1 and 3 can be in- or excluded.

If specificity is prioritized, i.e., you want to increase certainty of the condition being incident, then you probably want to exclude patients with a prevalent diagnosis at a non-dermatology department at the cost of population size, including only patient 2 from Figure 2. To do this, you can use the ordering:

  1. restrict to the first observation per patient (this excludes no patient); the time of this diagnosis marks the index date,
  2. exclude patients with index dates outside of the study period (this excludes patients 1, 4 and 6),
  3. exclude patients not seen at a dermatology department on the index date (this excludes patients 3 and 5).

Because they come after selection of the index date, the ordering of the latter two steps is irrelevant when it comes to the final study population. However, there might still be a natural ordering of these steps from a clinical point of view. Keep this in mind when specifying the inclusion-/exclusion sequence.

If a more sensitive approach is used, e.g., because you don’t trust diagnosis codes from non-dermatology departments, you might argue that including patients 1-3 is reasonable within the scope of the project. This can be done by reordering the criteria as follows:

  1. include observations from dermatology departments only (patient 5 is excluded),
  2. restrict to the first diagnosis per patient, this marks the index date,
  3. restrict to index dates within the study period (patients 4 and 6 are excluded).
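
The two orderings can be written directly as data management steps. A minimal pandas sketch on hypothetical records constructed to mimic Figure 2 (one row per diagnosis); it reproduces the selections described above, i.e., only patient 2 under the specificity-first ordering and patients 1-3 under the sensitivity-first ordering:

    import pandas as pd

    # Hypothetical diagnoses mimicking Figure 2: patient, date, department.
    dx = pd.DataFrame({
        "patient": [1, 1, 2, 3, 3, 4, 4, 5, 6],
        "date": pd.to_datetime([
            "1998-05-01", "2005-03-01",  # patient 1
            "2003-07-01",                # patient 2
            "2002-01-01", "2006-09-01",  # patient 3
            "1999-02-01", "2010-06-01",  # patient 4
            "2004-04-01",                # patient 5
            "1997-09-01",                # patient 6
        ]),
        "derm": [False, True, True, False, True, True, True, False, True],
    })
    start, end = pd.Timestamp("2000-01-01"), pd.Timestamp("2020-12-31")

    # Specificity first: the first observation ever defines the index date,
    # then restrict to the study period and to dermatology departments.
    first = dx.sort_values("date").groupby("patient").head(1)
    spec = first[first["date"].between(start, end) & first["derm"]]
    print(sorted(spec["patient"]))  # [2]

    # Sensitivity first: dermatology diagnoses only, then the first per
    # patient marks the index date, then restrict to the study period.
    first_derm = dx[dx["derm"]].sort_values("date").groupby("patient").head(1)
    sens = first_derm[first_derm["date"].between(start, end)]
    print(sorted(sens["patient"]))  # [1, 2, 3]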

Footnotes

  [1] There are possibly several estimands.

  [2] xkcd on p-hacking.

  [3] Wikipedia on HARKing.

  [4] Examples from Danish registries include laboratory data, which hold dates and time stamps (hours and minutes), whereas the National Health Insurance Service Registry only includes week numbers.

  [5] In Table 1 an entry is made for the completion of the first version of the SAP for timeline purposes. This is not necessary and can be omitted.

  [6] Sometimes there will be more than one important date. E.g., for a population with post-surgical infection, it might be necessary to apply some criteria on the date of the surgery and others on the date of the infection.

  [7] Note the difference between a subgroup analysis and a stratified analysis.

  [8] Depending on the strength of the confounder, it could be considered whether the primary analysis should be within the restricted population, coming at the cost of generalizability. This consideration should be made before running the analyses.

  [9] It will often be quick to change these parameters, but if you know in advance what they should be, you might as well spell it out explicitly.

  [10] I.e., the name the variable should have in the analytic script, e.g., gld_180d. If formal variable names are used, they have to conform to the statistical software package used.

  [11] I.e., the name as you would refer to it in text, e.g., ‘glucose lowering drugs within 180 days’.