probability of assignment to trial arms

Institutional Review Board

III. Informed Consent Guidance - FDA Regulated Studies

Investigators who conduct research involving investigational drugs may be asked by commercial sponsors to conform to GCP guidelines, including the guidelines for consent documents ( http://www.fda.gov/downloads/Drugs/Guidances/ucm073122.pdf ). The International Conference on Harmonization has published a list of the 20 required elements for consent forms used in studies of investigational pharmaceutical agents. Pharmaceutical sponsors write consent forms to meet the GCP standard.

Note: The GCP document of required elements for consent is not a regulatory requirement in the United States. FDA regulations on consent do not require all consent elements recommended by GCP guidance.

These required elements under GCP are:

(a) That the trial involves research.

(b) The purpose of the trial.

(c) The trial treatment(s) and the probability for random assignment to each treatment.

(d) The trial procedures to be followed, including all invasive procedures.

(e) The participant's responsibilities.

(f) Those aspects of the trial that are experimental.

(g) The reasonably foreseeable risks or inconveniences to the participant and, when applicable, to an embryo, fetus, or nursing infant.

(h) The reasonably expected benefits. When there is no intended clinical benefit to the participant, the participant should be made aware of this fact.

(i) The alternative procedure(s) or course(s) of treatment that may be available to the participant, and their important potential benefits and risks.

Note: FDA regulations do not require a list of benefits and risks associated with alternatives to participation.

(j) The compensation and/or treatment available to the participant in the event of trial-related injury.

(k) The anticipated prorated payment, if any, to the participant for participating in the trial.

(l) The anticipated expenses, if any, to the participant for participating in the trial.

(m) That the participant's participation in the trial is voluntary and the participant may refuse to participate or withdraw from the trial, at any time, without penalty or loss of benefits to which the participant is otherwise entitled.

(n) That the monitor(s), the auditor(s), the IRB/IEC [Institutional Ethics Committee], and the regulatory authority(ies) will be granted direct access to the participant's original medical records for verification of clinical trial procedures and/or data, without violating the confidentiality of the participant, to the extent permitted by applicable laws and regulations and that, by signing a written informed consent form, the participant or the participant's legally acceptable representative is authorizing such access.

(o) That records identifying the participant will be kept confidential and, to the extent permitted by the applicable laws and/or regulations, will not be made publicly available. If the results of the trial are published, the participant's identity will remain confidential.

(p) That the participant or the participant's legally acceptable representative will be informed in a timely manner if information becomes available that may be relevant to the participant's willingness to continue participation in the trial.

(q) The person(s) to contact for further information regarding the trial and the rights of trial participants, and whom to contact in the event of trial-related injury.

(r) The foreseeable circumstances and/or reasons under which the participant's participation in the trial may be terminated.

(s) The expected duration of the participant's participation.

(t) The approximate number of participants involved in the trial.

Note: This is an optional element in FDA regulations, guided by whether including this information could influence enrollment.

The GCP standards for consent forms may be found at the following web site: http://www.fda.gov/cder/guidance

Clinical Trial Randomization Tool

For clinical trials, educational purposes, or just for your own interest.
Uses MTI randomization to generate the allocation sequence.
Default values are provided below. You may adjust these as you require or prefer.
When you click “Request Confirmation Email,” your request will be sent to a server. You will receive an email when your download is ready (typically in a few minutes).
For more detailed instructions, you may view the Tool Instructions page.

Basic Trial Info

A simple trial design is used by default. For more complex trials, change the trial settings below to customize arms and/or stratification.

You have selected maximal as the randomization method. This method is supported only for two-arm trials with an arm allocation ratio of 1:1.

These ratio values will be simplified to 1:1 when you save this section.

The MTI must be at least double the largest arm ratio value, which is 1, so the MTI will be increased to 2 when you save this section.

Based on the number of categories for your 1 stratification variable, your trial will have 1 stratum. For each stratum, you will receive one worksheet containing 1,000 assignments.

Your results will use the following names for each stratum's sequence worksheet. Due to length restrictions in Microsoft Excel, category names may be truncated.

Algorithm Parameters

The following parameters should work well for most trials, but can be further customized.

Select a method to use for randomization. All methods shown use MTI randomization and are suitable for use. For more information about these different methods, consult the Learn About Randomization page.

The most under-assigned arms are always favored by a balance-forcing probability that varies based on the current degree of imbalance.

Neither arm is favored until the MTI threshold is reached — each arm's probability of assignment is always 0%, 50%, or 100%, based on the current imbalance. The Big Stick method is a special case of Chen's procedure where the balance-forcing probability is 50%.

This method is supported only for two-arm trials with an arm allocation ratio of 1:1.

The under-assigned arm is always favored by a preset balance-forcing probability, e.g., 60%. Chen's procedure is a generalized version of Big Stick where the balance-forcing probability can be greater than 50%.

At any point in the randomization process, for any stratification group, the number of participants assigned to any one trial arm will never exceed the number assigned to any other arm by more than this value.

The MTI can be any integer between twice the largest allocation ratio and 20.

At any point when the arms are imbalanced, this is the percent probability that the next enrollment goes to the arm that currently has fewer enrollments. When both arms have an equal number of enrollments, the probability is 50% for each arm. If assignment to the larger arm would violate the MTI, the probability is 100% and the next enrollment goes to the smaller arm.
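The assignment rule described above can be sketched in a few lines of Python. This is only an illustration for a two-arm 1:1 trial, not the tool's actual implementation: the function names `next_assignment` and `generate_sequence`, the arm labels "A" and "B", and the seeding scheme are all assumptions introduced here. Setting `p_balance=0.5` gives the Big Stick behaviour, and values above 0.5 give Chen's procedure as described in the text.

```python
import random


def next_assignment(n_a, n_b, mti, p_balance, rng):
    """Pick the arm ("A" or "B") for the next participant in a two-arm 1:1 trial.

    n_a, n_b  -- current number of participants on each arm
    mti       -- largest allowed difference between the two arm counts
    p_balance -- probability of favouring the smaller arm when imbalanced
    """
    diff = n_a - n_b
    if diff == 0:
        # Balanced arms: 50% to each arm.
        return "A" if rng.random() < 0.5 else "B"
    smaller, larger = ("B", "A") if diff > 0 else ("A", "B")
    # Assigning to the larger arm would raise the imbalance to abs(diff) + 1.
    if abs(diff) + 1 > mti:
        return smaller  # forced assignment: probability 100%
    return smaller if rng.random() < p_balance else larger


def generate_sequence(n, mti=2, p_balance=0.6, seed=0):
    """Generate n assignments and report the largest imbalance observed."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0}
    sequence = []
    worst_imbalance = 0
    for _ in range(n):
        arm = next_assignment(counts["A"], counts["B"], mti, p_balance, rng)
        counts[arm] += 1
        sequence.append(arm)
        worst_imbalance = max(worst_imbalance, abs(counts["A"] - counts["B"]))
    return sequence, worst_imbalance


if __name__ == "__main__":
    seq, worst = generate_sequence(20, mti=3, p_balance=0.6, seed=42)
    print("".join(seq), "largest imbalance:", worst)
```

Because the smaller arm is forced whenever the larger arm would exceed the threshold, the reported largest imbalance can never be greater than `mti`, whatever seed is used.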

Contact Info

The Clinical Trial Randomization Tool is a web application developed by the National Cancer Institute (NCI) and the National Institutes of Health (NIH) to help researchers generate randomization sequences for their clinical trials.

A study statistician should be consulted in conjunction with the design and implementation of your randomization scheme to ensure it is appropriate for your clinical trial. NCI and NIH are not responsible for how any randomization sequence is used in a clinical setting.

An Information Theoretic Approach for Selecting Arms in Clinical Trials

Pavel Mozgunov, Thomas Jaki, An Information Theoretic Approach for Selecting Arms in Clinical Trials, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 82, Issue 5, December 2020, Pages 1223–1247, https://doi.org/10.1111/rssb.12391

The question of selecting the ‘best’ among different choices is a common problem in statistics. In drug development, our motivating setting, the question becomes, for example, which treatment gives the best response rate. Motivated by recent developments in the theory of context-dependent information measures, we propose a flexible response-adaptive experimental design based on a novel criterion governing treatment arm selections which can be used in adaptive experiments with simple (e.g. binary) and complex (e.g. co-primary, ordinal or nested) end points. It was found that, for specific choices of the context-dependent measure, the criterion leads to a reliable selection of the correct arm without any parametric or monotonicity assumptions and provides noticeable gains in settings with costly observations. The asymptotic properties of the design are studied for different allocation rules, and the small sample size behaviour is evaluated in simulations in the context of phase II clinical trials with different end points. We compare the proposed design with currently used alternatives and discuss its practical implementation.

Over recent decades, a variety of methods for clinical trials aiming to select the ‘optimal’ arm (e.g. dose, combination of treatments and treatment regimen) have been proposed in the literature (see for example O’Quigley et al. (2017) for a recent review of novel methods). Given m arms, the aims of phase I and phase II clinical trials are often to select the target arm (TA): the arm whose toxicity probability is closest to the maximal accepted target, $0 < \gamma_t < 1$, or (and) whose efficacy probability is closest to the target efficacy, $0 < \gamma_e \leq 1$, where higher values of $\gamma_e$ correspond to more effective arms. Despite the similar problem formulation for phase I (evaluating toxicity) and phase II (evaluating efficacy) trials, quite different approaches are generally used.

In phase I dose escalation trials, designs assuming a monotonic dose–toxicity relationship have been shown to have good operating characteristics in the context of single-agent trials (Iasonos et al., 2016; Clertant and O’Quigley, 2017). There is, however, considerable uncertainty in the toxicity ordering for clinical trials investigating combinations of agents or when considering different treatment schedules (Wages et al., 2011). Methods based on a monotonicity assumption are of limited use for such trials. To overcome this issue and to relax the monotonicity assumption, some specialized approaches have been proposed; see for example Riviere et al. (2015) for a review of recent methods for combination trials, and Wages et al. (2014) and Guo et al. (2016) for approaches to dose–schedule studies. The majority of novel phase I methods relaxing the monotonicity assumption rely either on a complex parametric model or on explicit orders of toxicity. Although such methods allow borrowing information between treatment arms, they might fail to find the TA in trials with a large number of potential orderings and a limited sample size. Furthermore, the majority of such designs consider a single binary end point only whereas more complex outcomes are becoming more frequent in dose finding trials; see Lee et al. (2019) for an example with multiple toxicity grades, and Thall and Cook (2004) for examples of trials with multinomial outcomes assuming a monotonic dose–toxicity relationship. Despite this, methods for studies with non-binary outcomes relaxing the monotonicity assumption are sparse to date.

Whereas relaxing the assumption of monotonicity between treatment arms in phase I studies is relatively novel, designs that consider arms independently have been proposed for a long time in the phase II setting (see for example Stallard and Todd (2003), Koenig et al. (2008) and Magirr et al. (2012)). Williamson et al. (2017) have recently advocated designs maximizing the expected number of responses in small population trials. As a result, adaptive randomization methods and optimal multiarm bandit approaches are starting to be considered as appropriate candidates to fulfil this objective. Although multiarm bandit designs outperform other well-established methods in terms of the expected number of successes, they can suffer low statistical power for testing comparative hypotheses (Villar, Bowden and Wason, 2015). This problem corresponds to the ‘exploration vs exploitation’ trade-off (Azriel et al., 2011). Solutions to tackle this balance and to achieve a high power while still assigning the majority of patients to superior treatments are an emerging topic in the multiarm bandit field. In particular, randomized versions of the optimal multiarm bandit designs (Villar, Wason and Bowden, 2015), and approaches fixing the allocation of patients to the control arm (Villar, Bowden and Wason, 2015; Villar, Wason and Bowden, 2015; Williamson et al., 2017; Villar et al., 2018) were proposed. These methods have primarily been developed for binary end points and selecting the TA corresponding to the highest response probability and, as a result, cannot be applied to a problem of selecting the arm with the arbitrary target probability, $\gamma_e$, such as studies looking to select ED80, the dose giving 80% of the maximum efficacy. At the same time, multiarm bandit approaches for non-binary end points, e.g. for multinomial (Glazebrook, 1978), normal (Jones, 1970, 1975) and exponential (Gittins et al., 2011) end points, have been known for a long time but have only recently started to be explored in more detail for application in clinical trials (Smith and Villar, 2018; Williamson and Villar, 2019).

Although current guidelines generally recommend single end points for primary analyses of confirmatory clinical trials, it is recognized that certain settings require inference on multiple end points for comprehensive conclusions on treatment effects (Ristl et al., 2018). Consequently, phase II clinical trials evaluating several end points, e.g. toxicity and efficacy end points, co-primary efficacy end points or nested efficacy end points, have started to attract attention in the literature (Song, 2015; Zhou et al., 2017). Although formal testing for a difference in treatment responses remains the main focus of designs proposed for such trials, maximizing the number of patients receiving the superior treatment is also of crucial importance, specifically in small population trials. Despite that, response-adaptive designs for settings with multiple end points have not been extensively studied yet.

This work is motivated by several phase I and phase II clinical trials which could benefit from an experimental design that does not require a parametric or monotonicity assumption between arms, and to which the authors contributed as statistical collaborators. One of them is the ‘TAILoR’ trial (Pushpakom et al., 2019), which considered three active arms and placebo with the primary objective to find whether the response of at least one active arm is significantly different from that of the placebo group. The second objective was to find the optimal arm defined as the arm with the largest difference compared with placebo. The original study employed a two-stage design in which half of the patients were equally randomized to four arms initially before a selection of all promising arms was undertaken. This design is expected to lead to a reliable answer to the first question but will result in a low number of patients on the optimal arm. Therefore, response-adaptive designs such as the multiarm bandit are of interest. Multiarm bandit approaches, however, can result in a failure to answer the primary goal of the trial. Therefore, a design that can balance these objectives is of interest.

The research problem that was described above can be considered as the general issue of correct selection of the TA whose response probability is closest to the percentile, 0 < γ ⩽ 1. Importantly, an investigator aims to assign the majority of patients to the TA but has limited information about the dependences between arms.

In this work, we propose a general response-adaptive experimental design for studies with multinomial outcomes to solve a generic problem of selecting the TA under ethical constraints (e.g. maximize the number of patients on the superior arm) and when each observation is costly. Based on the theory of weighted (or context-dependent) information measures (Belis and Guiasu, 1968; Kelbert et al., 2016), we propose to use the gain in information (found as a difference of the Shannon differential entropy and the weighted Shannon differential entropy) as a criterion for the decision making in clinical trials. The approach proposed enables incorporation of the context of the outcomes (e.g. avoid high toxicity or low efficacy) in the information measures themselves. This is achieved by assigning a greater ‘weight’ to the information that is obtained about arms with desirable characteristics. Through specifying an arbitrary parametric weight function, the approach can be applied to various experiments with (ethical) constraints tailored for the specific investigator’s needs. In this work, two families of weight functions with a particular interest in arms whose response probabilities are in the neighbourhood of γ are considered in more detail. We show that, subject to appropriate tuning, the design employing the derived criteria allocates each patient to the treatment that is estimated to be the best while taking into account the uncertainty about the estimates for each arm and can lead to better operating characteristics than do alternative approaches. This leads to fulfilling statistical goals of the experiment under the ethical constraints.

The idea of applying information theoretic concepts, and specifically the Shannon entropy (Shannon, 1948), to govern treatment selection dates back to the work by Klotz (1978) who introduced the maximum entropy constrained balance randomization design which seeks to maximize the Shannon entropy subject to the expected imbalance. This and related ideas of using the Shannon entropy, however, have received little attention in the literature until very recently when other designs for clinical trials using the information gain principle have been proposed (see for example Barrett (2016) and Kim and Gillen (2016)). These works, however, employ the standard definitions of information measures and include the ethical considerations through additional constraints on the information theoretic criteria derived. The need for these constraints arises as standard measures of information do not depend on the value of the outcomes themselves but only the corresponding probabilities of these outcomes (Kelbert and Mozgunov, 2017). Therefore, they are called ‘context free’ (Kelbert et al., 2016). Although the context-free nature gives the notion of information great flexibility which explains its successful application in various fields, it might be also considered as a drawback in many application areas such as clinical trials. It was found that the ‘context’ of the experiment can be included in the information measures directly by using a weight function (Belis and Guiasu, 1968; Kelbert and Mozgunov, 2015) that gives more value to the points of specific interest. Based on this, a phase I–II dose finding clinical trial design with trinary outcomes that utilizes an information gain criterion has been developed by Mozgunov and Jaki (2019). Furthermore, similar arguments and weight functions were used to derive a loss function for phase I dose escalation trials with binary responses (Mozgunov and Jaki, 2020).

The current work builds on these recent developments and expands the ideas in the following ways. Firstly, we consider a generic setting with multinomial outcomes and study a family of weight functions parameterized by the newly introduced penalization parameter κ that generalizes the criteria that were used by Mozgunov and Jaki (2019, 2020) and extends the potential applications beyond dose finding clinical trials. Secondly, we propose an asymptotically unbiased and consistent estimator of the criterion derived and study the theoretical properties of the design based on this criterion. Finally, we propose a unified framework for using the weighted information gain to govern the treatment selection that can be used with an arbitrary parametric weight function that is specific to the ethical considerations of a given experiment.

The remainder of the paper is organized as follows: derivations of the criterion and assignment rules are given in Section 2. The procedure for finding a robust optimal value of the penalization parameter κ of the design proposed is given in Section 3. The design is applied to the motivating setting of phase II in Section 4 and to a trial with co-primary efficacy end points in Section 5. We conclude with a discussion in Section 6.

The programs that were used to provide the results can be obtained from https://academic.oup.com/jrsssb/issue/.

2.1 Selection criteria

Our central proposal is to use this measure $\Delta_{n_j}$ to govern the arm selection in a sequential experiment. The weight function that is used to compute the gain in information can be of different forms to reflect the question that an investigator is interested in and to define the ‘value’ of the information in different areas of the simplex $S^d$. The weight function should, therefore, be set in line with the objectives of the clinical trial. Drawing a parallel with the multiarm bandit approaches, equation (5) can be also interpreted as defining an index for allocating sampling observations to each arm. In this work, we shall consider two families of weight functions that are suitable for two different clinical settings. First, we focus on a family of weight functions for the sensitive estimation question as above, and we introduce a second family accounting for minimum and (or) maximum thresholds in Section 5.

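Theorem 1. Let $h(f_{n_j})$ and $h^{\phi_{n_j}}(f_{n_j})$ be the standard and weighted differential entropies of equation (2) with weight function (6) corresponding to arm $j$. Let $\lim_{n_j \to \infty} x_j^{(i)}(n_j)/n_j = \alpha_j^{(i)}$, $i = 1, 2, \ldots, d$, and $\sum_{i=1}^{d} x_j^{(i)} = n_j$; then

$$\Delta_{n_j} = O\left(\frac{1}{n_j^{1-2\kappa}}\right) \quad \text{as } n_j \to \infty \text{ if } \kappa < \tfrac{1}{2};$$

$$\Delta_{n_j} = -\frac{1}{2}\left\{\sum_{i=1}^{d} \frac{(\gamma^{(i)})^2}{\alpha_j^{(i)}} - 1\right\} n_j^{2\kappa-1} + \omega(\alpha_j, \gamma, \kappa, n_j) + O\left(\frac{1}{n_j^{\eta(1-\kappa)-\kappa}}\right) \quad \text{as } n_j \to \infty \text{ if } \kappa \geq \tfrac{1}{2},$$

where

$$\omega(\alpha_j, \gamma, \kappa, n_j) = \sum_{u=3}^{\eta} \frac{(-1)^{u-1}}{u}\, n_j^{u\kappa-u+1} \left\{\sum_{i=1}^{d} \frac{(\gamma^{(i)})^u}{(\alpha_j^{(i)})^{u-1}} - 1\right\}$$

and $\eta = \lfloor (1-\kappa)^{-1} \rfloor$.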

All proofs are provided in the on-line supplementary materials.

The information gain $\Delta_{n_j}$ tends to 0 for $\kappa < 1/2$, which implies that assigning a value of information with a rate that is less than 1/2 is insufficient to emphasize the importance of the context of the study. However, the limit is non-zero for $\kappa \geq 1/2$. Following the conventional information gain approach, one would like to make a decision that maximizes the statistical information in the experiment. The leading terms of the information gain $\Delta_{n_j}$ are always non-positive, and for any fixed n the asymptotic terms achieve the maximum value 0 at the point $\alpha_j^{(i)} = \gamma^{(i)}$, i = 1, …, d (all constants are cancelled out). This reflects the fact that, by adding one more research question into the information measure through the weight function, the uncertainty in the experiment (in terms of the differential information measure) is increased. There is no additional uncertainty when the answers to both research questions coincide. Therefore, it follows that collecting more information about the arm which has characteristics $\alpha_j$ close to the target γ (the ethical constraint of the experiment) implies maximization of the information gain $\Delta_{n_j}$. Consequently, each patient tends to be assigned to the TA, and the criterion $\Delta_{n_j}$ is a patient’s gain criterion (Whitehead and Williamson, 1998). It will be further demonstrated that, for certain values of the parameter κ, $\Delta_{n_j}$ also takes into account the statistical uncertainty of the arm and achieves the goal of the trial under ethical constraints. Therefore, we propose to use the information gain $\Delta_{n_j}$ for the arm selection in a sequential experiment.

Whereas the term in brackets reflects how close the vector of the parameters $\alpha_j$ is to the vector of the target characteristics, γ, the balance in the ‘exploration vs exploitation’ trade-off is controlled by the term $n_j^{2\kappa-1}$ reflecting the penalty on the number of observations on the same arm. A larger number of patients on an arm makes it less desirable to be chosen. Therefore, as the experiment progresses the design requires an increasing level of confidence that the arm selected is the TA. Increasing values of κ correspond to a greater penalty on the number of patients allocated to a specific arm and hence are expected to lead to a more spread-out allocation. This corresponds to a greater interest in the statistical power of the experiment. In contrast, $\kappa = 1/2$ corresponds to no penalty and is of particular interest in trials with small sample sizes. We shall refer to κ as the penalization parameter. The penalization term on the number of observations in a given arm is of growing interest in reinforcement learning, where it is considered as a way to address the exploration–exploitation trade-off similarly to the problem considered (see for example an overview of the related literature by Browne et al. (2012)).
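To make the role of the penalty concrete, the leading term of theorem 1 for κ ⩾ 1/2 can be evaluated numerically. The sketch below is an illustration only: it keeps just the first term of the expansion and ignores ω and the remainder, and the function name `leading_term` and the example values of α, γ, n and κ are assumptions chosen here rather than values taken from the paper.

```python
def leading_term(alpha, gamma, n, kappa):
    """Leading term of the information gain in theorem 1 (case kappa >= 1/2).

    alpha -- outcome probabilities of the arm (positive, summing to 1)
    gamma -- target outcome probabilities (summing to 1)
    n     -- number of observations on the arm
    kappa -- penalization parameter
    """
    if kappa < 0.5:
        raise ValueError("this expansion applies only for kappa >= 1/2")
    if any(a <= 0 for a in alpha):
        raise ValueError("all components of alpha must be positive")
    # Term in brackets: zero when alpha equals gamma, positive otherwise.
    closeness = sum(g * g / a for g, a in zip(gamma, alpha)) - 1.0
    # Penalty on the number of observations already made on this arm.
    return -0.5 * closeness * n ** (2.0 * kappa - 1.0)


if __name__ == "__main__":
    gamma = (0.8, 0.2)  # hypothetical target for a binary end point
    arms = {"on target": (0.8, 0.2), "off target": (0.5, 0.5)}
    for kappa in (0.5, 0.75):
        for n in (16, 64):
            for name, alpha in arms.items():
                value = leading_term(alpha, gamma, n, kappa)
                print(f"kappa={kappa} n={n} {name}: {value:.4f}")
```

With κ = 1/2 the value for a given arm does not change with n, whereas with κ = 0.75 the off-target arm is penalized more heavily as n grows, which matches the description of κ as a penalty on repeated allocation to the same arm.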

2.2 Estimation

Whereas the desirable characteristics of the TA, γ, are known and fixed before the trial, selection criterion (7) also depends on the true unknown parameters, $\alpha_j$. Below, we propose an estimator of selection criterion (7).

Consider a discrete set of m arms, $A_1, \ldots, A_m$, associated with $\alpha_1, \ldots, \alpha_m$ and $n_1, \ldots, n_m$ observations. Arm $A_{j^*}$ is optimal if $\delta^{(\kappa)}(\alpha_{j^*}, \gamma) = \inf_{j=1,\ldots,m} \delta^{(\kappa)}(\alpha_j, \gamma)$. To estimate $\delta^{(\kappa)}(\alpha_j, \gamma)$, consider a random variable $\tilde{\delta}_{n_j}^{(\kappa)} \equiv \delta^{(\kappa)}(Z_{n_j}, \gamma)$ with $Z_{n_j}$ having Dirichlet distribution (2). Theorem 2 shows that $\tilde{\delta}_{n_j}^{(\kappa)}$ is asymptotically unbiased, consistent and asymptotically normal.

Theorem 2. Let $\bar{Z}$ be a standard Gaussian random variable and $\tilde{Z}_{n_j} = \Sigma_j^{-1/2}(Z_{n_j} - \alpha_j)$ be a random variable with probability density function $\tilde{f}_{n_j}$, where the probability density function of $Z_{n_j}$ is given in equation (2) with $\lim_{n_j \to \infty} x_j^{(i)}(n_j)/n_j = \alpha_j^{(i)}$ for $i = 1, 2, \ldots, d$, $\sum_{i=1}^{d} x_j^{(i)} = n_j$, and $\Sigma_j$ is a d-dimensional square matrix with elements $\Sigma_j[uv] = \alpha_j^{(u)}(1 - \alpha_j^{(u)})/n_j$ if $u = v$ and $\Sigma_j[uv] = -\alpha_j^{(u)}\alpha_j^{(v)}/n_j$ if $u \neq v$. Let $\tilde{\delta}_{n_j}^{(\kappa)} = \delta^{(\kappa)}(Z_{n_j}, \gamma)$, $\nabla\delta^{(\kappa)}(z, \gamma) = (\partial\delta^{(\kappa)}(z, \gamma)/\partial z^{(1)}, \ldots, \partial\delta^{(\kappa)}(z, \gamma)/\partial z^{(d)})^{T}$ and $\bar{\delta}_{n_j}^{(\kappa)} = \bar{\Sigma}_j^{-1/2}\{\delta^{(\kappa)}(Z_{n_j}, \gamma) - \delta^{(\kappa)}(\alpha_j, \gamma)\}$, where $\bar{\Sigma}_j = \nabla_{\alpha_j}^{T}\Sigma_j\nabla_{\alpha_j}$ and $\nabla_{\alpha_j} \equiv \nabla\delta^{(\kappa)}(z, \gamma)$ evaluated at $z = \alpha_j$. Then $\lim_{n_j \to \infty} E(\tilde{\delta}_{n_j}^{(\kappa)}) = \delta^{(\kappa)}(\alpha_j, \gamma)$, $\lim_{n_j \to \infty} V(\tilde{\delta}_{n_j}^{(\kappa)}) = 0$ and $\bar{\delta}_{n_j}^{(\kappa)}$ converges weakly to $\bar{Z}$.

2.3 Assignment rules

Estimator (9) is used to govern the selection between arms during the experiment and summarizes the arm’s characteristics. It can be applied to different types of sequential experiments. We consider two assignment rules: a deterministic ‘select-the-best’ rule, and a randomization rule that randomizes patients to arms. These rules follow the setting of the motivating clinical trials. For example, the deterministic rule prioritizes exploitation over exploration and can be used in phase I trials evaluating toxicity where the randomization to all doses might not be ethical (an example is provided in the on-line supplementary materials) or in the phase II setting if the goal of maximizing the number of successes is prioritized (considered in Section 4). The randomization rule could be favoured when an investigator is primarily interested in a high statistical power (Section 4).

2.3.1 Deterministic ‘select-the-best’ rule

2.3.2 Randomization rule

2.4 Design consistency

Although a large sample size is never achieved in early phase clinical trials, the consistency condition of the design ensures that the approach provides a more reliable selection of the TA as the sample size increases. The consistency condition for the proposed design under two assignment rules under the weight function $\phi_n(\cdot)$ is given in theorem 3.

Theorem 3. Consider the experimental design with a selection criterion based on $\tilde{\delta}_{n_j}^{(\kappa)}$, m arms and true probability vectors $\alpha_j$, $j = 1, \ldots, m$. Then, (a) the design is consistent under the randomization rule for κ ⩾ 0.5 and (b) the design is consistent under the deterministic rule for κ > 0.5.

Note that, under the deterministic rule, κ = 0.5 leads to a lack of consistency of the design. The effect of this will be considered in the setting of phase I clinical trials with a small sample size (see the on-line supplementary materials) and in the setting of phase II clinical trials with moderate sample sizes.

2.5 An alternative weight function

The weight function $\phi_{n_j}(\cdot)$ above can be a suitable choice when an investigator is interested in the TA with particular characteristics. At the same time, alternative research questions (e.g. composite) can be of interest to an investigator in the trial. For example, in some clinical trials, lower and/or upper bounds on the characteristics of interest can be imposed. The information theoretic approach proposed can also be applied to such more complex questions. We provide an example below.

Consider a trial in which we are still interested in the TA as close as possible to γ but only if these characteristics are ‘sufficiently close’ to the target. For example, in the setting of a phase II clinical trial with binary responses (Section 4), the goal can be formulated as ‘to select the TA with the highest response probability that is above the minimum efficacy bound ψ’. One of the possible weight functions reflecting these trial objectives can be formulated as follows.

The inclusion of the boundary value in the weight function for $n_j = 100$, κ = 0.5, the minimum efficacy value ψ = 0.70 and different numbers of responses is demonstrated in Fig. 1 with the target efficacy γ = 1.

Fig. 1. Exact information gains by using the weight function $\phi_{n_j}(\cdot)$ and the weight function $\phi_{n_j}^{*}(\cdot)$ with the minimum efficacy value $\psi = 0.7$ for various values of the responses $x_j = 60, 65, \ldots, 100$

The gains in information for the various weight functions are nearly the same for the number of outcomes $x_j \geq 70$ corresponding to an estimated probability of efficacy of $\hat{p}_j \geq 0.70$, and both information gains are still maximized for the highest efficacy probability. However, when the estimated probability falls below the minimum efficacy threshold, the information gain $\Delta_{n_j}^{*}$ decreases noticeably faster than $\Delta_{n_j}$. As a result, $\Delta_{n_j}^{*}$ enables better discrimination between efficacious ($p_j > \psi$) and inefficacious arms. Importantly, the information gain $\Delta_{n_j}^{*}$ can still distinguish treatment arms with an estimated efficacy probability of less than 0.70 because of the underlying uncertainty. Note that, although it was found that $\Delta_{n_j}$ tends to a non-positive value, its exact value for moderate sample sizes can be above 0 as demonstrated in Fig. 1. Nevertheless, a larger value of the criterion still corresponds to a more promising treatment and therefore can be used to discriminate between arms. We shall consider how the weight function with the boundary values affects the performance of the design in more detail in Section 4.

3.1 Procedure

The penalization parameter κ controls the exploration–exploitation trade-off. Therefore, the choice of the optimal value of κ (e.g. in the sense of maximizing the expected number of successes in a trial, ENS) is crucial. As follows from the proof of theorem 3, the optimal value depends on the sample size, number of treatment arms and the true probabilities of the response of all treatment arms.

As the true probabilities of response are unknown, we propose an approach to finding the robust optimal value of κ that does not require knowing the true probabilities of response and leads to nearly optimal characteristics in the absence of prior information on the response probabilities. For a given optimality criterion, the approach builds on the algorithm by Clertant and O’Quigley ( 2017 ) and takes the following form.

(a) Define a set of Z scenarios, $S_1, \ldots, S_Z$, where a scenario is the set of parameters $\alpha_j$ defining the distribution of outcomes for the treatment arm j.

(b) Define the quantity of interest $q(\kappa)$ and the objective function $g\{q(\kappa)\}$.

(c) Obtain $q(\kappa \mid S_z)$ for all κ on the prespecified grid and all z = 1, …, Z.

(d) Find the optimal value $\kappa_{\mathrm{opt}} = \arg\min_{\kappa} \frac{1}{Z} \sum_{z=1}^{Z} g\{q(\kappa \mid S_z)\}$.

Such a procedure results in a robust optimal design with the parameter κ opt that optimizes the objective function g (·). In this work, we shall consider two objective functions g { q ( κ )}, with quantities of interest q (·) corresponding to different aims in the exploration–exploitation trade-off.
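The steps of the procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `robust_kappa`, the toy scenarios and the toy quantity of interest `toy_q` are hypothetical placeholders, whereas in the paper q(κ | S z) is obtained by evaluating the design under each scenario.

```python
from statistics import mean

def robust_kappa(kappa_grid, scenarios, q, g):
    """Return the grid value of kappa minimizing the scenario-averaged objective.

    q(kappa, scenario) -> quantity of interest q(kappa | S_z)
    g(value)           -> objective function, to be minimized on average
    """
    def avg_objective(kappa):
        return mean(g(q(kappa, s)) for s in scenarios)
    return min(kappa_grid, key=avg_objective)

# Grid kappa = 0.50, 0.51, ..., 1.00 (built from integers to avoid float drift).
kappa_grid = [k / 100 for k in range(50, 101)]

# Hypothetical stand-ins: each "scenario" is a single number s, and the toy
# quantity of interest is the squared distance between kappa and s.
toy_scenarios = [0.55, 0.60, 0.65]
toy_q = lambda kappa, s: (kappa - s) ** 2
identity = lambda value: value

kappa_opt = robust_kappa(kappa_grid, toy_scenarios, toy_q, identity)
```

With these toy inputs the averaged objective is smallest at the grid point closest to the mean of the three scenario values. In a real calibration, `toy_q` would be replaced by a simulation of the design, which is where almost all of the computing time is spent.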

3.2 Objective functions

To find the robust optimal design parameter, we use the context with binary responses. Specifically, we shall consider two objective functions:

(a) maximizing ENS and

(b) achieving the prespecified level of power under the least favourable configuration (LFC).

Let $n_j(\kappa)$ be the total number of patients who are assigned to treatment arm j by using the design with parameter κ, let $p_j$ be the response probability for arm j and let $q_2^{\mathrm{FR}}(S_z)$ be the power that is attained by using fixed randomization (FR) under scenario $S_z$. The objective functions, and the corresponding quantities of interest, $q_1$ and $q_2$, are given in Table 1.

Table 1. Objective functions and corresponding quantities of interest for the criteria of maximizing ENS and achieving the prespecified level of power†

†LFC, least favourable configuration.

For the ENS-criterion, $\kappa^*_{S_z} = \arg\min_{\kappa}\{Q(\kappa \mid S_z)\}$ is the scenario-specific optimal κ, and the objective function $g_1(\cdot)$ minimizes the expected losses in ENS that are associated with the use of a parameter that is not the scenario-specific optimum. For the power criterion, the objective function is constructed to guarantee that, with a probability of at least ξ, the design will achieve 80% of the power that is attained by FR. Here, the power that is attained by using FR under scenario $S_z$, $q_2^{\mathrm{FR}}(S_z)$, normalizes across scenarios for the fixed sample size. We apply the procedure in Section 4.2 and evaluate its performance in Section 4.3 and Section 5.
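The power requirement described above can be illustrated with a short Python sketch. It assumes that the requirement is checked as the share of scenarios in which the design reaches 80% of the FR power, and that the smallest κ meeting the requirement is selected. The function name `min_kappa_for_power` and all numbers below are hypothetical, and the exact form of the objective function in Table 1 is not reproduced here.

```python
def min_kappa_for_power(kappa_grid, power_we, power_fr, xi=0.90, fraction=0.80):
    """Smallest kappa for which the design reaches `fraction` of the FR power
    in at least a share `xi` of the scenarios; None if no grid value does.

    power_we[kappa] -> list of simulated powers of the design, one per scenario
    power_fr        -> list of FR powers for the same scenarios
    """
    for kappa in sorted(kappa_grid):
        hits = [pw >= fraction * pf for pw, pf in zip(power_we[kappa], power_fr)]
        if sum(hits) / len(hits) >= xi:
            return kappa
    return None

# Made-up powers for two scenarios and three grid values, for illustration only.
power_fr = [0.80, 0.90]
power_we = {
    0.50: [0.50, 0.60],   # below 80% of the FR power in both scenarios
    0.60: [0.70, 0.60],   # reaches 80% of the FR power in scenario 1 only
    0.70: [0.70, 0.75],   # reaches 80% of the FR power in both scenarios
}
kappa_power = min_kappa_for_power([0.50, 0.60, 0.70], power_we, power_fr)
```

Dividing the achieved power by the FR power is what makes scenarios with very different effect sizes comparable for a fixed sample size.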

3.3 Computation of the quantities of interest

For the values of the penalization parameter κ on the grid κ = 0.50, 0.51, …, 1, finding the robust optimal values reduces to computing q ( κ ) for a given value of κ . Similarly to multiarm bandit approaches (Villar, Bowden and Wason, 2015 ), the challenge here is that an analytical expression for the allocation of patients cannot be found. It can, however, be computed recursively.

It is, however, known that this recursive procedure becomes computationally demanding or even infeasible as the sample size and (or) number of arms increases. Therefore, following Villar, Wason and Bowden ( 2015 ), we shall use Monte Carlo simulations to approximate this distribution. It was found that the Monte Carlo simulations provide an accurate approximation of the distribution of allocations and noticeable gains in computational time. A comparison of the exact computations and the Monte Carlo approximation for various values of n is provided in the on-line supplementary materials .
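A Monte Carlo approximation of this kind can be sketched in Python as follows. The sketch is generic: `greedy_rule` is a simple stand-in allocation rule chosen only to make the example runnable and is not the information-theoretic criterion of the paper, and all function names and the number of simulated trials are assumptions.

```python
import random

def simulate_trial(p_true, n, choose_arm, rng):
    """Run one trial of n patients; return (patients per arm, successes per arm)."""
    m = len(p_true)
    patients, successes = [0] * m, [0] * m
    for _ in range(n):
        j = choose_arm(patients, successes, rng)
        patients[j] += 1
        successes[j] += rng.random() < p_true[j]   # Bernoulli(p_j) response
    return patients, successes

def greedy_rule(patients, successes, rng):
    """Stand-in rule (NOT the WE criterion): highest smoothed success rate."""
    scores = [(s + 1) / (t + 2) for s, t in zip(successes, patients)]
    best = max(scores)
    return rng.choice([j for j, sc in enumerate(scores) if sc == best])

def monte_carlo(p_true, n, choose_arm, n_sim=2000, seed=1):
    """Approximate the mean allocation per arm and the ENS by simulation."""
    rng = random.Random(seed)
    m = len(p_true)
    mean_alloc, ens = [0.0] * m, 0.0
    for _ in range(n_sim):
        patients, successes = simulate_trial(p_true, n, choose_arm, rng)
        for j in range(m):
            mean_alloc[j] += patients[j] / n_sim
        ens += sum(successes) / n_sim
    return mean_alloc, ens

mean_alloc, ens = monte_carlo([0.3, 0.4, 0.5, 0.6], n=80, choose_arm=greedy_rule)
```

Any allocation rule with the same signature can be passed as `choose_arm`, so the same loop could be reused to compare designs. The precision of the approximation is governed by `n_sim`.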

4.1 Setting

Consider a phase II clinical trial whose goals are

(a) to find the most effective treatment and

(b) to treat as many patients as possible on the optimal treatment.

Similarly to the motivating trial, we consider m = 4 treatments. We assume that the primary end point is a binary measure of efficacy (e.g. response to treatment). Although there are various competing approaches that could be applied in the setting considered, we limit the comparison to two alternative designs that are known to have good statistical properties in terms of either the number of patients treated or the statistical power. Specifically, we compare with the Gittins index (GI) approach (using the discount factor of 0.99 and non-informative priors; see Gittins and Jones ( 1979 ) and Villar, Bowden and Wason ( 2015 ) for more detail), which is the nearly optimal design in terms of maximizing the expected number of successes, ENS, and will serve as a benchmark for this characteristic. Additionally we also compare with fixed and equal randomization, which is known to lead to high statistical power.

We consider two scenarios that were investigated by Villar, Bowden and Wason ( 2015 ). Scenario 1 investigates n = 423 and the true efficacy probabilities are (0.3, 0.3, 0.3, 0.5) whereas scenario 2 considers n = 80 with true efficacy probabilities (0.3, 0.4, 0.5, 0.6). Following Villar, Bowden and Wason ( 2015 ), we consider the hypothesis H 0 : p 0 ⩾ p i for i = 1, 2, 3 with the family-wise error rate calculated at p 0 = … = p 3 = 0.3, where p 0 corresponds to the control treatment efficacy probability. The Dunnett test (Dunnett, 1984 ) is used for hypothesis testing in the FR setting. The hypothesis testing for GI and the proposed information theoretic weighted entropy (WE) design is performed using an adjusted Fisher exact test (Agresti, 1992 ). The adjustment chooses the cut-off values to achieve the same type I error as FR. The Bonferroni correction is used for GI and WE designs to correct for multiple testing and the familywise error rate is set to be less than or equal to 5%. Characteristics of interest are

(a) the type I error rate α,

(b) statistical power 1 − η,

(c) the expected number of successes, ENS, and

(d) the average proportion of patients on the optimal treatment, p*.
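The pairwise comparisons with the control described above can be illustrated with a plain one-sided Fisher exact test and a Bonferroni threshold. This is only a sketch: it does not reproduce the adjustment of the cut-off values that is used in the paper, and the function name `fisher_one_sided` and the response counts are made up for illustration.

```python
from math import comb

def fisher_one_sided(x_trt, n_trt, x_ctl, n_ctl):
    """One-sided Fisher exact p-value: probability of observing at least x_trt
    responses on the treatment arm, given the margins, when both arms share
    the same response probability."""
    total_succ = x_trt + x_ctl
    denom = comb(n_trt + n_ctl, total_succ)
    upper = min(n_trt, total_succ)
    tail = sum(comb(n_trt, k) * comb(n_ctl, total_succ - k)
               for k in range(x_trt, upper + 1))
    return tail / denom

alpha_fwer = 0.05                        # familywise error rate
n_comparisons = 3                        # three experimental arms against control
threshold = alpha_fwer / n_comparisons   # Bonferroni-corrected level

# Hypothetical data: 14/20 responses on an experimental arm, 6/20 on control.
p_value = fisher_one_sided(14, 20, 6, 20)
reject = p_value <= threshold
```

In an adaptive design the arm sizes are random, which is one reason why the paper calibrates the cut-off values rather than relying on the nominal level used in this sketch.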

The design proposed requires a target value γ. Although in practice the target treatment effect can vary between therapeutic areas, we consider the general setting in which no specific value is specified, and the arm with the highest success probability is of interest. We, therefore, use the highest possible value of a target probability, γ = 0.999. Investigating the dependence of the operating characteristics of the design on the target value γ in more detail, it was found that this choice might lead to a marginal decrease in ENS compared with the setting in which the true maximum treatment effect is known, whereas fixing the target probability below the true maximum treatment effect can lead to a noticeable decrease in it; see the on-line supplementary materials. The vector of the prior mode probabilities $p^{(0)} = (0.99, 0.99, 0.99, 0.99)^{\mathrm{T}}$ is chosen to reflect no prior knowledge about which arm has the highest success probability and the equipoise principle (Djulbegovic et al., 2000). We choose $\beta_0 = 5$ for the observations on the control and $\beta_1 = \beta_2 = \beta_3 = 2$ for the experimental arms to reflect no prior knowledge for competing arms. The higher value of $\beta_0$ compared with $\beta_1$, $\beta_2$ and $\beta_3$ is intended to protect (to a certain extent) the allocation of patients to the control arm and to achieve a higher power. See Section 6 and the on-line supplementary materials for a more detailed discussion on the influence of prior assumptions on the operating characteristics. We fix κ = 0.5 for the randomized allocation rule and denote it by WE Ran, and we search for the optimal robust values of κ for each sample size under the deterministic ‘select-the-best’ rule denoted by WE Det as given below. The software in the form of R code to reproduce the findings of the work is available from https://github.com/adaptive-designs/inf-theory.

4.2 Choice of the robust optimal penalization parameter κ

The design proposed requires the specification of the penalization parameter κ . We apply the procedure in Section 3 for the two objective functions:

(a) maximizing ENS, and

(b) achieving a particular level of statistical power.

We shall apply the procedure to the deterministic allocation rule.

Firstly, Z = 5000 random scenarios with m = 4 treatment arms are generated. For the ENS-criterion, we assume a uniform distribution on the probability of response at each treatment arm, $p_j \sim U(0, 1)$, j = 1, 2, 3, 4. If there is some prior information on the plausible values of $p_j$, it could be employed at this stage. For the power criterion, we power the trial under the least favourable configuration and generate the response probabilities as $p_1 = p_2 = p_3 \sim U(0, 1)$ and $p_4 \sim U(p_1, 1)$. We specify the values of κ on the grid κ = 0.50, 0.51, …, 1 and conduct the procedure for the sample sizes that are considered in the examples (n = 80 and n = 423) as well as an intermediate value n = 165 (used in the example in Section 5). We use 5000 Monte Carlo simulations to approximate the distribution of patients for each κ under each scenario. For the power objective function, we require that 80% of the power of FR is achieved with probability at least 90%, i.e. ξ = 0.90. This requirement is to be satisfied with high probability over the 5000 random scenarios, which means that in any given scenario the power achieved can be either above or below 80% of the power that is achieved by the FR design. Note also that the procedure can be computationally expensive. For n = 423, for example, the full calibration procedure took around 70 h (Intel Core i7-8650U central processor unit at 1.90 GHz times eight) after being parallelized between five cores. One can reduce this time by reducing the number of Monte Carlo simulations and/or scenarios at the cost of lower precision in the optimal value of κ. The objective function $g_1(\cdot)$ and the quantity of interest $q_2$ for various values of κ are given in Fig. 2.
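The two ways of generating random scenarios described above can be sketched in Python as follows. The function names `ens_scenarios` and `lfc_scenarios` and the seeds are illustrative assumptions; only the sampling distributions are taken from the text.

```python
import random

def ens_scenarios(z, m=4, seed=0):
    """Z scenarios for the ENS criterion: p_j ~ U(0, 1) independently."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(m)] for _ in range(z)]

def lfc_scenarios(z, m=4, seed=0):
    """Z least-favourable-configuration scenarios for the power criterion:
    the first m - 1 arms share one draw from U(0, 1) and the last arm is
    drawn from U(p_1, 1), so that it is at least as good as the others."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(z):
        p_common = rng.random()
        p_best = rng.uniform(p_common, 1.0)
        scenarios.append([p_common] * (m - 1) + [p_best])
    return scenarios

ens_set = ens_scenarios(5000)
lfc_set = lfc_scenarios(5000)
```

If prior information on plausible response probabilities is available, the uniform draws could be replaced by draws from a narrower distribution at this stage.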

Fig. 2. (a) Values of the objective function $g_1$ divided by the sample size and (b) expected values of $q_2$ for various values of the penalization parameter κ and for sample sizes n = 80, n = 165 and n = 423; the robust optimal values of κ are marked for each sample size

For the maximum ENS-criterion, the optimal values of κ increase as the sample size increases. As expected, when optimizing the number of patients on the superior arms, a low value of the penalization parameter should be used if the sample size is small as more spread allocation will result in a decreased ENS. At the same time, for larger sample sizes, a low value of κ can result in allocating many patients to suboptimal arms, and the consequences of this become more severe as the sample size increases. Therefore, the values κ = 0.51 and κ = 0.56 will be used for sample sizes n = 80 and n = 423 respectively to achieve nearly maximum ENS.

Regarding the power criterion, greater values of κ correspond to greater power and, as a result, to a greater probability that the desirable power will be achieved. For various sample sizes, the optimal values are found to be in the interval (0.65, 0.73). The minimum values of κ for which the probability of attaining the target power is 90% are the robust optimal values and are used in the examples below when balancing ENS and the statistical power.

4.3 Results

The trade-off between the expected number of successes ENS and the statistical power for various values of the penalty parameter κ under the deterministic rule in both scenarios is illustrated in Fig. 3 .

Fig. 3. (a), (b) ENS and (c), (d) power (fixed cut-off value) for the WE design under the deterministic rule for various κ (broken line, value of the penalization parameter κ chosen by using the ENS-criterion; dotted line, value of κ chosen by using the power criterion): (a), (c) scenario 1; (b), (d) scenario 2

In both scenarios, greater values of κ correspond to greater power and lower ENS as the increase in penalty leads to more diverse allocations. The exception is κ ∈ (0.5, 0.55) in scenario 1, where the inconsistency for κ = 0.5 leads to locking in on the suboptimal treatment. Subsequently, we use the robust optimal κ that was found above for the ENS (the broken line) and power (the dotted line) criteria.

The operating characteristics of the considered designs in scenario 1 are given in Table 2. Under the null hypothesis, the performance of all methods is similar and the type I error is controlled. Under the alternative hypothesis, the WE Det design with calibrated optimal κ = 0.56 performs comparably with the GI in terms of ENS, with the GI design resulting in around two more responses on average. However, it increases power by nearly 18 percentage points because of the greater number of patients on the control, which is achieved through the penalization of the number of observations at each arm by using κ and through the chosen prior $\beta_0$. Nevertheless, the statistical power is relatively low and can be increased by using a higher value of the penalty parameter (κ = 0.65). This leads to an increase in the power from 0.61 to 0.85 at the cost of a slight (approximately 4%) decrease in ENS. In fact, WE Det then has comparable power with that of FR, while treating almost 40 more patients on the superior treatment. Another way to increase the statistical power is to use WE Ran, for which both the associated power and the ENS are higher than for FR.

Table 2. Operating characteristics of the WE design under the randomization rule (WE Ran) and under the deterministic rule (WE Det) for various κ (in parentheses), the GI design and FR in scenario 1 with n = 423 under the null and alternative hypotheses†

†Results are based on $10^4$ replicated trials. Standard errors are in parentheses.

The operating characteristics of the designs in scenario 2 with fewer patients and different probabilities of response under the alternative are given in Table 3 . Under the null hypothesis, all designs perform similarly in terms of ENS and all control the type I error at the 5% level. Under the alternative hypothesis, the GI and WE Det with κ = 0.51, again, yield the highest (and similar) ENS among all alternatives, but also low statistical power. Note that, for the difference of 35% in power for both approaches, the GI design corresponds to the highly conservative type I error, nearly 0%, against 5% for WE Det (0.51). WE Ran or increased κ for WE Det result in a considerable increase in power. Both designs have a greater (or similar) power and result in more ENS than does the FR.

Table 3. Operating characteristics of the WE design under the randomization rule (WE Ran), under the deterministic rule (WE Det) for various κ (in parentheses), the GI design and FR in scenario 2 with n = 80 under the null and alternative hypotheses†

Overall, the WE designs for the robust optimal values of κ perform comparably with, or with only minor differences from, the optimal GI design in terms of ENS, but with greater statistical power for both large and small sample sizes. Importantly, the WE design proposed uses an optimal robust value that was previously extensively calibrated and was found to yield beneficial operating characteristics subject to tuning of the penalization parameter. The trade-off between ENS and power can be tuned via the built-in parameter κ. Specifically, for greater values of κ or under the randomization rule, the WE designs can result in similar statistical power to that of FR, but with considerably greater ENS.

4.4 Application of the minimum efficacy bound weight function

The information theoretic design that was studied above targets the most effective arm. This design does not, however, take into account that the response probabilities can be too low to be useful. Consequently, the selection of arms should be severely penalized if the response rate is below a minimum clinically interesting value ψ. To account for this minimum efficacy value ψ, the weight function $\phi^*_n$ and the exact information gain criterion given in equation (13) can be used. In Table 4, we apply this information gain criterion under scenario 1 with sample size n = 423.

Table 4. Operating characteristics of the WE design by using the exact information gain under the randomization rule (WE Ran) and under the deterministic rule (WE Det) for κ = 0.50 in scenario 1 with n = 423 under the null and alternative hypotheses†

†Results are based on 2500 replicated trials. Standard errors are in parentheses.

Although, in an actual clinical trial, the minimum efficacy bound ψ will be determined by expert knowledge, we consider different bounds ψ = 0.0, 0.3, 0.4, 0.5, 0.6 to investigate how its value affects the operating characteristics. To track the influence of ψ , we study the designs by using a fixed value of κ = 0.50.

For both WE Ran and WE Det , the minimum efficacy bounds ψ ⩽ 0.30 result in similar operating characteristics to the gain in information with the weight function ϕ n without a minimum efficacy value as all treatment arms have greater efficacy probabilities compared with the bound. For ψ = 0.4 and ψ = 0.5, the first three arms are considered as inefficacious. Comparing with ψ = 0.30, the design results in a slightly higher power and in 5% more patients allocated to the superior arm. As ψ increases above the response probability of the superior arm, all treatment arms are considered as inefficacious, resulting in more spread allocations (as the gain in information is inflated for all arms) and lower ENS.

Overall, the design using the minimum efficacy bound weight function enables us to improve both power and ENS if the threshold is correctly specified. It can, however, also lead to a decrease in ENS if all the arms are considered as inefficacious.

5.1 Setting

In the previous example, a single binary end point was used. However, trials with co-primary efficacy end points are of growing interest in medical research (Zhou et al ., 2017 ). As the criterion proposed can be applied to a trial with an arbitrary number of discrete outcomes, in this section we investigate the performance of the novel response-adaptive design in a setting of a phase II trial in metastatic breast cancer that was considered by Song ( 2015 ).

In this phase II trial, the two key efficacy variables of interest were

(a) the tumour objective response rate (ORR) and

(b) the absence of deterioration in the ‘Global health status’ (GHS) of the European Organisation for the Research and Treatment of Cancer quality of life questionnaire core 30 in the first two cycles of treatment.

As outlined by Song ( 2015 ) both end points are ‘relatively rapidly observable’ which makes the application of a response-adaptive design suitable. Given two co-primary binary efficacy end points, the response that was observed in each patient has four categories:

(a) ORR and GHS,

(b) ORR and no GHS,

(c) no ORR and GHS, and

(d) no ORR and no GHS.

Extending the setting that was considered by Song (2015), who considered a single-arm trial, we investigate the behaviour of designs in a more general framework with two treatment arms (indexed by 1 and 2) and a control arm (a standard of care, indexed by 0). Following the sample size that was considered in the single-arm trial, 55 patients, the sample size in the three-arm trial is fixed to be n = 55 × 3 = 165. Although four outcomes can be observed in the trial, phase II trials are conventionally formulated in terms of the marginal probabilities of each binary event rather than in terms of the probabilities of joint events. Let $p_{\mathrm{orr},j}$ be the probability of the ORR and $p_{\mathrm{ghs},j}$ be the probability of GHS corresponding to treatment arm j. Motivated by the trial that was investigated by Song (2015), we consider the hypothesis $H_0: p_{k,0} = p_{k,j}$ for j = 1, 2 and k ∈ {orr, ghs}. As in the single-end-point example, the hypothesis testing is performed by using Fisher’s adjusted exact test, where the adjustment chooses the cut-off value to achieve a 5% type I error (Villar, Bowden and Wason, 2015). Again, the Bonferroni correction is used to ensure that the familywise error rate is less than or equal to 5%. Characteristics of interest are

(a) the average proportion of patients on the optimal treatment, p*, and

(b) the expected number of the ORR, ENS (the expected number of the GHS is suppressed for brevity).

5.2 Design specification and comparators

To adapt the novel criterion to the formulated context, we employ a reparameterization of the probabilities of joint events under the assumption of independence. Then, the target of the trial can be defined in terms of the target probability of the ORR, $\gamma_{\mathrm{orr}}$, and the target probability of GHS, $\gamma_{\mathrm{ghs}}$. We define the probabilities of events for arm j as $\alpha_j^{(1)} = p_{\mathrm{orr},j}\, p_{\mathrm{ghs},j}$, $\alpha_j^{(2)} = p_{\mathrm{orr},j}(1 - p_{\mathrm{ghs},j})$ and $\alpha_j^{(3)} = (1 - p_{\mathrm{orr},j})\, p_{\mathrm{ghs},j}$, and the corresponding targets as $\gamma^{(1)} = \gamma_{\mathrm{orr}} \gamma_{\mathrm{ghs}}$, $\gamma^{(2)} = \gamma_{\mathrm{orr}}(1 - \gamma_{\mathrm{ghs}})$ and $\gamma^{(3)} = (1 - \gamma_{\mathrm{orr}}) \gamma_{\mathrm{ghs}}$.
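The reparameterization above amounts to multiplying marginal probabilities, as the following Python sketch shows. The function name `joint_probabilities` and the marginal values 0.4 and 0.5 are illustrative assumptions; the fourth cell (no ORR and no GHS) is added only so that the four probabilities sum to 1.

```python
def joint_probabilities(p_orr, p_ghs):
    """Cell probabilities of the four joint outcomes, assuming independence."""
    return (
        p_orr * p_ghs,              # alpha^(1): ORR and GHS
        p_orr * (1 - p_ghs),        # alpha^(2): ORR and no GHS
        (1 - p_orr) * p_ghs,        # alpha^(3): no ORR and GHS
        (1 - p_orr) * (1 - p_ghs),  # remaining cell: no ORR and no GHS
    )

alpha = joint_probabilities(0.4, 0.5)    # illustrative marginals, not from the paper
target = joint_probabilities(1.0, 1.0)   # both marginal targets set to 1
```

When both marginal targets equal 1, all of the target mass falls on the first cell, so the design aims at arms for which both end points are as likely as possible.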

Following the single-agent example, we specify the parameters for the proposed response-adaptive design as follows. As the upper bound for the ORR and GHS is not defined, the target values $\gamma_{\mathrm{orr}} = \gamma_{\mathrm{ghs}} = 1$ are taken to ensure that the arm corresponding to the highest probability is chosen. Given the reparameterization, both probabilities are considered as beta random variables. The vectors of the prior mode probabilities $p_{\mathrm{orr}}^{(0)} = (0.99, 0.99, 0.99)^{\mathrm{T}}$ and $p_{\mathrm{ghs}}^{(0)} = (0.99, 0.99, 0.99)^{\mathrm{T}}$ are chosen to reflect equipoise. Again, we choose the following parameters of the beta distribution for both probabilities: $\beta_0 = 5$ to ensure enough observations on the control arm and $\beta_1 = \beta_2 = 2$ to reflect no prior knowledge for competing arms. As before, we fix κ = 0.5 for the randomized allocation rule and, for the deterministic rule, use the robust optimal values of κ for n = 165 obtained in Section 4.

We compare the performance with two alternative approaches: the first is a multiarm bandit approach that prioritizes the exploitation objective and, therefore, is expected to result in high ENS; the second is known to result in high power. The multiarm bandit approach seeking to obtain high ENS described below is referred to as ‘Max Prob’. As for the proposed information theoretic approach, under the independence assumption, we consider each efficacy end point as a beta random variable and assign each subsequent patient to the arm that corresponds to the maximum probability of having the highest $p_{\mathrm{orr}}$ and the highest $p_{\mathrm{ghs}}$ together (Wathen and Thall, 2017). Formally, the next patient is allocated to treatment arm $j^*$ such that $j^* = \arg\max_j [P\{p_{\mathrm{orr},j} = \max_i(p_{\mathrm{orr},i})\}\, P\{p_{\mathrm{ghs},j} = \max_i(p_{\mathrm{ghs},i})\}]$. The R package bandit is used to compute these probabilities (Lotze and Loecher, 2014). The design uses the same prior distribution as the design proposed. FR is used as a comparator that is expected to achieve high power.
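The Max Prob allocation rule above can be approximated by posterior sampling, as in the following Python sketch. It is not the computation that is performed by the R package bandit: the probabilities are estimated by Monte Carlo draws, a uniform Beta(1, 1) prior is assumed for every arm instead of the prior used in the paper, and the function names and interim counts are hypothetical.

```python
import random

def prob_best(successes, failures, n_draws=4000, rng=None, a0=1.0, b0=1.0):
    """Monte Carlo estimate of P(arm j has the largest response probability),
    with independent Beta(a0 + successes, b0 + failures) posteriors per arm."""
    rng = rng or random.Random(0)
    wins = [0] * len(successes)
    for _ in range(n_draws):
        draws = [rng.betavariate(a0 + s, b0 + f)
                 for s, f in zip(successes, failures)]
        wins[draws.index(max(draws))] += 1
    return [w / n_draws for w in wins]

def max_prob_arm(orr_succ, orr_fail, ghs_succ, ghs_fail, rng=None):
    """Arm maximizing P(best on ORR) * P(best on GHS)."""
    rng = rng or random.Random(0)
    p_orr = prob_best(orr_succ, orr_fail, rng=rng)
    p_ghs = prob_best(ghs_succ, ghs_fail, rng=rng)
    scores = [a * b for a, b in zip(p_orr, p_ghs)]
    return scores.index(max(scores))

# Made-up interim data for a control arm (index 0) and two treatment arms.
next_arm = max_prob_arm(orr_succ=[3, 4, 12], orr_fail=[12, 11, 3],
                        ghs_succ=[5, 6, 13], ghs_fail=[10, 9, 2])
```

Because the rule always picks the arm with the largest product, it concentrates patients on the apparently best arm quickly, which is consistent with its role as the exploitation-oriented comparator.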

5.3 Results

The operating characteristics for two possible cases (based on the original study) are given in Table 5. The results show that the performance of the novel response-adaptive design is qualitatively similar to that in the previous example. Under the null hypothesis (scenario 1), the performance of all methods is similar and the type I error is controlled. Under the alternative hypothesis (scenario 2), the WE Det design with the calibrated value of κ = 0.54 results in a similar proportion of patients assigned to the superior arm to that of Max Prob (0.70 against 0.71) and nearly the same average number of ORR observed in the trial. However, as in the previous example, the WE Det design results in higher statistical power (0.49 against 0.33). To increase power (at the cost of a lower ORR), a higher value of κ can be used or randomization between arms can be employed. The value κ = 0.69 using WE Det leads to an increase in power to 0.63 at the cost of a slight decrease in ENS of approximately one treated patient. Moreover, WE Det (0.69) implies nearly the same power as does FR but results in nearly eight more patients treated on the better treatment on average. Alternatively, using the randomized assignment rule, the proposed design results in even higher statistical power (0.67) but in five fewer ENS compared with WE Det (0.69), which is still higher than for FR.

Table 5. Operating characteristics of the WE design under the randomization rule (WE Ran), under the deterministic rule (WE Det) for various κ (in parentheses), the Max Prob design and FR in the trial with two co-primary efficacy end points and n = 165 under the null and alternative hypotheses†

Overall, the example with co-primary efficacy end points supports the previously found results. For the tuned robust optimal values of the penalization parameter, the design proposed can perform comparably in terms of ENS with the multiarm bandit design that prioritizes exploitation but can outperform it in terms of the power. Moreover, the WE design can result in similar power to that of FR but noticeably greater ENS.

In this work, we proposed a general criterion for the selection of arms in experiments with multinomial outcomes that is based on the weighted information measure and is of particular use in the setting with ethical and strict sample size constraints. We considered two families of weight functions and demonstrated how the criterion proposed can be used for an arbitrary weight function reflecting various objectives of experiments that are of interest to investigators. For the weight functions considered, the information gain criteria preserve the flexibility and enable tailoring the design parameters in light of the exploration–exploitation trade-off. The design parameters should be carefully tuned before application of the design to ensure desirable statistical properties of the design with high probability and competitive advantages over the design that was considered in this work. Such a tuning procedure to find the optimal robust design for a generic objective function is proposed.

The prior distribution that was used in the illustrative examples was chosen to protect allocation of patients to the control arm in a non-rule-based manner, by the design itself. However, alternative specifications of prior distributions can be considered. In general, a prior distribution that does not secure more patients on the control arm (either through $\beta_0$ or $p^{(0)}$) will require higher values of the penalization parameter κ to reach the same level of power compared with the prior ensuring more patients on the control arm. In fact, the design under each of these priors would require a search of the robust optimal penalization parameter κ as described in Section 3. We refer the reader to the on-line supplementary materials for an evaluation of different prior distributions on the properties of the design for various values of κ. It was found that, given the investigator’s preferences in the power–ENS balance, other prior distributions can provide gains that are similar to those found for the prior assumptions considered.

Throughout the paper, we have intentionally focused on only two competing methods under each example to provide a benchmark for comparison and to focus on the method proposed. In the examples provided above, it was found that, when compared with some multiarm bandit approaches that favour exploitation, the proposed design for the robust optimal values of κ that were found and the prior distribution considered can yield better power while resulting only in a minor reduction in ENS. At the same time, there are other modified multiarm bandit approaches that were proposed in the literature that could be applied to the problems considered and can result in a better power–ENS balance than their original counterparts. Specifically, Villar, Wason and Bowden (2015) proposed a randomized version of the GI design to tackle the exploitation–exploration trade-off. Furthermore, Villar, Bowden and Wason (2015) proposed a GI modification that imposes a rule-based mechanism on controlling the number of patients on the control treatment and found that it leads to a noticeable improvement in power but only minor losses in ENS. A similar controlling procedure could be imposed on the design proposed. Similarly, a GI is defined for multinomial outcomes (Glazebrook, 1978) that could be an alternative approach for the problem with co-primary outcomes studied in Section 5. A comprehensive comparison of these procedures in a large number of potential simulation scenarios is of interest and is the subject of future research.

Throughout the work, the examples concerned phase II clinical trials evaluating efficacy. However, the design was also found to provide benefits in the setting of phase I clinical trials seeking to select the maximum tolerated dose (i.e. the target probability γ is the toxicity probability at the maximum tolerated dose), particularly when the assumption of monotonicity is questionable. We refer the reader to the on-line supplementary materials for the corresponding results. Therefore, although phase I and phase II trials pose two different questions, the general formulation of the design proposed (to target the TA with specific characteristics) enables its application in both settings and in a wide range of trials.

In the evaluations presented, a fixed target value of γ = 0.999 was considered. At the same time, there are many clinical settings in which the maximum clinically feasible efficacy probability can be specified before the trial. We study the effect of the target value in many different scenarios in the on-line supplementary materials . Setting the target below the true response probability results in targeting an inferior arm and worsening the performance in terms of both ENS and statistical power. Therefore, it is preferred to be more conservative and to ensure that the target probability is sufficiently high. Studying various target values, we found that, under the ‘select-the-best’ allocation rule, the influence of the target value on the operating characteristics is small. Under the randomization allocation rule, however, specification of the target value close to the true maximum value yields noticeably more patients on the superior arm.

An important assumption that is employed by the proposed response-adaptive information theoretic design (as well as for the majority of alternative response-adaptive procedures; Villar, Bowden and Wason ( 2015 )) is that the patients’ responses are observed shortly after the treatment, or at least before the next patient is to be enrolled in the study. This, however, might not hold in many clinical trials. Consequently, the question of incorporating delayed responses is of great practical interest.

In this work, only multinomial outcomes were considered. Generalizing the proposed approach to experiments with continuous outcomes, together with its non-parametric extension, is the subject of future research.

Although clinical trials have been the main motivation for this work, the design can be applied to a wide range of problems of similar nature. For example, areas where the multiarm bandit approach has found applications are on-line advertising, portfolio design and queuing and communication networks (see Gittins et al . ( 2011 ), and references therein). In these settings, however, the sample size is not one of the main constraints in contrast with the clinical trial setting that was considered in this work. Nevertheless, the general principles proposed can be applied in these problems and their merits in the setting with easy-to-collect observations are to be studied. On top of that, the design proposed can be used in more general problems of selecting an arm corresponding to target value γ rather than the selection of the highest success probability only. It is important to emphasize that the selection criterion derived can be also applied in conjunction with parametric models, which also expands its possible applications. In fact, the parameters can be estimated by any desirable method and then ‘plugged in’ in the criterion which preserves its properties.

Additional ‘supporting information’ may be found in the on-line version of this article:

‘An information-theoretic approach for selecting arms in clinical trials: Supplementary materials’.

The authors acknowledge the insightful and constructive comments made by the Joint Editor, Associate Editor and three reviewers. These comments have greatly helped us to refine the original submission. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement 633567 and is independent research supported by the National Institute for Health Research (National Institute for Health Research Advanced Fellowship, Dr Pavel Mozgunov, NIHR300576) and by Professor Jaki’s Senior Research Fellowship (NIHR-SRF-2015-08-001) supported by the National Institute for Health Research. The views that are expressed in this publication are those of the authors and not necessarily those of the National Health Service, the National Institute for Health Research or the Department of Health and Social Care. TJ is also supported by the UK Medical Research Council (grant MC UU 0002/14).

Agresti , A. ( 1992 ) A survey of exact inference for contingency tables . Statist. Sci. , 7 , 131 – 153 .

Aitchison , J. ( 1982 ) The statistical analysis of compositional data (with discussion) . J. R. Statist. Soc. B, 44 , 139 – 177 .

Aitchison , J. ( 1992 ) On criteria for measures of compositional difference . Math. Geol. , 24 , 365 – 379 .

Azriel , D. , Mandel , M. and Rinott , Y. ( 2011 ) The treatment versus experimentation dilemma in dose finding studies . J. Statist. Planng Inf. , 141 , 2759 – 2768 .

Barrett , J. E. ( 2016 ) Information-adaptive clinical trials: a selective recruitment design . Appl. Statist. , 65 , 797 – 808 .

Belis , M. and Guiasu , S. ( 1968 ) A quantitative-qualitative measure of information in cybernetic systems . IEEE Trans. Inform. Theory , 14 , 593 – 594 .

Browne , C. B. , Powley , E. , Whitehouse , D. , Lucas , S. M. , Cowling , P. I. , Rohlfshagen , P. , Tavener , S. , Perez , D. , Samothrakis , S. and Colton , S. ( 2012 ) A survey of Monte Carlo tree search methods . IEEE Trans. Computnl Intell. AI Games , 4 , 1 – 43 .

Clertant , M. and O’Quigley , J. ( 2017 ) Semiparametric dose finding methods . J. R. Statist. Soc. B, 79 , 1487 – 1508 .

Clim , A. ( 2008 ) Weighted entropy with application . Anal. Univ. Buc. Mat. Anul , 57 , 223 – 231 .

Cover , T. M. and Thomas , J. A. ( 2012 ) Elements of Information Theory . New York : Wiley .

Djulbegovic , B. , Lacevic , M. , Cantor , A. , Fields , K. K. , Bennett , C. L. , Adams , J. R. , Kuderer , N. M. and Lyman , G. H. ( 2000 ) The uncertainty principle and industry-sponsored research . Lancet , 356 , 635 – 638 .

Dunnett , C. W. ( 1984 ) Selection of the best treatment in comparison to a control with an application to a medical trial. In Design of Experiments: Ranking and Selection (eds T. J. Santner and A. C. Tamhane ), pp. 47 – 66 . New York : Dekker .

Gittins , J. , Glazebrook , K. and Weber , R. ( 2011 ) Multi-armed Bandit Allocation Indices . Chichester : Wiley .

Gittins , J. C. and Jones , D. M. ( 1979 ) A dynamic allocation index for the discounted multiarmed bandit problem . Biometrika , 66 , 561 – 565 .

Glazebrook , K. ( 1978 ) On the optimal allocation of two or more treatments in a controlled clinical trial . Biometrika , 65 , 335 – 340 .

Guo , B. , Li , Y. and Yuan , Y. ( 2016 ) A dose–schedule finding design for phase I–II clinical trials . Appl. Statist. , 65 , 259 – 272 .

Iasonos , A. , Wages , N. A. , Conaway , M. R. , Cheung , K. , Yuan , Y. and O’Quigley , J. ( 2016 ) Dimension of model parameter space and operating characteristics in adaptive dose-finding studies . Statist. Med. , 35 , 3760 – 3775 .

Jones , D. ( 1975 ) Search procedures for industrial chemical research. PhD Thesis. University of Cambridge , Cambridge .

Jones , D. M. ( 1970 ) Sequential method for industrial chemical research. PhD Thesis. University of Wales .

Kelbert , M. and Mozgunov , P. ( 2015 ) Asymptotic behaviour of the weighted Renyi, Tsallis and Fisher entropies in a Bayesian problem . Eurasn Math. J. , 6 , 6 – 17 .

Kelbert , M. and Mozgunov , P. ( 2017 ) Generalization of Cramér-Rao and Bhattacharyya inequalities for the weighted covariance matrix . Math. Communs , 22 , 25 – 40 .

Kelbert , M. , Suhov , Y. , Izabella , S. and Yasaei , S. S. ( 2016 ) Basic inequalities for weighted entropies . Aequn. Math. , 90 , 1 – 32 .

Kim , S. B. and Gillen , D. L. ( 2016 ) A Bayesian adaptive dose-finding algorithm for balancing individual- and population-level ethics in Phase I clinical trials . Sequent. Anal. , 35 , 423 – 439 .

Klotz , J. ( 1978 ) Maximum entropy constrained balance randomization for clinical trials . Biometrics , 34 , 283 – 287 .

Koenig , F. , Brannath , W. , Bretz , F. and Posch , M. ( 2008 ) Adaptive Dunnett tests for treatment selection . Statist. Med. , 27 , 1612 – 1625 .

Lee , S. M. , Ursino , M. , Cheung , Y. K. and Zohar , S. ( 2019 ) Dose-finding designs for cumulative toxicities using multiple constraints . Biostatistics , 20 , 17 – 29 .

Lotze , T. and Loecher , M. ( 2014 ) bandit: functions for simple A/B split test and multi-armed bandit analysis . R Package Version 0.5.0 . (Available from https://CRAN.R-project.org/package=bandit .)

Magirr , D. , Jaki , T. and Whitehead , J. ( 2012 ) A generalized Dunnett test for multi-arm clinical studies with treatment selection . Biometrika , 99 , 494 – 501 .

Mozgunov , P. and Jaki , T. ( 2019 ) An information theoretic phase I–II design for molecularly targeted agents that does not require an assumption of monotonicity . Appl. Statist. , 68 , 347 – 367 .

Mozgunov , P. and Jaki , T. ( 2020 ) Improving safety of the continual reassessment method via a modified allocation rule . Statist. Med. , 39 , 906 – 922 .

Mozgunov , P. , Jaki , T. and Gasparini , M. ( 2019 ) Loss functions in restricted parameter spaces and their Bayesian applications . J. Appl. Statist. , 46 , 2314 – 2337 .

O’Quigley , J. , Iasonos , A. and Bornkamp , B. ( 2017 ) Handbook of Methods for Designing, Monitoring, and Analyzing Dose-finding Trials . Boca Raton : CRC Press .

Pushpakom , S. , Kolamunnage-Dona , R. , Taylor , C. , Foster , T. , Spowart , C. , García-Fiñana , M. , Kemp , G. J. , Jaki , T. , Khoo , S. , Williamson , P. and Pirmohamed , M. for the TAILOR Study Group ( 2019 ) TAILoR (TelmisArtan and InsuLin Resistance in Human Immunodeficiency Virus [HIV]): an adaptive-design, dose-ranging phase IIb randomized trial of telmisartan for the reduction of insulin resistance in HIV-positive individuals on combination antiretroviral therapy . Clin. Infect. Dis. , 70 , no. 10 .

Ristl , R. , Urach , S. , Rosenkranz , G. and Posch , M. ( 2018 ) Methods for the analysis of multiple endpoints in small populations: a review . J. Biopharm. Statist. , 29 , 1 – 29 .

Riviere , M.-K. , Dubois , F. and Zohar , S. ( 2015 ) Competing designs for drug combination in phase I dose-finding clinical trials . Statist. Med. , 34 , 1 – 12 .

Shannon , C. E. ( 1948 ) A mathematical theory of communication . Bell Syst. Tech. J. , 27 , 379 – 423 .

Smith , A. L. and Villar , S. S. ( 2018 ) Bayesian adaptive bandit-based designs using the Gittins index for multi-armed trials with normally distributed endpoints . J. Appl. Statist. , 45 , 1052 – 1076 .

Song , J. X. ( 2015 ) A two-stage design with two co-primary endpoints . Contemp. Clin. Trials Communs , 1 , 2 – 4 .

Stallard , N. and Todd , S. ( 2003 ) Sequential designs for phase III clinical trials incorporating treatment selection . Statist. Med. , 22 , 689 – 703 .

Tate , R. F. ( 1955 ) The theory of correlation between two continuous variables when one is dichotomized . Biometrika , 42 , 205 – 216 .

Thall , P. F. and Cook , J. D. ( 2004 ) Dose-finding based on efficacy–toxicity trade-offs . Biometrics , 60 , 684 – 693 .

Villar , S. S. , Bowden , J. and Wason , J. ( 2015 ) Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges . Statist. Sci. , 30 , 199 – 215 .

Villar , S. S. , Bowden , J. and Wason , J. ( 2018 ) Response-adaptive designs for binary responses: how to offer patient benefit while being robust to time trends? Pharm. Statist. , 17 , 182 – 197 .

Villar , S. S. , Wason , J. and Bowden , J. ( 2015 ) Response-adaptive randomization for multi-arm clinical trials using the forward looking Gittins index rule . Biometrics , 71 , 969 – 978 .

Wages , N. A. , Conaway , M. R. and O’Quigley , J. ( 2011 ) Continual reassessment method for partial ordering . Biometrics , 67 , 1555 – 1563 .

Wages , N. A. , O’Quigley , J. and Conaway , M. R. ( 2014 ) Phase I design for completely or partially ordered treatment schedules . Statist. Med. , 33 , 569 – 579 .

Wathen , J. K. and Thall , P. F. ( 2017 ) A simulation study of outcome adaptive randomization in multi-arm clinical trials . Clin. Trials , 14 , 432 – 440 .

Whitehead , J. and Williamson , D. ( 1998 ) Bayesian decision procedures based on logistic regression models for dose-finding studies . J. Biopharm. Statist. , 8 , 445 – 467 .

Williamson , S. F. , Jacko , P. , Villar , S. S. and Jaki , T. ( 2017 ) A Bayesian adaptive design for clinical trials in rare diseases . Computnl Statist. Data Anal. , 113 , 136 – 153 .

Williamson , S. F. and Villar , S. S. ( 2019 ) A response-adaptive randomization procedure for multi-armed clinical trials with normally distributed outcomes . Biometrics , 76 , 197 – 209 .

Zhou , H. , Lee , J. J. and Yuan , Y. ( 2017 ) BOP2: Bayesian optimal design for phase II clinical trials with simple and complex endpoints . Statist. Med. , 36 , 3302 – 3314 .

Online ISSN 1467-9868
Print ISSN 1369-7412
Copyright © 2024 Royal Statistical Society

Open access
Published: 26 June 2021

Recommendations for designing and analysing multi-arm non-inferiority trials: a review of methodology and current practice

Jake Emmerson (ORCID: orcid.org/0000-0002-8198-8043), Susan Todd & Julia M. Brown

Trials volume 22, Article number: 417 (2021)

Background and purpose

Multi-arm non-inferiority (MANI) trials, here defined as non-inferiority trials with multiple experimental treatment arms, can be useful in situations where several viable treatments exist for a disease area or for testing different dose schedules. To maintain the statistical integrity of such trials, issues regarding both design and analysis must be considered, from both the multi-arm and the non-inferiority perspectives. Little guidance currently exists on exactly how these aspects should be addressed and it is the aim of this paper to provide recommendations to aid the design of future MANI trials.

A comprehensive literature review covering four databases was conducted to identify publications associated with MANI trials. Literature was split into methodological and trial publications in order to investigate the required design and analysis considerations for MANI trials and whether they were being addressed in practice.

A number of issues were identified that, if not properly addressed, could lead to problems with the family-wise error rate (FWER), power or bias. These ranged from the structuring of trial hypotheses at the design stage to the consideration of potential heterogeneous treatment variances at the analysis stage. One key issue of interest was adjustment for multiple testing at the analysis stage. There was little consensus concerning whether more powerful p value adjustment methods were preferred to approximate adjusted CIs when presenting and interpreting the results of MANI trials.

We found 65 examples of previous MANI trials, of which 31 adjusted for multiple testing out of the 39 that were adjudged to require it. Trials generally preferred to utilise simple, well-known methods for study design and analysis and while some awareness was shown concerning FWER inflation and choice of power, many trials seemed not to consider the issues and did not provide sufficient definition of their chosen design and analysis approaches.

Conclusions

While MANI trials to date have shown some awareness of the issues raised within this paper, very few have satisfied the criteria of the outlined recommendations. Going forward, trials should consider the recommendations in this paper and ensure they clearly define and reason their choices of trial design and analysis techniques.

Non-inferiority trials are used for determining if new treatments are no more than a pre-determined amount less efficacious than the current standard treatment. They can be particularly advantageous in studies where new treatments provide an alternative benefit to the patient or funder. This includes potentially being less invasive, less toxic, less costly or less time-consuming to administer. It is also becoming increasingly important for trials to be able to run efficiently in order to provide useful and potentially practice-changing information in a timely manner. Currently, one way in which this extra efficiency is achieved is by running trials with multiple arms, where shared control data are compared with the experimental arms. Multi-arm trials can reduce the cost of running studies and decrease the required number of patients to carry out a trial when compared to creating separate trials for each experimental treatment option.

It is not uncommon for non-inferiority trials to include a third arm in addition to the experimental and active control arms. In what the literature often calls the ‘gold-standard’ non-inferiority design [ 1 , 2 ], the third arm is a placebo, included to allow a test for assay sensitivity, which ensures that the treatments being compared for non-inferiority are not themselves ineffective. Guidance from the European Medicines Agency (EMA) states that, where ethically allowable, it is preferable for a placebo arm to be included in non-inferiority trials [ 3 ]. Multiple experimental arms are less common in non-inferiority trials than in superiority trials, despite the potential advantages of including them.

There are a number of situations in which non-inferiority trials with multiple experimental arms could be, and indeed have been (see the “ MANI trials in practice ” section), useful. In disease areas where different viable treatments exist, it would be preferable to test multiple different (potentially new) treatments against one another or against a common control simultaneously if any of the arms are preferable in terms of their toxicity or cost. Alternatively, a trial may wish to test various doses of the same treatment, or different dosing discontinuation schedules, in order to ensure that they still provide an acceptable outcome relative to the standard treatment schedule. The key difference between these scenarios is the relatedness of the treatment arms and how the hypotheses for such trials may be set up. Both scenarios are of interest within this paper.

A common issue that can arise when carrying out analyses on any trial with multiple experimental arms is potential inflation of the family-wise error rate (FWER), that is, the probability of making at least one type I error from a set of multiple comparisons. When carrying out multi-arm non-inferiority trials, multiple comparisons are made between treatment arms, either between one another or against the control treatment(s). It is important in this case that we ensure the FWER is controlled in order to reduce the chance of making erroneous claims.
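As a rough illustration of why the FWER inflates, the sketch below computes it for k comparisons, each tested at level α, under the simplifying assumption that the tests are independent. That assumption is ours and not the paper's: comparisons against a shared control are positively correlated, so the true FWER in such a trial would typically lie somewhat below the value returned here, although still above α. The function names and the one-sided level of 0.025 are illustrative choices only.

```python
def fwer_independent(alpha: float, k: int) -> float:
    """FWER for k tests at level alpha, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_level(alpha: float, k: int) -> float:
    """Per-comparison level that keeps the FWER at or below alpha (Bonferroni)."""
    return alpha / k


if __name__ == "__main__":
    alpha = 0.025  # illustrative one-sided level
    for k in (1, 2, 3, 4):
        unadjusted = fwer_independent(alpha, k)
        adjusted = fwer_independent(bonferroni_level(alpha, k), k)
        print(f"k={k}: unadjusted FWER={unadjusted:.4f}, Bonferroni FWER={adjusted:.4f}")
```

The Bonferroni line is included only to show that dividing α by k brings the FWER back below α in this idealised setting; it is the simplest of the adjustment methods and generally the most conservative.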

This paper investigates the methodology and current conduct of frequentist fixed sample size trials with multiple experimental arms including either an active control arm (or arms) or a placebo arm, or both, where the primary and/or key secondary hypotheses are analysed in a non-inferiority framework. These will hereafter be referred to as multi-arm non-inferiority (MANI) trials. In order to do this, separate searches were carried out to identify literature outlining methodological issues and required considerations for MANI trials and to find examples of where such trials have been carried out in practice.

The aims of this manuscript are to summarise the statistical concerns raised in the literature around running MANI trials and to assess whether or not these considerations are addressed in practice, looking at current and past trials. This will be done by first identifying the key statistical issues involved in designing MANI trials and the considerations required when addressing these issues, before evaluating and comparing the methodological approaches that can be used when analysing MANI trials. The first section will provide a high-level outline of the statistical issues found with reference to some of the available methods for addressing them. These issues will affect both the design and analysis phase of MANI trials but should all be considered within the design phase as part of a statistical analysis plan for a trial. In the second section, the conduct of current and past MANI trials will be assessed and summarised in order to show whether the issues raised are being addressed in practice. The research is brought together to give recommendations for how MANI trials can be carried out in practice and which methods are most appropriate to implement in different trial scenarios.

Literature search methodology

A comprehensive literature review was conducted to obtain and analyse all current literature regarding statistical methodology and design considerations required when conducting MANI trials and all current and previous MANI trials that have been carried out. Search terms were developed for the following major electronic databases: MEDLINE (Ovid), EMBASE (Ovid), Science Citation Index (Web of Science) and the Cochrane Library (Wiley), each from inception. The search terms are provided in Appendix A . The first search was conducted in February 2020 and auto alerts were set up to ensure further publications released prior to publication were not missed. Additional publications were identified by searching references and citations of useful literature.

From the original search, publications were split into methodological literature and trial literature based on title and abstract review, whereupon papers were read in full to assess whether they were suitable for inclusion within the review. The details of the search can be found in Fig. 1 .

Fig. 1 Literature search flowchart

In addition, an assessment of regulatory, guidance and review documents on non-inferiority trials was carried out to identify further possible trial considerations required that could be relevant. These mostly included guidance documents and reviews from groups of experts.

Methodological papers were considered in the MANI setting and for ‘gold-standard’ non-inferiority trials (as defined above); these were assessed to identify whether they were also applicable in the MANI setting and whether and how the methods changed when doing so. If papers only considered methods for ‘gold-standard’ trials, they were excluded. Publications giving general guidance around non-inferiority designs were also included, as well as papers that gave more general guidance around multiplicity adjustment. Other criteria for excluding methods-based papers were Bayesian methods, non-MANI methods (i.e. methods based on neither MANI nor ‘gold-standard’ designs), phase II trial methods and papers where insufficient information was given to add to the review (i.e. wider reviews of statistical considerations in a disease area, comments on other methodological papers and abstracts for which further information could not be found).

When searching for practical examples of MANI trials, it was noticed that while the defined literature search strategies identified the majority of such trials, not all trial publications stated specifically that they were multi-arm designs and thus were not included in the results of the original searches. Only MANI trials, i.e. those involving multiple experimental arms (as described in the introduction), were included in the results presented in this paper; three-arm ‘gold-standard’ non-inferiority trials were excluded, as were trials on non-human subjects. We considered trials without a placebo arm to be eligible. The other exclusion criteria for trial papers were trials that were not phase III, trials where MANI analyses were exploratory only, two-armed trials, trials where analyses were not non-inferiority, abstracts and larger reviews of several trials (in this case searches were carried out for individual trials within these publications).

Statistical considerations for MANI trials

This section summarises the statistical issues and methodology found within the literature review. When initially searching the methodological literature, a large number of publications were found that mentioned multi-arm trials or non-inferiority trials, but few considered both within the same setting. Across the four databases searched, 68 papers were found that mentioned methodology within non-inferiority trials with multiple arms, of which, upon further reading, 9 were found to be directly applicable to the scope of this review [ 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 ]. These papers cover a variety of methodological areas concerning the design and analysis of MANI trials. A further 17 papers were found when searching references and citations of papers found within the review. The larger number of papers found through references than through the initial searches is thought to be due primarily to two factors: the FDA and EMA guidance documents did not appear in the initial searches, and several papers that were not written as MANI-specific methodology papers nonetheless contain concepts that can be applied within a MANI framework, whether they originally referred to multi-arm trials (superiority or in general) or to two-arm or ‘gold-standard’ (as previously defined) non-inferiority trials.

The next subsections outline and summarise the statistical considerations that need to be made when designing and analysing MANI trials. The need for these issues to be addressed will change on a trial by trial basis. The issues, in some cases, are not unique to MANI trials, but may require an alternative approach or further thought than is required in other types of trial. If not properly addressed, the issues outlined may result in bias being introduced into the trial or inflation/deflation of the type I and type II errors of a trial which can ultimately undermine the ability to form strong conclusions from a trial and trust any statistical inferences made.

Ordering and structuring of hypotheses

The setting up of hypotheses is a fundamental requirement for any clinical trial; they put the research question into terms that trial data can seek to answer. Non-inferiority trials have the added complexity of setting out a suitable non-inferiority margin for the experimental treatments and multi-arm trials have to consider whether the implications of their hypotheses mean that adjustment for multiple testing is required in order to control the FWER of the trial. While adjustment for multiple testing in clinical trials is not a new concept, there are still issues around the best way to do so when carrying out MANI trials, particularly in the analysis stage (see later sections). If required, multiple testing adjustments can be made either through choice of trial design, through choice of analysis method or both.

The question of whether adjustment is required is an area of strong debate within the literature. The recent CONSORT extension for multi-arm trials stated that it is “a challenging issue” and while advocating that trials should state their reasons for adjusting or not adjusting, they refrain from making explicit statements as to when this should be done [ 13 ]. Determining whether adjustment is warranted should be decided appropriately for each trial after reviewing the literature. Some of the potential areas to consider when designing a trial are summarised throughout the remainder of this section along with some of the schools of thought in these areas.

One of the key indicators as to whether multiplicity adjustment may be required when analysing a multi-arm trial with a shared control group is the structuring of the hypotheses. Howard et al. (and references therein) [ 14 ] summarise many of the current opinions within the literature around whether adjustment is required in multi-arm trials, stating that most of the disagreement surrounds the definition of a family of hypotheses. This paper is not specific to MANI trials and generally uses superiority trials within its examples, but the ideas can be applied to all multi-arm trials with a shared control group. The authors state that many views are “based on philosophical opinion rather than statistical theory”.

Howard et al. argue that the ordering and nesting of the hypotheses is critical when deciding whether to adjust [ 14 ]. Their criteria are as follows. If hypotheses are used together to make a single claim of efficacy and all individual hypotheses have to be rejected in order to reject a global hypothesis, then it is not necessary to adjust to control the FWER; however, it may be necessary to adjust for the increased probability of observing multiple type I errors simultaneously. If the hypotheses together form a single claim of efficacy but do not all have to be rejected to claim efficacy, then it is necessary to adjust to control the FWER. If the arms lead to individual claims of efficacy, it is argued that no adjustment is required, as adjusting would penalise the trial for its increased efficiency over carrying out multiple two-armed trials.

The decision of how to structure the hypotheses in order to best answer a research question is also reflected within the choice of power to be used within a trial. Westfall and Young [ 15 ] give a summary of the different possible choices of power and in which trial situations they are appropriate, for both MANI and other types of multi-arm trial. These include the all pair power, which is the probability of correctly rejecting all false hypotheses, the any pair power, which is the probability of correctly rejecting at least one false hypothesis and the per pair power, which is the probability of rejecting a specific false hypothesis (generally the hypothesis of highest interest).
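The three definitions of power can differ markedly within the same trial. The simulation below is a minimal sketch of this under assumptions that are ours and not taken from the literature reviewed: normally distributed outcomes with known unit variance, equal allocation, a shared control, every experimental arm truly equal to control (so every null hypothesis of inferiority is false), a margin of 0.4 and an unadjusted one-sided level of 0.025. The function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_powers(n_per_arm=100, n_arms=3, margin=0.4, alpha=0.025,
                    n_sims=5000, seed=1):
    """Monte Carlo estimate of all-pair, any-pair and per-pair power.

    Assumes unit-variance normal outcomes, so each arm mean is N(0, 1/n).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    se_mean = (1.0 / n_per_arm) ** 0.5   # SD of a single arm mean
    se_diff = (2.0 / n_per_arm) ** 0.5   # SD of experimental minus control
    all_hits = any_hits = first_hits = 0
    for _sim in range(n_sims):
        control = rng.gauss(0.0, se_mean)  # shared by all comparisons
        rejected = []
        for _arm in range(n_arms):
            experimental = rng.gauss(0.0, se_mean)
            # Non-inferiority z statistic: H0 is "difference <= -margin".
            z = (experimental - control + margin) / se_diff
            rejected.append(z > z_crit)
        all_hits += all(rejected)      # every false null rejected
        any_hits += any(rejected)      # at least one false null rejected
        first_hits += rejected[0]      # one specific false null rejected
    return {
        "all_pair": all_hits / n_sims,
        "any_pair": any_hits / n_sims,
        "per_pair": first_hits / n_sims,
    }


if __name__ == "__main__":
    print(simulate_powers())
```

By construction the any pair power is at least the per pair power, which is at least the all pair power, so a sample size chosen to give adequate per pair power may be insufficient for a trial whose conclusion requires all arms to be declared non-inferior.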

This interest in specific trial arms can also be reflected in the choice of contrast coefficients when comparing multiple trial arms. Contrast tests are a useful strategy to use when assessing multiple doses of an experimental treatment, particularly within dose response detection [ 16 ]. They involve applying different weights to trial arms, whether to reflect an increasing dose in arms or to allow comparison of trial arms against one another as well as against a common control. Chang [ 4 ] looks at using contrast tests in MANI and multi-arm superiority trials with continuous, binary and survival endpoints and summarises how different choices of contrast tests can affect the power and the overall sample size.

Dmitrienko et al. [ 17 ] give a good summary of the different methods of setting up multiple hypothesis tests in multi-arm trials (not specifically MANI) and which methods require multiplicity adjustment. This includes union-intersection (UI) testing, where rejecting only one individual hypothesis is sufficient to reject the global null hypothesis, and intersection-union (IU) testing, where all individual hypotheses must be rejected to do so. They also summarise closed testing procedures and partitioning tests, which are more powerful than assessing hypotheses on an individual basis while still allowing the FWER to be strongly controlled. Closed testing procedures, one of the more common methods of adjustment, involve creating a closed ‘family’ of hypotheses, for which every intersection between hypotheses is tested at a local level; hierarchical testing of hypotheses can correspond to simple closed testing procedures. Partitioning tests involve splitting unions of hypotheses into multiple, mutually exclusive hypotheses and testing them individually.
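The differences between these set-ups can be sketched in a few lines. The code below is an illustration under our own assumptions and is not taken from Dmitrienko et al.: the IU test needs no adjustment because every hypothesis must be rejected, the UI test is shown with a Bonferroni-adjusted threshold, and the Holm step-down procedure is used as a familiar example of a closed testing procedure (it is the shortcut obtained when every intersection hypothesis is tested with a Bonferroni test). Function names, p values and the level of 0.025 are hypothetical.

```python
def intersection_union_reject(p_values, alpha=0.025):
    """IU test: reject the global null only if every individual test rejects."""
    return max(p_values) <= alpha


def union_intersection_reject(p_values, alpha=0.025):
    """UI test with Bonferroni adjustment: reject the global null if any
    individual test rejects at level alpha / k."""
    return min(p_values) <= alpha / len(p_values)


def holm_rejections(p_values, alpha=0.025):
    """Holm step-down procedure, a shortcut to closed testing with
    Bonferroni tests of each intersection hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            break  # once one hypothesis is retained, all larger p values are too
    return rejected


if __name__ == "__main__":
    p = [0.004, 0.030, 0.011]  # hypothetical one-sided p values for three arms
    print("IU global rejection:", intersection_union_reject(p))
    print("UI global rejection:", union_intersection_reject(p))
    print("Holm rejections per arm:", holm_rejections(p))
```

With these hypothetical p values the Holm procedure rejects the second-smallest p value at 0.025/2, a threshold that a plain Bonferroni adjustment (0.025/3 for every arm) would not allow, which illustrates the gain in power from closed testing.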

Hasler and Hothorn [ 5 ] use UI and IU testing principles in their procedures when assessing different approaches to analysing MANI trials with multiple correlated endpoints. They summarise their preferred analysis approaches for various hypothesis structures and outline where adjustment for multiple testing is required. These structures include global non-inferiority, where the null hypothesis of inferiority must be rejected for every treatment and endpoint in order to conclude non-inferiority, and global non-inferiority for a treatment group, where the null hypothesis must be rejected for every endpoint within a single arm in order to conclude non-inferiority for that arm.

Changing hypotheses when carrying out MANI trials

In addition to the initial considerations around the structuring of the hypotheses, it may also be of interest to change the hypothesis of a trial upon observation of a result (preferably having pre-specified this possibility before running the trial). Typically, this involves switching from a non-inferiority test to a superiority test. If the switch is made for a single experimental arm and a single primary endpoint that remains unchanged, there is no increase in the type I error, as the switch corresponds to a simple closed testing procedure [ 18 ]. However, the FDA [ 19 ] warn that once multiple arms (or endpoints) are included in a trial where non-inferiority and superiority are both tested, the FWER can be inflated and adjustment may be required.
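For the single-arm, single-endpoint case, the switch can be sketched with one confidence bound. The code below is a minimal illustration under our own assumptions: the treatment difference is experimental minus control with higher values being better, the estimate is approximately normal with a known standard error, and the same one-sided level is used for both tests. The function name and example numbers are hypothetical, and the sketch deliberately does not cover the multi-arm or multi-endpoint settings in which the FDA caution applies.

```python
from statistics import NormalDist


def ni_then_superiority(diff, se, margin, alpha=0.025):
    """Fixed-sequence test: non-inferiority first, superiority only if NI holds.

    diff   : estimated difference, experimental minus control (higher is better)
    se     : standard error of diff
    margin : non-inferiority margin, a positive number
    """
    z = NormalDist().inv_cdf(1 - alpha)
    lower = diff - z * se                       # one-sided lower confidence bound
    non_inferior = lower > -margin              # bound excludes the margin
    superior = non_inferior and lower > 0.0     # bound also excludes zero
    return {"lower_bound": lower, "non_inferior": non_inferior, "superior": superior}


if __name__ == "__main__":
    print(ni_then_superiority(diff=0.10, se=0.10, margin=0.30))
    print(ni_then_superiority(diff=0.30, se=0.10, margin=0.30))
```

Because both conclusions are read from the same lower bound, and superiority implies non-inferiority, no additional type I error is spent by testing superiority after non-inferiority in this simple setting.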

Ke et al. [ 20 ] look at the scenario where in addition to switching from non-inferiority to superiority for a primary endpoint, another secondary endpoint is tested for superiority in a hierarchical fashion. They discovered that in this case, whether for a MANI trial or a two-armed non-inferiority trial, the FWER can be inflated and thus a suitable multiplicity adjustment should be made. More specifically, without multiplicity adjustment, “the type I error rate increases as the non-inferiority margin gets larger and inflation is more severe for moderately positive correlation between the two endpoints”. This is addressed in Lawrence’s [ 10 ] paper where he develops a closed testing procedure for the scenario described in Ke et al.’s paper that suitably controls the FWER even with multiple experimental treatments. Zhong et al. [ 12 ] also address the issue of simultaneously testing for non-inferiority and superiority in MANI trials; their solution for ensuring strong control of the FWER was to implement adjustment methods within the trial analysis itself, rather than using a closed testing procedure within the setup of the hypotheses.

Choice of non-inferiority margin definition and subsequent analysis method when carrying out MANI trials

The choice of method with which to assess non-inferiority is dependent on whether it is possible to include a placebo arm within a trial. While acknowledging the ethical issues around whether it is appropriate to do so (these are documented and summarised within ICH E10 [ 21 ]), the EMA recommend that a placebo should be included within a non-inferiority trial wherever possible [ 3 ].

Huang et al. [ 7 ] outline that when a placebo arm can be included within a MANI trial, there are two options for how to set up the non-inferiority margin, with all analysis methods falling into one of the two groups. These are referred to as the “so-called fraction methods, which formulate the NI margin as a fraction of the trial sensitivity” (see Pigeot et al. for an example [ 22 ]) and the approach where “the NI margin is expressed in terms of the difference of the effects of the E (experimental) and R (reference - referred to as control here)”. This is the method applied in Kwong’s papers on hypothesis testing procedures in MANI trials [ 8 , 9 ] and according to Huang is “more popular in clinical studies”.

In their guidance on selecting an appropriate non-inferiority margin, the EMA [ 3 ] recommend that sufficient thought is given as to the exact aims of the non-inferiority trial before a decision is made on the choice of analysis method. They argue against the use of fraction based methods when the aim of the trial is “to show that there is no important loss of efficacy if the test product is used instead of the reference” and say they are only suitable if the main aim of the trial is to show that the experimental treatment is (or would be) superior to placebo. They do not recommend a single best method of analysis, but give guidance on possible considerations depending on the state of the disease area.

If it is not possible to include a placebo arm within a non-inferiority trial, the assay sensitivity of the control arm can only be argued on a historical basis. In their guidance on non-inferiority trials, the FDA [ 19 ] outline two ways of approaching the analysis if this is the case: the fixed margin approach and the synthesis method. Both of these methods are within the second group outlined by Huang et al. [ 7 ] where the NI margin is given as an expression of the difference between the experimental and control arms. The FDA guidance does not mention the “fraction methods” spoken about by Huang et al. which further illustrates their point that the group of methods which set the NI margin in terms of the difference between the control and experimental treatment are more popular to use in MANI trials.

The fixed margin approach is used as the basis for the hypothesis testing methods for MANI trials seen in papers from Huang et al. [ 7 ], Kwong et al. [ 8 , 9 ] and Zhong et al. [ 12 ]. These methods are spoken about at greater length in later sections of this paper and in Appendix B . The synthesis and fraction based methods were only briefly mentioned when talking about the set-up of hypothesis testing in the MANI trial methodological literature.

Increasing efficiency

A common issue with non-inferiority trials is that they generally require a large sample size, especially in cases where the non-inferiority margin is close to the estimated effect of the control treatment. Kwong et al. [ 9 ] explore optimisation of sample size within their paper, with an algorithm that searches numerically for suitable allocation ratios for the experimental and placebo arms that reach the required any-pair power, enforcing that neither can have a greater allocation than the control arm. They then reduce the total sample size until it is no longer possible to reach the required power.
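As background to why sample size is such a concern, the sketch below uses the standard normal-approximation formula for a two-arm non-inferiority comparison of means with equal allocation. It is not the allocation search described by Kwong et al.; the function name, the default one-sided alpha of 0.025 and the 90% power are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def ni_sample_size_per_arm(sigma, margin, alpha=0.025, power=0.9, true_diff=0.0):
    """Per-arm sample size for a two-arm non-inferiority test of means.

    Assumptions: known common standard deviation `sigma`, equal allocation,
    one-sided level `alpha`, and a true difference (experimental - control)
    of `true_diff`, with larger values favouring the experimental arm.
    """
    z = NormalDist().inv_cdf
    # Distance between the assumed true difference and the boundary -margin
    effect = margin + true_diff
    if effect <= 0:
        raise ValueError("true difference must lie above -margin")
    n = 2 * sigma ** 2 * (z(1 - alpha) + z(power)) ** 2 / effect ** 2
    return ceil(n)


# Halving the margin roughly quadruples the required sample size
print(ni_sample_size_per_arm(sigma=1.0, margin=0.5))
print(ni_sample_size_per_arm(sigma=1.0, margin=0.25))
```

The required sample size scales with the inverse square of the distance between the assumed true difference and the margin, which is why tight margins are so costly.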

For MANI trials designed to identify the minimal effective duration required from a number of experimental arms, Quartagno et al. [ 11 ] advocate modelling the entire duration-response curve and allocating patients to different durations of treatment. They argue that doing this avoids the potential issues caused when selecting non-inferiority margins as well as reducing overall sample size. They also carried out simulations in order to assess the ability of their duration response curves to accurately reflect the true response curve when using different numbers of duration arms, different increments in duration and different flexible regression strategies in the modelling process in order to make recommendations as to their optimal modelling strategy in different trial scenarios.

Simultaneous confidence interval compatibility with p value adjustment for controlling family-wise error rate

While it is possible to adjust for multiple testing within the setup of the trial hypotheses, it may not always be practical or within the frame of interest for the study to do so. In some cases, adjustments will be made at the analysis stage of the trial. There are many methods available for adjusting an analysis to account for potential inflation of the family-wise error rate; however, when considering MANI trials, issues can arise when implementing some of these methods, due to the results which are used to make inferences in such trials.

Superiority trials often make statistical inferences from p values while inferences from non-inferiority trials are conventionally taken from confidence intervals (CIs). In a standard two-arm randomised controlled trial where no multiple testing is present, it is easily possible to obtain both a p value and a CI for the treatment difference, regardless of the framework of the trial, and the conclusions with regard to efficacy of the treatment will agree. However, when multiple testing is involved, there is less information available on methods of adjusting a confidence interval for multiplicity than there is for adjusting p values [ 2 ].

In practice, for single-step adjustment scenarios, in the case of p values, adjustments are made by first calculating a simple p value from a given hypothesis and then comparing it to an adjusted critical value (or by making an equivalent adjustment to the p value and comparing it to a standard critical value). This can easily be translated into creating adjusted CIs (whether for treatment difference on a continuous scale, log odds ratio or log hazard ratio etc.) using a standard confidence interval formula with an altered critical value. This can be done for Bonferroni testing and for Dunnett tests (an adjusted version of this method is implemented in Hasler and Hothorn’s paper [ 5 ]).
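A minimal sketch of a Bonferroni-adjusted interval follows. It assumes a normally distributed estimator with a known standard error and a two-sided interval; the function name and arguments are hypothetical.

```python
from statistics import NormalDist


def bonferroni_ci(estimate, std_error, n_comparisons, alpha=0.05):
    """Two-sided Bonferroni-adjusted confidence interval.

    The usual formula estimate +/- z * SE is kept; only the critical value
    changes, from z(1 - alpha/2) to z(1 - alpha/(2 * n_comparisons)).
    Assumption: a normal approximation is adequate for the estimator.
    """
    crit = NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))
    half_width = crit * std_error
    return estimate - half_width, estimate + half_width


# The same estimate reported for one comparison and for three comparisons
print(bonferroni_ci(0.0, 1.0, n_comparisons=1))
print(bonferroni_ci(0.0, 1.0, n_comparisons=3))
```

The interval widens as the number of comparisons grows, and comparing its bound with the non-inferiority margin gives the same conclusion as comparing the unadjusted p value with the adjusted critical level.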

In stepwise cases, that is, for methods such as the Holm or Hochberg procedure where p values are ordered and tested sequentially against increasing or decreasing critical values, it is not easy to achieve correspondence between an adjusted p value and an adjusted CI, with simple ad hoc methods of creating corresponding adjusted CIs shown to lose overall coverage [ 23 ]. Adjusted CIs are usually calculated together and so are more commonly referred to as simultaneous CIs (SCIs). There are contrasting messages in the literature (summarised below) with some saying that for stepwise adjustment methods, SCIs should be avoided while others have made attempts at creating approximations to stepwise p value adjustment methods for SCIs such as the Holm, Hochberg and Hommel procedures. These methods will be outlined in further detail in later sections.
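To make the stepwise idea concrete, the sketch below implements the Holm step-down procedure alongside single-step Bonferroni. The p values are hypothetical and the function names are chosen for illustration only.

```python
def holm_rejections(p_values, alpha=0.05):
    """Holm step-down procedure: returns one True/False flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        # The threshold relaxes from alpha/m to alpha as hypotheses are rejected
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            break  # stop at the first non-rejection
    return rejected


def bonferroni_rejections(p_values, alpha=0.05):
    """Single-step Bonferroni: every p value is compared with alpha/m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


p = [0.01, 0.02, 0.04]  # hypothetical p values for three experimental arms
print(holm_rejections(p))        # [True, True, True]
print(bonferroni_rejections(p))  # [True, False, False]
```

The threshold applied to each hypothesis depends on the outcome of the earlier steps, so there is no single critical value that can be inserted into a standard confidence interval formula, which is the source of the correspondence problem.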

A good alternative to the outlined single step and stepwise procedures is to use parametric methods of adjustment such as Dunnett t testing and Tukey’s honestly significant difference [ 24 ]. These methods provide the added advantage of taking the correlation of the tests, induced by the shared control group, into account while non-parametric methods usually assume tests are carried out independently of one another and can become more conservative when this is not the case. They are designed for continuous, normally distributed data and operate by adjusting the method to calculate the standard error such that the FWER can be strongly controlled. This allows for the creation of critical values that can be used to calculate SCIs which are less conservative than those created using Bonferroni-based critical values. Tukey’s method has the added advantage of being able to look at pairwise comparisons between treatments while Dunnett’s method is designed only for comparisons between treatment arms and a shared control treatment. Parametric step-up procedures also exist [ 25 ] which are even more powerful than the previously mentioned methods but suffer the same issue with lack of correspondence when creating SCIs as the non-parametric stepwise methods.
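The gain from accounting for the correlation induced by a shared control can be sketched numerically. The code below is a large-sample normal approximation, not Dunnett's exact t-based procedure: it assumes known variances, equal allocation (so that the correlation between test statistics is 0.5) and one-sided testing. The function names, grid settings and the use of bisection are all illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()


def prob_all_below(c, k, rho=0.5, grid=2001, span=8.0):
    """P(Z_1 <= c, ..., Z_k <= c) for equicorrelated standard normal statistics.

    Uses the one-factor form Z_i = sqrt(1 - rho) * V_i - sqrt(rho) * U and
    integrates over the shared component U with the trapezoidal rule.
    """
    step = 2 * span / (grid - 1)
    total = 0.0
    for j in range(grid):
        u = -span + j * step
        weight = 0.5 if j in (0, grid - 1) else 1.0
        inner = _STD.cdf((c + sqrt(rho) * u) / sqrt(1 - rho))
        total += weight * inner ** k * _STD.pdf(u)
    return total * step


def dunnett_type_critical_value(k, alpha=0.025, rho=0.5):
    """One-sided critical value c with P(max Z_i <= c) = 1 - alpha, by bisection."""
    lo, hi = 0.0, 6.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if prob_all_below(mid, k, rho) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


k = 2  # two experimental arms sharing one control
unadjusted = _STD.inv_cdf(1 - 0.025)
bonferroni = _STD.inv_cdf(1 - 0.025 / k)
print(unadjusted, dunnett_type_critical_value(k), bonferroni)
```

Under these assumptions the correlation-aware critical value lies between the unadjusted and Bonferroni values, so the resulting SCIs are narrower than Bonferroni intervals while still controlling the FWER.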

When looking at multiple testing procedures from a superiority perspective, Phillips and an expert group [ 26 ] state that the majority of discussants felt that using unadjusted confidence intervals is preferable when reporting results. It was felt that when choosing a multiple testing procedure a “hypothesis test should take preference” if a corresponding method is not available for calculating an SCI. The reason given is that formal hypothesis testing to establish an effect and creating a CI is different to creating a CI based on a previously established effect from a hypothesis test. For non-inferiority and equivalence studies, the expert group concluded, “compatible simultaneous CIs for the primary endpoint(s) should be presented in all cases”. This is because in these studies, CIs are typically used to make inferences regarding the hypotheses. The authors regard compatibility (that is, the SCI having the same conclusion as the adjusted p value) with the multiple testing procedure as “crucial”.

Channon [ 27 ] is of the opinion that the conclusion from CIs should always match those of p values. Although outlining some methods of creating SCIs, he recommends that step-down multiple test procedures “should not be used in circumstances where the confidence interval is the primary outcome”. This is reasoned using quotes from Hochberg and Tamhane [ 28 ] who say, “if confidence estimates of pairwise differences are desired then the only option is to use one of the single-step procedures”. The examples given within the paper are within a superiority setting; however, the comments made about the use of step-down procedures are applicable to other trial settings.

A similar conclusion is made by Dmitrienko [ 29 ] with regard to the use of SCIs. He concludes that in practice, it is likely that the sponsor would have to use unadjusted CIs for the treatment parameters rather than SCIs if using a stepwise multiple testing procedure to adjust for multiplicity.

It is possible to create approximate SCIs that closely correspond to results from powerful stepwise p value adjustment procedures; however, when attempting to do so, there are implications of a trade-off between complexity and utility. Dmitrienko [ 29 ] provides an example of a simple SCI that corresponds to the step-down Holm procedure; however, these intervals are said to be “completely non-informative” with regard to providing information on the parameter values. Efird and Nielsen [ 30 ] provide a simple method for calculating SCIs based on the Hochberg procedure for log odds ratios.

In order to improve the accuracy and utility of approximated SCIs, more complex methodology is currently under development. Guilbaud [ 31 , 32 ] has developed simultaneous confidence regions that correspond to Holm, Hochberg and Hommel testing procedures. Although these methods are in development and will likely only continue to improve, the complexity of some of these methods may limit how often they will be applied in practice. The decision as to whether or not to use SCIs or an alternative method of analysis should be considered and justified accordingly on an individual trial basis.

Accounting for heterogeneous treatment variances

Another way in which FWER can be inflated or deflated when analysing MANI trials is if treatment variances are heterogeneous. Many methods of analysis and sample size calculations can assume that variances are homogeneous which, while appropriate in some trial scenarios, may not be appropriate for others. Huang et al. [ 7 ] outline the effect of inappropriately assuming homogeneity on the FWER and introduce two alternative procedures that account for heterogeneity (these are outlined in more detail in Appendix B ). However, these methods use hypothesis tests and it is not made clear whether corresponding SCIs can be created from them and whether they would reach the desired level of coverage. Thus, this may feed further into the discussion around whether it is preferable to base inferences on SCIs or on p value/hypothesis test based analysis methods.

Maintaining sufficient power when strongly controlling FWER

The draft FDA guidance on non-inferiority trials [ 33 ] came under criticism from a group of European statisticians who were concerned that there was an imbalance between the recommendations on controlling the overall type I error and ensuring that type II error was also protected [ 1 ]. It follows that by strongly controlling FWER, and using conservative tests, there is a possibility that false hypotheses may not be rejected. This is partly addressed by implementing more powerful testing procedures, which reinforces the importance of creating suitable SCIs corresponding to powerful p value adjustment methods. Hommel and Bretz [ 34 ] warn that there is an element of trade-off between increased power and reproducibility for different multiple test procedures so this also needs to be taken into consideration on a case-by-case basis. The other common but often less popular method of increasing power is to increase the sample size.

MANI trials in practice

In this section, we move on to summarise the current conduct of past and present MANI trials and assess whether the issues outlined in the previous sections are presenting themselves frequently in practice and how well they are being addressed when they arise. Where areas are not being addressed or are not being addressed sufficiently, we will try to identify possible reasons as to why this may be and formulate recommendations on how to improve how issues are dealt with.

Practical examples of MANI trials

Across the four databases searched, 65 examples of MANI trials have been found to date. The original search presented 137 papers, once repeats were removed. Of these, 85 were removed, the majority of which were papers relating to the same trials or abstracts of trials that were already included (28). Twenty-six trials did not have multiple experimental arms and four either did not test both experimental arms for non-inferiority or only included non-inferiority testing as an exploratory measure. Sixteen publications were excluded as they were Cochrane reviews of multiple trials in a disease area and therefore did not include design details of individual trials. The remaining 11 exclusions were for a number of reasons, including phase II or pilot studies, trials on animals or publications being inaccessible. Thirteen further trials were found, either from searching for further details on abstracts, through references from other trials or from methodological papers. The breakdown of this search is shown in Fig. 1 .

Of the 65 found, 21 had been carried out across multiple countries, with the remainder taking place in a variety of individual countries around the world and within a variety of disease areas. The most common disease was cancer (nine), with breast, lung, pancreatic and rectal included among the MANI trials in this area. Six diabetes trials were found and five trials were found in HIV and pregnancy respectively, with all other disease areas having four or fewer trials found. Table 1 gives a summary of the MANI trials and some of their key characteristics and conduct. When assessing whether multiple testing was required, this was counted either based on where trial publications had identified and reasoned the requirement for it or where adjustment was judged to be required based on Howard et al.’s recommendations [ 14 ].

Of the 39 trials identified as requiring adjustment for multiple testing, according to the criteria set out by Howard et al. [ 14 ], 30 (77%) did so. This framework of deciding whether adjustment was required, outlined in the “ Ordering and structuring of hypotheses ” section, was selected as it is straightforward to assess across trials and was utilised in the creation of CONSORT guidance on the topic. However, as also outlined in the “ Ordering and structuring of hypotheses ” section, the topic of adjustment and its requirement in trials has many schools of thought and so this is not the only possible option for assessing the requirement. One trial adjusted for multiple testing despite not being identified as requiring it. Of the adjustment methods implemented, 26 were closed testing procedures or Bonferroni adjustment. Some trials included multiple methods of adjustment for separate endpoints, for example, the LEAD-1 study [ 35 ] used a closed testing procedure to adjust for the primary endpoint while using Dunnett SCIs for the secondary endpoint. Of the ‘other methods’ of adjustment, three used Dunnett SCIs (not shown in the table as all used closed testing procedures for the primary endpoint), two used the Bonferroni-Holm method with p values, two used Tukey’s method of multiple comparisons (one not shown in the table as Bonferroni adjustment used for the primary endpoint), one used a Hochberg based SCI method and one used a Lan DeMets alpha spending function as there were multiple stages within the trial.

Fifteen trials tested for superiority after proving non-inferiority; 11 of these trials implemented a closed testing procedure while four either chose to use Bonferroni adjustment or chose not to adjust at all. Other trials which implemented closed testing procedures did so when deciding whether to test further arms based on the success of previous treatment arms. One such example of this is the trial carried out by Bachelez et al. [ 36 ] where two different doses of an experimental treatment, tofacitinib, were compared against a control treatment, etanercept, and a placebo. In this trial, a fixed sequence procedure was outlined where the higher dose of the experimental treatment was first to be tested for non-inferiority against the control, then if successful, for superiority against the placebo. If this was shown, both steps were to be repeated for the lower dose of the experimental treatment before finally testing the larger, then smaller dose for superiority against the control treatment. As it happened, the higher dose met the non-inferiority and superiority criteria against the control and placebo respectively but the lower dose did not meet the criteria for non-inferiority against the control which meant that no further hypotheses could be tested.
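The logic of a fixed sequence procedure of this kind can be sketched as follows. The ordering mirrors the description above, but the p values and the one-sided level of 0.025 are hypothetical and are not taken from the trial publication.

```python
def fixed_sequence(tests, alpha=0.025):
    """Fixed sequence testing: each hypothesis is tested at the full level
    alpha, in the pre-specified order, and testing stops at the first failure.

    `tests` is an ordered list of (label, p_value) pairs.
    """
    rejected = []
    for label, p_value in tests:
        if p_value > alpha:
            break  # no later hypothesis may be tested
        rejected.append(label)
    return rejected


# Hypothetical p values arranged to mimic the pattern described in the text
sequence = [
    ("high dose: non-inferiority vs control", 0.001),
    ("high dose: superiority vs placebo", 0.001),
    ("low dose: non-inferiority vs control", 0.200),
    ("low dose: superiority vs placebo", 0.001),
    ("high dose: superiority vs control", 0.300),
    ("low dose: superiority vs control", 0.500),
]
print(fixed_sequence(sequence))
```

In this illustration the fourth hypothesis has a small p value but cannot be claimed, because the sequence stopped at the third. This is the price paid for testing every hypothesis at the full significance level.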

Heterogeneity of treatment variance was not considered in any of the MANI trials that were investigated. Trials generally assumed homogeneous variances or did not mention the treatment variance within the publication. Only one trial mentioned the use of a dynamic sample size calculation. This was due to a lack of availability of a sample size formula for their endpoint; Kroz et al. therefore utilised a ‘marginal modelling approach for correlated parameters based on general estimation equations’ [ 37 ] which was developed by Rochon [ 38 ]. They state that due to their underestimation of dropout levels, future trials in that disease area should have larger sample sizes.

Of the trials assessed, only seven included a placebo arm. The reasons for not including a placebo generally were due to ethical issues or recruitment concerns. In many cases, historical data had previously indicated the efficacy of control treatments. The majority of trials chose to implement what most closely resembled a fixed margin approach as outlined within the FDA guidelines. Some did not define an expected effect of the control treatment and only defined a non-inferiority margin, often with minimal explanation as to the decision behind the choice of margin. Of the four trials that are said to have not used it, there was no clear indication of what design they implemented.

Statistical considerations of MANI trials in practice

Table 1 shows that while current and previous MANI trials deal with some of the statistical issues identified in this paper, there are issues that either rarely arise, or are rarely addressed. In general, there was an indication that trials showed a preference towards assumptions and methods that would keep design and analysis simple. In many cases, the failure to be explicit in providing reasoning around design choices may have been due to a lack of awareness of the issues raised within the “ Statistical considerations for MANI trials ” section.

The trials identified generally set up their hypotheses in a manner similar to the fixed margin approach outlined by the FDA in their guidance [ 19 ]. The majority did not have placebo arms and did not define an expected effect of the control arm. The trials that did use a placebo stated that they used pre-specified non-inferiority margins that were given in terms of the endpoint rather than as a percentage of the control treatment. Despite failing to define all quantities used in the FDA’s definition of a fixed-margin design, these values may have been known and used implicitly when setting a non-inferiority margin in many of the MANI trials and therefore they have been counted as fixed-margin designs. No trials used the fraction-based approaches mentioned by Huang [ 7 ]; this is likely due to the requirement of a placebo arm in calculating an overall test statistic, while the fixed margin approach uses a separate assay sensitivity hypothesis which may not require testing if suitable evidence exists to suggest efficacy in the active control arm.

Almost a quarter of the trials included the option to test for superiority once non-inferiority had been concluded. The majority of these trials utilised a closed testing procedure in order to adjust for multiple testing. This seems to be a strong and simple approach to follow when superiority is of interest and is clearly far more efficient than carrying out separate trials to test non-inferiority and superiority.

While power was mentioned in almost every trial, almost half did not explicitly state or strongly imply which type was used. Three trials did consider two types of power, looking at power for individual tests before giving an ‘overall’ power. In some cases, it may have been assumed that the choice would either be known or could be inferred.

The closest that any trial came to sample size optimisation was the dynamic sample size calculation summarised in the “ Practical examples of MANI trials ” section, for which the resultant sample size was seen to provide insufficient power to provide strong conclusions for the trial. This method was implemented due to a lack of availability of a sample size calculation for the chosen testing strategy rather than to reduce the overall sample size. There is potential for sample size optimisation to be useful within MANI trials, particularly where funding is an issue as non-inferiority trials can require greater sample sizes than other trials. However, its utility must be taken into consideration within the context of the trial, taking into account whether it is key to reduce sample size and how confident investigators are in their estimates of treatment effect and variance.

It was noted that there were no trials that mentioned having heterogeneous treatment variances or that mentioned carrying out any kind of assessment as to whether an assumption of variance homogeneity was justified. While it may be possible that all 65 trials were in areas where treatment variances are all homogeneous, this seems unlikely. Without knowledge of the treatments and the disease areas, it is difficult to know whether the assumption of homogeneous variances (either explicitly or implicitly stated within the trial publications) is fair. The publications outlining the available analysis methods mentioned in the heterogeneity section have not been cited in any MANI trial publications to date, which could mean that either they are not well enough known by those carrying out such trials or that there has simply not been a need for them to date. Nonetheless, this is an area that may require more careful consideration in trials going forward.

Adjustment for multiple testing was required for more than half of the trials found. While the majority of trials that required it implemented at least one method of adjustment, just over 20% of trials that were identified to require adjustment failed to do so, with none of them providing a reason for not adjusting. As with the choice of power, this may be due to a lack of awareness of the potential requirement for adjustment or to the investigators believing that the reasoning could be inferred. It may be that in cases where multiple doses of the same treatment are being tested, investigators see these as separate hypotheses rather than as part of the same family of hypotheses. If this is the case and each arm was tested in its own right and did not provide an indication of overall efficacy of a treatment, then adjustment would not be required.

When considering adjustment for multiple testing in the analysis phase, the choices of adjustment method reflected the preference to utilise simple, well known methods over more powerful but more complex methods. While many trials used Bonferroni adjustment, there were five other methods of adjustments implemented across nine trials. Some trials chose to implement more powerful p value adjustment methods rather than creating less powerful SCIs while others chose to use Dunnett SCIs which are more powerful than Bonferroni adjustment but more complex to understand. While the figures showed a preference for utilising CIs rather than switching to p values when analysing MANI trials, the majority sacrificed power to detect non-inferiority to do this.

Although it appears that trial teams prefer to implement simple adjustment methods within the analysis phase, in a similar manner to adjustment within the design process, trials generally do not justify their choice of adjustment method so it is difficult to know the decision process when selecting one. When choosing an adjustment method, it is vital to take the context of the trial into account. If a small number of treatments are being tested, raising the chances of a type I error by a small amount, then the conservativeness of simple adjustment methods such as Bonferroni may be acceptable in order to take advantage of their simplicity of implementation. In cases where differences between treatment effects are expected to be small, or only just above the non-inferiority margin, more powerful adjustment methods may be required to ensure that where truly non-inferior treatments exist, they are found. In this case, it would be at the discretion of the investigator as to whether they would prefer the exact results offered by p values over the ease of interpretation offered by approximate SCIs.

Recommendations

In this section, we provide specific recommendations for the issues raised earlier within the paper that must be considered when carrying out MANI trials. We further give examples from previous MANI trial publications where issues have appeared and been addressed appropriately. It is important to note that issues should be considered within the context of each trial and thus these recommendations may not provide a completely exhaustive summary.

Clear definition of the structure of all hypotheses

It is vital to be clear within the design phase about how the hypotheses of a MANI trial are to be structured. The best choice of structuring is dependent on the aims of the trial and which hypotheses are of the most interest. Carrying out analysis for multiple hypotheses can increase the chance of observing a type I error and this may require addressing if hypotheses are used together to form a single claim of efficacy for a treatment. The decision as to whether or not adjustment is required and, if applicable, whether this adjustment will be done within the design of the trial or when carrying out the analysis should be clearly outlined and reasoned within any trial documentation (e.g. the protocol and statistical analysis plan). It is particularly important to outline reasons for not adjusting if it is not adjudged to be necessary.

If adjustment is adjudged to be required, it can be addressed within the hypothesis setup by implementing a closed testing procedure or similar sequential testing approach where certain hypotheses are only tested upon the rejection of a previous hypothesis. In this case, the order in which hypotheses are to be tested should be considered and specified. This is generally useful when testing different doses or testing schedules of the same treatment in a trial, for example, the trial outlined by Bachelez et al. [ 36 ] tested different doses of tofacitinib within a closed testing procedure where the lower doses of the experimental treatment were only tested if the higher doses were seen to be non-inferior to the control.

Where it is of interest to know the results of all hypotheses, then sequential testing procedures are not appropriate to use and adjustment may be required within the analysis phase of the trial. If the treatment arms are not related or the trial hypotheses do not form a single claim of efficacy for a treatment then adjustment for multiple testing may not be required as outlined in the “ Ordering and structuring of hypotheses ” section.

Clear definition of the choice of power type

The choice of power type is strongly linked to the choice of hypothesis structure. If all hypotheses are of identical interest within a trial, then the power used should be the all-pair power as outlined in the “ Ordering and structuring of hypotheses ” section. If only one arm is of interest, then either the any-pair power (probability of correctly rejecting at least one false hypothesis) or the per-pair power (probability of correctly rejecting a specified hypothesis) could be utilised. It is important to be clear in the choice of power or to define multiple powers if they are of interest. The MAGENTA trial [ 39 ] provides a good example of defining the two different powers they implemented within their trial sample size calculation. In this trial protocol paper, both the any-pair power and all-pair power are defined (expressed as power for each test and overall power).
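The difference between the three power types can be shown with a short numerical sketch. It assumes one-sided tests based on normally distributed statistics with unit variance, a common correlation of 0.5 from the shared control arm, and an unadjusted critical value; the all-pair power is taken here to mean the probability of rejecting every false hypothesis. The function name, arguments and effect sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()


def power_summary(deltas, crit=1.959964, rho=0.5, grid=2001, span=8.0):
    """Per-pair, any-pair and all-pair power for equicorrelated test statistics.

    Assumptions: Z_i ~ N(delta_i, 1) with common correlation rho, and
    hypothesis i is rejected when Z_i > crit. The shared component is
    integrated out with the trapezoidal rule.
    """
    step = 2 * span / (grid - 1)
    none_rejected = 0.0
    all_rejected = 0.0
    for j in range(grid):
        u = -span + j * step
        weight = (0.5 if j in (0, grid - 1) else 1.0) * _STD.pdf(u) * step
        p_none = 1.0
        p_all = 1.0
        for delta in deltas:
            below = _STD.cdf((crit - delta + sqrt(rho) * u) / sqrt(1 - rho))
            p_none *= below
            p_all *= 1 - below
        none_rejected += weight * p_none
        all_rejected += weight * p_all
    return {
        "per_pair": [1 - _STD.cdf(crit - delta) for delta in deltas],
        "any_pair": 1 - none_rejected,
        "all_pair": all_rejected,
    }


# Two experimental arms with the same hypothetical standardised effect
print(power_summary([3.0, 3.0]))
```

For the same design the all-pair power is the smallest of the three and the any-pair power the largest, so a sample size justified by one type can be badly underpowered for another. This is why the type used needs to be stated explicitly.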

A priori specification of any change in hypothesis type

It is possible to include a change in hypothesis type in a MANI trial; generally, this is testing for superiority once non-inferiority has been proven. The easiest way to do this is to set up a closed testing procedure or similar sequential testing procedure, as is seen in the BRIGHTER trial [ 40 ] where for the key secondary endpoint, the experimental treatment was tested for superiority once non-inferiority was concluded, as part of a closed testing procedure. When including a hypothesis change in a trial, it is important to specify this before the trial is carried out in accordance with the Committee for Proprietary Medicinal Products guidance on switching between superiority and non-inferiority [ 18 ]. This guidance outlines the general considerations potentially required when switching hypotheses for MANI trials and non-inferiority trials that fall outside the definition of a MANI trial.

Clear definition of the choice of non-inferiority margin and subsequent analysis method

In the “Choice of non-inferiority margin definition and subsequent analysis method when carrying out MANI trials” section, several methods for defining the non-inferiority margin and subsequent analysis methods were outlined. Trials generally chose to utilise the well-known fixed-margin approach. The CONCENTRATE trial [ 41 ] implemented this approach and provided a thorough explanation of how the non-inferiority margin had been selected based on results seen in previous studies in the same disease area. Yuan et al. [ 42 ] also provide a shorter but sufficiently clear definition of their non-inferiority margin and its derivation. This is a perfectly adequate method to use and has the added advantage of not requiring the inclusion of a placebo arm, which is not true for all methods. The specially designed MANI versions of the fixed-margin approach outlined in the same section and detailed further in Appendix B may be worth consideration, as they can include adjustments for multiple testing and can provide an increase in power for tests. The normal fixed-margin approach should be considered the standard for MANI trials, with explanation only required if a different approach is chosen.
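
One common way of deriving a fixed margin from previous studies is to take a conservative estimate of the active control's effect (often labelled M1) and then choose a smaller margin (M2) that preserves a fraction of it. The sketch below assumes a normal approximation, a historical estimate where larger values favour the control over placebo, and a 50% preserved fraction; all numbers and names are hypothetical.

```python
from statistics import NormalDist


def fixed_margin(hist_effect, hist_se, preserve_fraction=0.5, conf_level=0.95):
    """Derive M1 and M2 from a historical control-versus-placebo estimate."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    m1 = hist_effect - z * hist_se        # conservative estimate of the control effect
    if m1 <= 0:
        raise ValueError("historical evidence does not show a control effect")
    m2 = (1 - preserve_fraction) * m1     # margin used in the new trial
    return m1, m2


# Hypothetical historical estimate: effect 1.0 with standard error 0.2.
m1, m2 = fixed_margin(hist_effect=1.0, hist_se=0.2)
```

Whatever derivation is used, the reasoning behind the preserved fraction and the historical data chosen should be reported alongside the margin.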

A priori specification of the choice of adjustment method for multiple testing (where required) within a MANI trial analysis with reasoning

If it is determined that adjustment for multiple testing is required at the analysis stage of a MANI trial (see recommendations on ordering and structuring of hypotheses for when this may be the case), then the choice for whether to analyse using SCIs or adjusted p values lies with the trial team. It may be preferable to implement two types of adjustment in a trial; for example, LEAD-1 [ 35 ] implemented Dunnett testing for the primary endpoint of their trial and Bonferroni testing for the safety endpoints, providing clear explanations for each. The decision as to which adjustment methods are implemented should be outlined in trial documents at least briefly, with reasoning given for choice of method if potentially unclear.

One of the strongest available options for carrying out adjustment is to use parametric methods such as Dunnett t testing and Tukey’s honestly significant difference. These methods provide correspondence between p values and SCIs, which is not true for the stepwise adjustment methods outlined in the “Simultaneous confidence interval compatibility with p value adjustment for controlling family-wise error rate” section. They are also more powerful than single-step methods of adjustment such as Bonferroni testing, which addresses some of the concerns outlined in the “Maintaining sufficient power when strongly controlling FWER” section. Finally, they take account of the potential correlation induced by having a shared control group, whereas many single-step and stepwise procedures assume that tests are carried out independently.
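
The gain from accounting for the shared control group can be illustrated by simulating the critical value for the largest of several correlated test statistics and comparing it with the Bonferroni value. This is a Monte Carlo approximation under a normal model with equal allocation and known variance, not an exact Dunnett calculation; the function name and settings are assumptions for this example.

```python
import random
from statistics import NormalDist


def many_to_one_critical_value(k, alpha=0.025, n_sims=50000, seed=7):
    """Simulated one-sided critical value for k comparisons with one control."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        control = rng.gauss(0, 1)
        # Every statistic reuses the same control draw, so each pair of
        # statistics has correlation 0.5 under equal allocation.
        stats = [(rng.gauss(0, 1) - control) / 2 ** 0.5 for _ in range(k)]
        maxima.append(max(stats))
    maxima.sort()
    return maxima[int((1 - alpha) * n_sims)]


k = 3
shared_control_crit = many_to_one_critical_value(k)
bonferroni_crit = NormalDist().inv_cdf(1 - 0.025 / k)
```

A smaller critical value for the same family-wise level translates directly into narrower simultaneous intervals and higher power.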

The main reasons for which it may be preferable to utilise a single-step method such as Bonferroni over the parametric methods would be simplicity of implementation and a preference towards being conservative when making inferences about treatments. If using a single-step approach, it is preferable to create adjusted CIs as these provide more detail on the uncertainty of estimates and exact numerical values of treatment differences compared to p values.
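
Bonferroni-adjusted confidence intervals are straightforward to construct. The sketch below assumes normally distributed estimates with known standard errors and two-sided intervals; the function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_cis(estimates, std_errors, alpha=0.05):
    """Two-sided intervals with family-wise coverage of at least 1 - alpha."""
    m = len(estimates)
    # Split alpha equally across the m comparisons and the two tails.
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return [(est - z * se, est + z * se)
            for est, se in zip(estimates, std_errors)]


# Hypothetical treatment differences for two experimental arms.
cis = bonferroni_cis([0.0, 0.5], [1.0, 1.0])
```

Each interval can then be compared directly with the non-inferiority margin, which keeps the size of the treatment difference visible in a way that an adjusted p value does not.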

If the context of the trial dictates that stepwise adjustment techniques should be implemented, then, based on current research, it is preferable to use adjusted p values or test statistics and critical values, as outlined in the “Simultaneous confidence interval compatibility with p value adjustment for controlling family-wise error rate” section and Appendix B . Although methods for creating approximate corresponding SCIs are available, their complexity, together with the availability of alternative methods for creating SCIs, means that they may not be deemed worthwhile to implement.
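
As an example of a stepwise adjustment, Holm step-down adjusted p values can be computed in a few lines. This is a generic sketch of the Holm procedure, not one of the MANI-specific stepwise methods described in Appendix B, and the input p values are hypothetical.

```python
def holm_adjusted(p_values):
    """Holm step-down adjusted p values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)   # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


adjusted = holm_adjusted([0.01, 0.04, 0.03])
```

A hypothesis is rejected at family-wise level alpha when its adjusted p value is at most alpha.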

Consideration of the potential requirement for adjustment for heterogeneous treatment variances

Whether adjustment for heterogeneous treatment variances is required within a trial analysis should be considered. The need for it will depend entirely on the disease and treatment area and on the confidence with which an assumption of homogeneous treatment variance can be made. If the treatment variances are assumed homogeneous, this should be stated with reasoning within any trial publications where treatment analysis will take place, with a full understanding of the implications of falsely making such an assumption. Where the assumption is not appropriate, the analysis methods should be altered to take variance heterogeneity into account, for example by not using pooled variance estimators; otherwise, there is a risk of inflation or deflation of the FWER and potential loss of power. Appropriate methods of accounting for heterogeneous treatment variances are signposted in the “Accounting for heterogeneous treatment variances” section and within Appendix B .
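
The effect of pooling can be seen by comparing pooled and unpooled standard errors for a single comparison. The sketch below uses the generic Welch-Satterthwaite approach as an illustration; it is not one of the MANI-specific methods signposted above, and the sample sizes and standard deviations are hypothetical.

```python
def pooled_se(s1, n1, s2, n2):
    """Standard error of a mean difference assuming a common variance."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (sp2 * (1 / n1 + 1 / n2)) ** 0.5


def unpooled_se(s1, n1, s2, n2):
    """Standard error of a mean difference allowing unequal variances."""
    return (s1 ** 2 / n1 + s2 ** 2 / n2) ** 0.5


def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Large, low-variance control arm against a small, high-variance arm.
se_pooled = pooled_se(1.0, 100, 3.0, 25)
se_unpooled = unpooled_se(1.0, 100, 3.0, 25)
df = welch_df(1.0, 100, 3.0, 25)
```

In this configuration the pooled estimator understates the standard error, which is the kind of situation in which the error rate can be inflated; with equal sample sizes and equal variances the two estimators coincide.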

Consideration of the potential use of sample size optimising techniques

This is an optional consideration that can be implemented within the design phase of a MANI trial, potentially after the definition of non-inferiority margin and subsequent analysis method. Non-inferiority trials can require a large number of participants and thus it may be of interest to try to reduce the overall sample size of the trial, potentially with adjusted allocation ratios. Kwong et al. [ 9 ] provide some examples of carrying out sample size optimisation in MANI trials in their paper. If carrying out sample size optimisation, it is important to ensure that a sufficient level of power is maintained within the trial.
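
One simple illustration of an adjusted allocation ratio is the square-root rule for comparisons with a shared control: for a fixed total sample size and equal variances, the variance of each experimental-versus-control difference is smallest when the control arm is larger than each experimental arm by a factor of the square root of the number of experimental arms. The sketch below shows this rule only; it is not claimed to be the optimisation used by Kwong et al., and the function names are hypothetical.

```python
def contrast_variance(n_control, n_arm, sigma=1.0):
    """Variance of one experimental-minus-control difference in means."""
    return sigma ** 2 * (1 / n_control + 1 / n_arm)


def sqrt_k_allocation(total_n, k):
    """Split total_n so the control arm is sqrt(k) times each of k experimental arms."""
    ratio = k ** 0.5
    n_arm = total_n / (k + ratio)
    return ratio * n_arm, n_arm


# Hypothetical trial: 600 participants, four experimental arms, one control.
n_control, n_arm = sqrt_k_allocation(600, 4)
var_sqrt_rule = contrast_variance(n_control, n_arm)
var_equal = contrast_variance(120, 120)
```

Any allocation chosen this way should still be checked against the power definition adopted for the trial.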

In this paper, we have sought to identify some of the statistical issues that can occur when designing and analysing Multi-Arm Non-Inferiority (MANI) trials, in order to provide advice on considerations that should be made and to recommend the most suitable methods to deal with the issues where possible.

From our review of current MANI trials, it is clear that, with respect to the issues we identified, the obvious first area for improvement is in defining and explaining the choices made for the design and analysis of trials in a clear and concise manner. Explanations may be more difficult to include in publications due to word counts, but they can be included briefly, referenced or placed in an appendix. This point applies to all considerations outlined within the “Statistical considerations for MANI trials” section.

One issue that is of key importance throughout the methodology is control of the FWER. It arises at both the design and the analysis stage of a trial. The design of the trial and the structure of its hypotheses determine whether adjustment is required in the first place, and the adjustment process itself requires a choice of method and consideration of the assumptions made about the data. Deciding whether to adjust requires reflection on the debate surrounding this topic. It is worth noting that, while the review of MANI trials in this paper found that many of the trials adjudged to require adjustment had not adjusted, this judgement was based on applying the philosophy of one of the schools of thought in this area, and the numbers could have been different had a different philosophy been applied.

There were mixed messages across the literature in some areas, particularly regarding adjustment for multiple testing at the design phase with compatibility issues between SCIs and adjusted p values and the potential advantages and disadvantages of using each within an analysis. Further research may be required into the development of SCIs that correspond to common stepwise p value adjustment methods, particularly in creating methods where complexity does not potentially inhibit the ability to apply them widely in trial research. However, the parametric methods provide a strong alternative to stepwise methods and are relatively easy to apply.

The number of MANI trials found in the literature search illustrated that there is certainly a place for such trials in modern research. Their benefits in terms of increasing efficiency in finding treatments that may be more cost effective, easier to administer or safer are clear. Their use could increase in the future, particularly as a greater number of treatments become available in disease areas and endpoints such as quality of life become more important. There may be a number of reasons for their limited implementation to date. It may be that lack of familiarity with the design, or lack of available guidance on how to conduct MANI trials, discourages investigators from using them, or that general issues around multi-arm and non-inferiority trials, such as increased sample sizes, discourage funding bodies from financing them.

The methodological literature search illustrated that MANI trials are an area in which there has not previously been a large amount of research, with little definitive guidance available on how to run a MANI trial or on the statistical considerations needed beyond the standard considerations for multi-arm and non-inferiority trials individually. The small number of publications about methodology within MANI trials indicates that there may be room for further exploration of the topic. It was the aim of this review to keep the search terms relatively wide in order to ensure that no publications around MANI trials would be missed. However, a potential limitation is the number of publications that were found from looking at references and citations of publications rather than from the database searches themselves. This could indicate that publications were missed in the search process which could have added to the scope of this paper.

The trials analysed in this paper appeared to prefer simple design and analysis choices and methods which, while not necessarily an issue on an individual basis, will not always be suitable for future trials. With the potential for MANI trials to be applied more widely in future, it is important that guidance is available so that those conducting them are aware of the potential issues and how to address them. This paper aims to provide insight into these issues and a resource to refer to when setting up a new MANI trial.

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

Abbreviations

CI: Confidence interval

EMA: European Medicines Agency

FDA: Food and Drug Administration (US)

FWER: Family-wise error rate

IU: Intersection union

MANI: Multi-arm non-inferiority

NHS: National Health Service

NI: Non-inferiority

NIHR: National Institute of Health Research

SCI: Simultaneous confidence interval

UI: Union-intersection

References

1. Huitfeldt B, Hummel J, European Federation of Statisticians in the Pharmaceutical Industry (EFSPI). The draft FDA guideline on non-inferiority clinical trials: a critical review from European pharmaceutical industry statisticians. Pharm Stat. 2011;10(5):414–9. https://doi.org/10.1002/pst.508

2. Rohmel J, Pigeot I. A comparison of multiple testing procedures for the gold standard non-inferiority trial. J Biopharm Stat. 2010;20(5):911–26. https://doi.org/10.1080/10543401003618942

3. European Medicines Agency. Committee for Medicinal Products for Human Use (CHMP) guideline on the choice of the non-inferiority margin. Stat Med. 2006;25(10):1628.

4. Chang M. Multiple-arm superiority and non-inferiority designs with various endpoints. Pharm Stat. 2007;6(1):43–52. https://doi.org/10.1002/pst.242

5. Hasler M, Hothorn LA. Simultaneous confidence intervals on multivariate non-inferiority. Stat Med. 2012;32(10):1720–9. https://doi.org/10.1002/sim.5633

6. Hasler M. Multiple comparisons to both a negative and a positive control. Pharm Stat. 2012;11(1):74–81. https://doi.org/10.1002/pst.503

7. Huang LC, Wen MJ, Cheung SH. Noninferiority studies with multiple new treatments and heterogeneous variances. J Biopharm Stat. 2015;25(5):958–71. https://doi.org/10.1080/10543406.2014.920346

8. Kwong KS, Cheung SH, Hayter AJ. Step-up procedures for non-inferiority tests with multiple experimental treatments. Stat Methods Med Res. 2016;25(4):1290–302. https://doi.org/10.1177/0962280213477767

9. Kwong KS, Cheung SH, Hayter AJ, Wen MJ. Extension of three-arm non-inferiority studies to trials with multiple new treatments. Stat Med. 2012;31(24):2833–43. https://doi.org/10.1002/sim.5467

10. Lawrence J. Testing non-inferiority and superiority for two endpoints for several treatments with a control. Pharm Stat. 2011;10(4):318–24. https://doi.org/10.1002/pst.468

11. Quartagno M, Walker AS, Carpenter JR, Phillips PPJ, Parmar MKB. Rethinking non-inferiority: a practical trial design for optimising treatment duration. Clin Trials. 2018;15(5):477–88. https://doi.org/10.1177/1740774518778027

12. Zhong J, Wen MJ, Kwong KS, Cheung SH. Testing of non-inferiority and superiority for three-arm clinical studies with multiple experimental treatments. Stat Methods Med Res. 2018;27(6):1751–65. https://doi.org/10.1177/0962280216668913

13. Juszczak E, Altman DG, Hopewell S, Schulz K. Reporting of multi-arm parallel-group randomized trials: extension of the CONSORT 2010 statement. JAMA. 2019;321(16):1610–20. https://doi.org/10.1001/jama.2019.3087

14. Howard DR, Brown JM, Todd S, Gregory WM. Recommendations on multiple testing adjustment in multi-arm trials with a shared control group. Stat Methods Med Res. 2018;27(5):1513–30. https://doi.org/10.1177/0962280216664759

15. Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment, vol. 279. New York: Wiley; 1993.

16. Stewart WH, Ruberg SJ. Detecting dose response with contrasts. Stat Med. 2000;19(7):913–21. https://doi.org/10.1002/(SICI)1097-0258(20000415)19:7<913::AID-SIM397>3.0.CO;2-2

17. Dmitrienko A, Tamhane AC, Bretz F. Multiple testing problems in pharmaceutical statistics. New York: CRC Press; 2009.

18. Committee for Proprietary Medicinal Products. Points to consider on switching between superiority and non-inferiority. Br J Clin Pharm. 2001;52(3):223–8.

19. US Food and Drug Administration. Non-inferiority clinical trials to establish effectiveness: guidance for industry. Silver Spring: US Food and Drug Administration; 2016.

20. Ke C, Ding B, Jiang Q, Snapinn SM. The issue of multiplicity in noninferiority studies. Clin Trials. 2012;9(6):730–5. https://doi.org/10.1177/1740774512455370

21. European Medicines Agency. Choice of control group and related issues in clinical trials E10, vol. 10; 2000.

22. Pigeot I, Schäfer J, Röhmel J, Hauschke D. Assessing non-inferiority of a new treatment in a three-arm clinical trial including a placebo. Stat Med. 2003;22(6):883–99. https://doi.org/10.1002/sim.1450

23. Strassburger K, Bretz F. Compatible simultaneous lower confidence bounds for the Holm procedure and other Bonferroni-based closed tests. Stat Med. 2008;27(24):4914–27. https://doi.org/10.1002/sim.3338

24. Dunnett CW. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc. 1955;50(272):1096–121. https://doi.org/10.1080/01621459.1955.10501294

25. Dunnett CW, Tamhane AC. Step-down multiple tests for comparing treatments with a control in unbalanced one-way layouts. Stat Med. 1991;10(6):939–47. https://doi.org/10.1002/sim.4780100614

26. Phillips A, Fletcher C, Atkinson G, Channon E, Douiri A, Jaki T, et al. Multiplicity: discussion points from the Statisticians in the Pharmaceutical Industry multiplicity expert group. Pharm Stat. 2013;12(5):255–9. https://doi.org/10.1002/pst.1584

27. Channon EJ, McEntegart DJ. Confidence intervals and p-values for Williams' and other step-down multiple comparison tests against control. J Biopharm Stat. 2011;11(1-2):45–63.

28. Hochberg Y, Tamhane AC. Multiple comparison procedures. New York: Wiley; 1987.

29. Dmitrienko A, D'Agostino R Sr. Traditional multiplicity adjustment methods in clinical trials. Stat Med. 2013;32(29):5172–218. https://doi.org/10.1002/sim.5990

30. Efird JT, Nielsen SS. A method to compute multiplicity corrected confidence intervals for odds ratios and other relative effect estimates. Int J Environ Res Public Health. 2008;5(5):394–8. https://doi.org/10.3390/ijerph5050394

31. Guilbaud O. Simultaneous confidence regions for closed tests, including Holm-, Hochberg-, and Hommel-related procedures. Biom J. 2012;54(3):317–42. https://doi.org/10.1002/bimj.201100123

32. Guilbaud O. Simultaneous confidence regions corresponding to Holm's step-down procedure and other closed-testing procedures. Biom J. 2008;50(5):678–92. https://doi.org/10.1002/bimj.200710449

33. Food and Drug Administration. Guidance for industry: non-inferiority clinical trials draft guidance. Silver Spring: FDA; 2010.

34. Hommel G, Bretz F. Aesthetics and power considerations in multiple testing–a contradiction? Biom J. 2008;50(5):657–66. https://doi.org/10.1002/bimj.200710463

35. Marre M, Shaw J, Brändle M, Bebakar WMW, Kamaruddin NA, Strand J, et al. Liraglutide, a once-daily human GLP-1 analogue, added to a sulphonylurea over 26 weeks produces greater improvements in glycaemic and weight control compared with adding rosiglitazone or placebo in subjects with type 2 diabetes (LEAD-1 SU). Diabetic Med. 2009;26(3):268–78. https://doi.org/10.1111/j.1464-5491.2009.02666.x

36. Bachelez H, van de Kerkhof PCM, Strohal R, Kubanov A, Valenzuela F, Lee JH, et al. Tofacitinib versus etanercept or placebo in moderate-to-severe chronic plaque psoriasis: a phase 3 randomised non-inferiority trial. Lancet. 2015;386(9993):552–61. https://doi.org/10.1016/S0140-6736(14)62113-9

37. Kroz M, et al. Impact of a combined multimodal-aerobic and multimodal intervention compared to standard aerobic treatment in breast cancer survivors with chronic cancer-related fatigue - results of a three-armed pragmatic trial in a comprehensive cohort design. BMC Cancer. 2017;17(1):166.

38. Rochon J. Application of GEE procedures for sample size calculations in repeated measures experiments. Stat Med. 1998;17(14):1643–58. https://doi.org/10.1002/(SICI)1097-0258(19980730)17:14<1643::AID-SIM869>3.0.CO;2-3

39. Rayes N, et al. MAGENTA (Making Genetic testing accessible): a prospective randomized controlled trial comparing online genetic education and telephone genetic counseling for hereditary cancer genetic testing. BMC Cancer. 2019;19(1):648.

40. Tadayoni R, Waldstein SM, Boscia F, Gerding H, Gekkieva M, Barnes E, et al. Sustained benefits of ranibizumab with or without laser in branch retinal vein occlusion: 24-month results of the BRIGHTER study. Ophthalmology. 2017;124(12):1778–87. https://doi.org/10.1016/j.ophtha.2017.06.027

41. Im DJ, et al. Comparison of coronary computed tomography angiography image quality with high- and low-concentration contrast agents (CONCENTRATE): study protocol for a randomized controlled trial. Trials. 2016;17(1):315.

42. Yuan Q, et al. Transmuscular quadratus lumborum block versus thoracic paravertebral block for acute pain and quality of recovery after laparoscopic renal surgery: study protocol for a randomized controlled trial. Trials. 2019;20(1):276.

Acknowledgements

Not applicable

JE is funded by a National Institute of Health Research (NIHR) Research Methods Fellowship. This paper presents independent research funded by the NIHR. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

JMB was supported by Core Clinical Trials Unit Infrastructure from Cancer Research UK (C7852/A25447).

Author information

Authors and affiliations.

Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, LS2 9JT, UK

Jake Emmerson & Julia M. Brown

Department of Mathematics and Statistics, University of Reading, Reading, RG6 6AX, UK

Contributions

JE undertook this work as part of a National Institute of Health Research (NIHR) Research Methods Fellowship under the supervision of ST and JMB. All authors were involved in drafting and approving the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jake Emmerson .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Literature review search strategies by database

(non-inferior*) or (non?inferior*)

(three arm*) or (four arm*) or (five arm*) or (six arm*) or (seven arm*) or (multi* arm*).mp.

Web of Science

TOPIC: ((“non-inferior*” or “non?inferior*”)) AND TOPIC: ((“three arm*” or “four arm*” or “five arm*” or “six arm*” or “seven arm*” or “multi* arm*”))

Cochrane Library

MeSH descriptor: [Research Design] explode all trees

MeSH descriptor: [Clinical Trials as Topic] explode all trees

(non inferior* or non??inferior*) and (three arm*) or (four arm*) or (five arm*) or (six arm*) or (seven arm*) or (multi* arm*)

Appendix B

This appendix provides further detail on some of the methodological ideas introduced in the MANI-specific literature mentioned in the earlier sections. It was not included in those sections so that the extra information did not distract from the primary aims of the paper.

Alternative methods of analysing MANI trials have been outlined and developed by Kwong [ 8 , 9 ] and Zhong [ 12 ]. Their methods involve calculating test statistics and comparing them with critical values that they derive; single-step, step-up and step-down procedures have been developed that can test for both non-inferiority and superiority (upon proving non-inferiority). These methods more closely resemble p value adjustment methods, as they involve ordering arms by significance and increasing or decreasing critical values, but they do not directly correspond to the stepwise p value adjustment methods that have been mentioned, including Holm and Hochberg. They work under the fixed-margin hypothesis framework outlined in the FDA guidance [ 19 ] on carrying out non-inferiority analyses and so generally assume the inclusion of a placebo arm within a trial. They are all designed to strongly control the FWER, and their power to detect non-inferiority and superiority is compared and is generally satisfactory. There is no indication within their papers of how these methods compare with p value adjustment methods; however, their ability to test for both non-inferiority and superiority within the same framework and their specific gearing towards MANI trials, alongside their simplicity compared with some of the approximate SCI methods, are all positive indicators. One disadvantage of these methods is that inferences based on test statistics, like those based on p values, convey the magnitude of a treatment effect less readily than SCIs.

Kwong’s first paper [ 8 ] outlined an extension to the fixed-margin approach in the FDA guidance which allowed it to be utilised in MANI trials, and created appropriate test statistics for the single-step procedure that it described. Huang et al. [ 7 ] utilised Kwong’s single-step procedure as the comparator for the two methods they assessed for accounting for treatment variance heterogeneity when analysing MANI trials. The methods assessed by Huang (both single-step) were shown to control FWER well for both homogeneous and heterogeneous treatment variances and to maintain a similar level of power to Kwong’s method. These methods were published prior to Kwong’s and Zhong’s papers on stepwise testing procedures and thus far have not been formally compared with the methods from these papers, nor has a stepwise method accounting for treatment variance heterogeneity been formally outlined.

Hasler [ 6 ] considered similar areas to Huang’s paper and derived separate test statistics, suitable for homogeneous or heterogeneous treatment variances, for trials that include a placebo. These were created from a fraction-based hypothesis framework and take multiple testing into account where it is required. Hasler also derives lower SCIs for the ratios of differences before going on to discuss any-pair power and allocation optimisation. No comparisons are made between Hasler’s methods and any other method, and they are not applied to formal examples to test their ability to control FWER.

While the methods outlined by Hasler and Huang have their individual merits, they both rely on the inclusion of a placebo within a trial. There is also a lack of comparisons between these methods and other available methods to assess whether they truly perform better in the presence of treatment variance heterogeneity. In a trial setting, this variance heterogeneity may not be tested for, either due to prior knowledge of the treatment arms or due to lack of awareness of the issue and its potential consequences. However, Huang’s results regarding the potential inflation of FWER if heterogeneity is not taken into account suggest that it is something that should be considered and tested for.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Emmerson, J., Todd, S. & Brown, J.M. Recommendations for designing and analysing multi-arm non-inferiority trials: a review of methodology and current practice. Trials 22 , 417 (2021). https://doi.org/10.1186/s13063-021-05364-9

Received : 17 March 2021

Accepted : 09 June 2021

Published : 26 June 2021

DOI : https://doi.org/10.1186/s13063-021-05364-9

Clinical trials
Multiple testing
Family-wise error
Stepwise adjustment
Simultaneous confidence intervals
Heterogeneous variances

ISSN: 1745-6215

Principles of Clinical Trials: Bias and Precision Control

Randomization, Stratification, and Minimization

Reference work entry
First Online: 20 July 2022

Fan-fan Yu

The fundamental difference distinguishing observational studies from clinical trials is randomization. This chapter provides a practical guide to concepts of randomization that are widely used in clinical trials. It starts by describing bias and potential confounding arising from allocating people to treatment groups in a predictable way. It then presents the concept of randomization, starting from a simple coin flip, and sequentially introduces methods with additional restrictions to account for better balance of the groups with respect to known (measured) and unknown (unmeasured) variables. These include descriptions and examples of complete randomization and permuted block designs. The text briefly describes biased coin designs that extend this family of designs. Stratification is introduced as a way to provide treatment balance on specific covariates and covariate combinations, and an adaptive counterpart of biased coin designs, minimization, is described. The chapter concludes with some practical considerations when creating and implementing randomization schedules.

By the chapter’s end, statisticians or clinicians designing a trial should be able to distinguish, in general terms, which assignment methods may fit the needs of their trial and whether stratifying by prognostic variables may be appropriate. The statistical properties of the methods are left to the individual references at the end.

Buyse M (2000) Centralized treatment allocation in comparative clinical trials. Applied Clinical Trials 9:32–37

Byar D, Simon R, Friendewald W, Schlesselman J, DeMets D, Ellenberg J, Gail M, Ware J (1976) Randomized clinical trials – perspectives on some recent ideas. N Engl J Med 295:74–80

Hennekens C, Buring J, Manson J, Stampfer M, Rosner B, Cook NR, Belanger C, LaMotte F, Gaziano J, Ridker P, Willett W, Peto R (1996) Lack of effect of long-term supplementation with beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J Med 334:1145–1149

Ivanova A (2003) A play-the-winner type urn model with reduced variability. Metrika 58:1–13

Kahan B, Morris T (2012) Improper analysis of trials randomized using stratified blocks or minimisation. Stat Med 31:328–340

Lachin J (1988a) Statistical properties of randomization in clinical trials. Control Clin Trials 9:289–311

Lachin J (1988b) Properties of simple randomization in clinical trials. Control Clin Trials 9:312–326

Lachin JM, Matts JP, Wei LJ (1988) Randomization in clinical trials: Conclusions and recommendations. Control Clin Trials 9(4):365–374

Leyland-Jones B, Bondarenko I, Nemsadze G, Smirnov V, Litvin I, Kokhreidze I, Abshilava L, Janjalia M, Li R, Lakshmaiah KC, Samkharadze B, Tarasova O, Mohapatra RK, Sparyk Y, Polenkov S, Vladimirov V, Xiu L, Zhu E, Kimelblatt B, Deprince K, Safonov I, Bowers P, Vercammen E (2016) A randomized, open-label, multicenter, phase III study of epoetin alfa versus best standard of care in anemic patients with metastatic breast cancer receiving standard chemotherapy. J Clin Oncol 34:1197–1207

Matthews J (2000) An introduction to randomized controlled clinical trials. Oxford University Press, Inc., New York

Matts J, Lachin J (1988) Properties of permuted-block randomization in clinical trials. Control Clin Trials 9:345–364

Pocock S, Simon R (1975) Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics 31:103–115

Proschan M, Brittain E, Kammerman L (2011) Minimize the use of minimization with unequal allocation. Biometrics 67(3):1135–1141. https://doi.org/10.1111/j.1541-0420.2010.01545.x

Rosenberger W, Uschner D, Wang Y (2018) Randomization: the forgotten component of the randomized clinical trial. Stat Med 38(1):1–12

Russell S, Bennett J, Wellman J, Chung D, Yu Z, Tillman A, Wittes J, Pappas J, Elci O, McCague S, Cross D, Marshall K, Walshire J, Kehoe T, Reichert H, Davis M, Raffini L, Lindsey G, Hudson F, Dingfield L, Zhu X, Haller J, Sohn E, Mahajan V, Pfeifer W, Weckmann M, Johnson C, Gewaily D, Drack A, Stone E, Wachtel K, Simonelli F, Leroy B, Wright J, High K, Maguire A (2017) Efficacy and safety of voretigene neparvovec (AAV2-hRPE65v2) in patients with RPE65-mediated inherited retinal dystrophy: a randomised, controlled, open-label, phase 3 trial. Lancet 390:849–860

Scott N, McPherson G, Ramsay C (2002) The method of minimization for allocation to clinical trials: a review. Control Clin Trials 23:662–674

Taves DR (1974) Minimization: a new method of assigning patients to treatment and control groups. Clin Pharmacol Ther 15:443–453

Wei L, Durham S (1978) The randomized play-the-winner rule in medical trials. J Am Stat Assoc 73(364):840–843

Author information

Authors and affiliations.

Statistics Collaborative, Inc., Washington, DC, USA

Corresponding author

Correspondence to Fan-fan Yu.

Editor information

Editors and affiliations.

Department of Surgery, Division of Surgical Oncology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA

Steven Piantadosi

Department of Epidemiology, School of Public Health, Johns Hopkins University, Baltimore, MD, USA

Curtis L. Meinert

Section Editor information

Department of Medicine, University of Alabama, Birmingham, AL, USA

O. Dale Williams

About this entry

Cite this entry.

Yu, Ff. (2022). Principles of Clinical Trials: Bias and Precision Control. In: Piantadosi, S., Meinert, C.L. (eds) Principles and Practice of Clinical Trials. Springer, Cham. https://doi.org/10.1007/978-3-319-52636-2_211

DOI : https://doi.org/10.1007/978-3-319-52636-2_211

Published : 20 July 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-319-52635-5

Online ISBN : 978-3-319-52636-2

eBook Packages : Mathematics and Statistics Reference Module Computer Science and Engineering

Lesson 8: Treatment Allocation and Randomization

Treatment allocation in a clinical trial can be randomized or nonrandomized. Nonrandomized schemes, such as investigator-selected treatment assignments, are susceptible to large biases. Even systematic nonrandomized schemes, such as alternating treatments, are susceptible to discovery and can lead to bias. To reduce bias, we therefore prefer randomized schemes. Credibility requires that the allocation process be non-discoverable. The investigator should not know which treatment will be assigned until the patient has been determined to be eligible. Even sealed envelopes containing the treatment assignment are prone to discovery.

Randomized schemes for treatment allocation are preferable in most circumstances. When choosing an allocation scheme for a clinical trial, there are three technical considerations:

reducing bias;
producing a balanced comparison;
quantifying errors attributable to chance.

Randomization procedures provide the best opportunity for achieving these objectives.

Upon completion of this lesson, you should be able to:

Identify three benefits of randomization.
Distinguish simple randomization from constrained randomization.
State the purpose of randomization in permuted blocks.
State the objective of stratified randomization.
Contrast the benefits of permuted blocks to those of adaptive randomization schemes.
Use a SAS program to produce a permuted blocks randomization plan.
Use an allocation ratio that will maximize statistical power in the situation where greater variability is expected in one treatment group than the other.
Provide the rationale against randomizing prior to informed consent.

8.1 - Randomization

In some early clinical trials, randomization was performed by constructing two balanced groups of patients and then randomly assigning the two groups to the two treatments. This is not always practical because most trials do not have all of their patients recruited on day one of the study. Most clinical trials today use a procedure in which individual patients are randomized to treatment as they enter the study.

Randomization is effective in reducing bias because it guarantees that treatment assignment will not be based on the patient's prognostic factors. Thus, investigators cannot favor one treatment group over another by assigning patients with better prognoses to it, either knowingly or unknowingly. Procedure selection bias has been documented to have a very strong effect on outcome variables.

Another benefit of randomization which might not be as obvious is that it typically prevents confounding of the treatment effects with other prognostic variables. Some of these factors may or may not be known. The investigator usually does not have a complete picture of all the potential prognostic variables, but randomization tends to balance the treatment groups with respect to the prognostic variables.

Some researchers argue against randomization because it is possible to conduct statistical analysis, e.g., analysis of covariance (ANCOVA), that adjusts for the prognostic variables. It always is best, however, to prevent a problem rather than adjust for it later. In addition, ANCOVA does not necessarily resolve the problem satisfactorily because the investigator may be unaware of certain prognostic variables and because it assumes a specific statistical model that may not be correct.

Although randomization provides great benefit in clinical trials, there are certain methodological problems and biases that it cannot prevent. One example where randomization has little, if any, impact is external validity in a trial that has imposed very restrictive eligibility criteria. Another example is assessment bias, which treatment masking and other design features can minimize. For instance, measurement bias can be introduced when a patient is asked "how do you feel?" or "how bad is your pain?" to describe their condition.

Simple Randomization

The most popular form of randomization is simple randomization. In this situation, a patient is assigned a treatment without any regard for previous assignments. This is similar to flipping a coin - the same chance regardless of what happened in the previous coin flip.

One problem with simple randomization is the small probability of assigning the same number of subjects to each treatment group. Severe imbalance in the numbers assigned to each treatment is a critical issue with small sample sizes.
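To make the size of this problem concrete, the following minimal sketch computes the chance of balance under simple randomization to two arms with a fair coin. It assumes 1:1 allocation; the function names `prob_exact_balance` and `prob_imbalance_at_least` are illustrative and not part of the lesson.

```python
from math import comb


def prob_exact_balance(n: int) -> float:
    """Probability that n fair-coin assignments split exactly n/2 : n/2."""
    if n % 2 != 0:
        return 0.0  # an odd number of patients can never split evenly
    return comb(n, n // 2) / 2 ** n


def prob_imbalance_at_least(n: int, k: int) -> float:
    """Probability that the two arm sizes differ by at least k patients."""
    # a = number assigned to A, so the difference between arms is |2a - n|
    favorable = sum(comb(n, a) for a in range(n + 1) if abs(2 * a - n) >= k)
    return favorable / 2 ** n


if __name__ == "__main__":
    for n in (10, 20, 100):
        print(n, round(prob_exact_balance(n), 4))
```

The probability of exact balance shrinks as the sample size grows, while the relative imbalance matters most when the sample size is small.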

Another disadvantage of simple randomization is that it can lead to an imbalance among the treatment groups with respect to prognostic variables that affect the outcome variables.

For example, suppose disease severity in a trial is designated as mild, moderate, and severe. Suppose that simple randomization to treatment groups A and B is applied. The following table illustrates what possibly could occur.

The moderate group is fairly well balanced, but the mild and severe groups are much more imbalanced. As a result, Group A receives more of the severe cases and Group B more of the mild cases.

8.2 - Constrained Randomization

Randomization in permuted blocks is one approach to achieve balance across treatment groups. The randomization scheme consists of a sequence of blocks such that each block contains a pre-specified number of treatment assignments in random order. The purpose of this is so that the randomization scheme is balanced at the completion of each block. For example, suppose equal allocation is planned in a two-armed trial (groups A and B) using a randomization scheme of permuted blocks. The target sample size is 120 patients (60 in A and 60 in B) and the investigator plans to enroll 12 subjects per week. In this situation, blocks of size 12 are natural, so the randomization plan looks like

Each week, patients are assigned treatments in the random order specified by that week's block. Notice that there are exactly six As and six Bs within each block, so that at the end of each week there is balance between the two treatment arms. If the trial is terminated early, say after 64 patients have been enrolled, there may not be exact balance, but it will be close.
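The scheme described above can be sketched in a few lines of Python. This is an illustrative sketch, not the lesson's SAS program: the function name `permuted_blocks` and the `seed` parameter are assumptions introduced here, and equal allocation within each block is assumed.

```python
import random


def permuted_blocks(n_blocks, block_size, arms=("A", "B"), seed=None):
    """Return a treatment schedule made of independently shuffled blocks.

    Each block contains block_size / len(arms) copies of every arm, so the
    schedule is exactly balanced at the end of each block.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # seed is only for reproducibility of the sketch
    per_arm = block_size // len(arms)
    schedule = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm
        rng.shuffle(block)  # random order within the block
        schedule.extend(block)
    return schedule


if __name__ == "__main__":
    # 10 weekly blocks of 12 patients, matching the 120-patient example
    plan = permuted_blocks(n_blocks=10, block_size=12, seed=1)
    for week in range(10):
        print(week + 1, "".join(plan[week * 12:(week + 1) * 12]))
```

If enrollment stops after 64 patients, five blocks are complete and four patients have entered the sixth, so the two arms can differ by at most four patients.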

Ordinarily, a natural block size is not evident, so logistical procedures may suggest a block size. A variation of blocked randomization is to use block sizes of unequal length. This might be helpful for a trial where the investigator is unmasked. For example, if the investigator knows that the block size is six, and within a particular block treatment A already has been assigned to three patients, then it is obvious that the remaining patients in the block will be assigned treatment B. If the investigator knows the treatment assignment prior to evaluating the eligibility criteria, then this could lead to procedure selection bias. It is not good to use a discoverable assignment of treatments. A next step to take would be to vary the block size in order to keep the investigator's procedure selection bias minimized.
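One way to vary the block size is to draw each block's length at random from a small set of even sizes. The sketch below assumes two arms with equal allocation; the candidate sizes (4, 6, 8) and the name `variable_block_schedule` are illustrative choices, not values given in the lesson.

```python
import random


def variable_block_schedule(n_total, sizes=(4, 6, 8), seed=None):
    """Build an A/B schedule from permuted blocks of randomly chosen size."""
    if any(size % 2 != 0 for size in sizes):
        raise ValueError("every block size must be even for 1:1 allocation")
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_total:
        size = rng.choice(sizes)  # the investigator cannot know this in advance
        block = ["A"] * (size // 2) + ["B"] * (size // 2)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_total]  # the final block may be cut short


if __name__ == "__main__":
    print("".join(variable_block_schedule(30, seed=3)))
```

Because the block boundaries are unknown to the investigator, counting earlier assignments no longer reveals the next one with certainty, while the imbalance at any point still cannot exceed half of the largest block size.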

To illustrate that randomization with permuted blocks is a form of constrained randomization, let $N_A$ and $N_B$ denote the number of As and Bs, respectively, to be contained within each block. Suppose that when an eligible patient is ready to be randomized there are $n_A$ and $n_B$ patients already randomized to groups A and B, respectively, within the current block. Then the probability that the patient is randomized to treatment A is:

$Pr[A]=\begin{cases} 0 & \text{ if } n_A=N_A \\ \dfrac{N_A-n_A}{N_A+N_B-n_A-n_B}& \text{ if } 0 \le n_A<N_A \text{ and } 0 \le n_B<N_B \\ 1 & \text{ if } n_B=N_B \end{cases}$

This probability rule is based on the model of $N_A$ "A" balls and $N_B$ "B" balls in an urn or jar which are sampled without replacement. The probability of being assigned treatment A changes according to how many patients already have been assigned treatment A and treatment B within the block.

As an example, suppose each block is supposed to have $N_A = N_B = 6$ and $n_A = 3$ and $n_B = 2$ already have been assigned. Thus, there are $N_A - n_A = 3$ A balls left in the urn and $N_B - n_B = 4$ B balls left in the urn, so the probability of the next eligible patient being assigned treatment A is 3/7.
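The worked example can be checked with a short function that implements the urn rule using exact fractions. The function name `prob_next_is_a` is illustrative; the arguments mirror the symbols $N_A$, $N_B$, $n_A$, and $n_B$ used above.

```python
from fractions import Fraction


def prob_next_is_a(N_A, N_B, n_A, n_B):
    """Probability that the next patient in the current block receives A.

    Models an urn with N_A 'A' balls and N_B 'B' balls sampled without
    replacement, after n_A 'A' balls and n_B 'B' balls have been drawn.
    """
    if not (0 <= n_A <= N_A and 0 <= n_B <= N_B):
        raise ValueError("counts must lie between 0 and the block totals")
    remaining = (N_A - n_A) + (N_B - n_B)
    if remaining == 0:
        raise ValueError("the block is complete; start a new block")
    return Fraction(N_A - n_A, remaining)


if __name__ == "__main__":
    print(prob_next_is_a(6, 6, 3, 2))
```

The same expression returns 0 once all As in the block are used and 1 once all Bs are used, which are the two boundary cases of the rule.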

8.3 - Stratified Randomization

Another type of constrained randomization is called stratified randomization. Stratified randomization refers to the situation in which strata are constructed based on values of prognostic variables and a randomization scheme is performed separately within each stratum. For example, suppose that there are two prognostic variables, age and gender, such that four strata are constructed:

Stratum sizes usually vary (for example, there may be relatively few young males and young females with the disease of interest). The objective of stratified randomization is to ensure balance of the treatment groups with respect to the various combinations of the prognostic variables. Simple randomization will not ensure that the treatment groups are balanced within these strata, so permuted blocks are used within each stratum to achieve balance.

If there are too many strata in relation to the target sample size, then some of the strata will be empty or sparse. Taken to the extreme, each stratum contains only one patient, which in effect yields a result similar to simple randomization. Keep the number of strata to a minimum.
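A minimal sketch of stratified randomization keeps a separate permuted-block sequence for each stratum. The class name `StratifiedRandomizer`, the block size of 4, and the stratum labels (young/old by male/female) are assumptions made for illustration; the lesson does not specify them.

```python
import random
from itertools import product


class StratifiedRandomizer:
    """Permuted-block randomization to arms A and B, run separately per stratum."""

    def __init__(self, strata, block_size=4, seed=None):
        if block_size % 2 != 0:
            raise ValueError("block_size must be even for 1:1 allocation")
        self._rng = random.Random(seed)
        self._block_size = block_size
        # One queue of not-yet-used assignments per stratum
        self._pending = {stratum: [] for stratum in strata}

    def assign(self, stratum):
        """Return the next treatment for a patient in the given stratum."""
        queue = self._pending[stratum]
        if not queue:  # current block exhausted: generate a fresh shuffled block
            half = self._block_size // 2
            queue.extend(["A"] * half + ["B"] * half)
            self._rng.shuffle(queue)
        return queue.pop()


if __name__ == "__main__":
    strata = list(product(("young", "old"), ("male", "female")))
    randomizer = StratifiedRandomizer(strata, block_size=4, seed=5)
    for patient in [("young", "male"), ("old", "female"), ("young", "male")]:
        print(patient, randomizer.assign(patient))
```

With many strata and a small trial, most of these per-stratum blocks are never completed, which is why sparse strata erode the balance that blocking is meant to provide.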

8.4 - Adaptive Randomization

Adaptive randomization refers to any scheme in which the probability of treatment assignment changes according to assigned treatments of patients already in the trial. Although permuted blocks can be considered as such a scheme, adaptive randomization is a more general concept in which treatment assignment probabilities are adjusted.

One advantage of permuted blocks over adaptive randomization is that the entire randomization scheme can be determined prior to the onset of the study, whereas many adaptive randomization schemes require recalculation of treatment assignment probabilities for each new patient.

Urn models provide some approaches for adaptive randomization. Here is an exercise that will help to explain this type of scheme. Suppose that there is one "A" ball and one "B" ball in an urn and the objective of the trial is equal allocation between treatments A and B. Suppose that an "A" ball is blindly selected so that the first patient is assigned treatment A. Then the original "A" ball and another "B" ball are placed in the urn so that the second patient has a 1/3 chance of receiving treatment A and a 2/3 chance of receiving treatment B. At any point in time with $n_A$ "A" balls and $n_B$ "B" balls in the urn, the probability of being assigned treatment A is $\dfrac{n_A}{n_A+ n_B}$. The scheme changes based on what treatments have already been assigned to patients.

This type of urn model for adaptive randomization yields tight control of balance in the early phase of a trial. As $n_A$ and $n_B$ get larger, the scheme tends to approach simple randomization, so the advantage of such an approach occurs when the trial has a small target sample size.
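The urn exercise can be simulated directly. In this sketch the urn starts with one ball of each type, and each assignment adds one ball for the opposite treatment; the function names `prob_a` and `urn_randomization` are illustrative.

```python
import random


def prob_a(history):
    """Probability that the next patient receives A, given past assignments.

    The urn starts with one 'A' and one 'B' ball; every past assignment to
    one arm has added a ball for the other arm.
    """
    balls_a = 1 + history.count("B")
    balls_b = 1 + history.count("A")
    return balls_a / (balls_a + balls_b)


def urn_randomization(n_patients, seed=None):
    """Simulate the urn scheme for n_patients and return the assignments."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_patients):
        history.append("A" if rng.random() < prob_a(history) else "B")
    return history


if __name__ == "__main__":
    assignments = urn_randomization(20, seed=8)
    print("".join(assignments), assignments.count("A"), assignments.count("B"))
```

After many patients the one extra ball per assignment barely changes the urn's composition relative to its size, which is why the scheme behaves more and more like simple randomization.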

8.5 - Minimization

Minimization is another, rather complicated type of adaptive randomization. Minimization schemes construct measures of imbalance for each treatment when an eligible patient is ready for randomization. The patient is assigned to the treatment which yields the lowest imbalance score. If the imbalance scores are all equal, then that patient is randomly assigned a treatment. This type of adaptive randomization imposes tight control of balance, but it is more labor-intensive to implement because the imbalance scores must be calculated with each new patient. Some researchers have developed web-based applications and automated 24-hour telephone services that solicit information about the stratifiers; a computer algorithm then uses these data to determine the treatment assignment.

One popular minimization scheme is based on marginal totals of the stratifying variables. As an example, consider a three-armed clinical trial (treatments A, B, C). Suppose there are four stratifying variables, whereby each stratifier has three levels (low, medium, high), yielding $3^4 = 81$ strata in this trial. Suppose that 200 patients have been randomized and patient #201 is ready for randomization. The observations of the stratifying variables are recorded as follows.

Suppose that patient #201 is ready for randomization and that this patient is observed to have the low level of stratifier #1, the medium level of stratifier #2, the high level of stratifier #3, and the high level of stratifier #4. Based on the 200 patients already in the trial, the number of patients with each of these levels is totaled for each treatment group. (Notice that patients may be double counted in this table.)

Patient #201 would be assigned to treatment A because it has the lowest marginal total. If two or more treatment arms are tied for the smallest marginal total, then the patient is randomly assigned to one of the tied treatment arms. This is not a perfect scheme, but it is a strategy for keeping the treatment groups as balanced as possible with respect to each of the four variables.
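The marginal-totals rule can be sketched as follows. The counts in `hypothetical_counts` are invented purely for illustration and are not the lesson's table; the function name `minimization_assign` is likewise an assumption.

```python
import random


def minimization_assign(marginal_totals, rng=None):
    """Pick the arm with the smallest marginal total; break ties at random."""
    rng = rng or random.Random()
    lowest = min(marginal_totals.values())
    tied = sorted(arm for arm, total in marginal_totals.items() if total == lowest)
    return rng.choice(tied)


# HYPOTHETICAL counts (not from the lesson): for each arm, the number of
# previously randomized patients who share each of the new patient's levels.
hypothetical_counts = {
    "A": {"stratifier1_low": 9, "stratifier2_medium": 20,
          "stratifier3_high": 12, "stratifier4_high": 7},
    "B": {"stratifier1_low": 11, "stratifier2_medium": 18,
          "stratifier3_high": 14, "stratifier4_high": 9},
    "C": {"stratifier1_low": 10, "stratifier2_medium": 21,
          "stratifier3_high": 13, "stratifier4_high": 8},
}

# A patient contributes to one count per stratifier, hence the double counting.
totals = {arm: sum(levels.values()) for arm, levels in hypothetical_counts.items()}

if __name__ == "__main__":
    print(totals, "->", minimization_assign(totals))
```

Because the rule is deterministic whenever one arm has a strictly smaller total, some implementations add a random element to every assignment; that variation is not shown here.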

8.6 - "Play the Winner" Rule

Another type of adaptive randomization scheme is called the "play the winner" rule. Suppose there is a two-armed clinical trial and the urn contains one "A" ball and one "B" ball for the first patient. Suppose that the patient is randomly assigned treatment A. Now you need to know whether the treatment was successful for this patient. If the patient does well on treatment A, then the original "A" ball and another "A" ball are placed in the urn. If the patient fails on treatment A, then the original "A" ball and a "B" ball are placed in the urn. Thus, the second patient has probability 2/3 of receiving treatment A if treatment A was a success for the first patient and probability 1/3 if it was a failure. This process continues. If one treatment is more successful than the other, the odds are stacked in favor of that treatment.

The advantage of the "play the winner" rule is that a higher proportion of patients will be assigned to the more successful treatment. This seems to be an ethical approach.

The disadvantages of the "play the winner" rule are that:

sample size calculations are difficult, and
the outcome on each patient must be determined prior to the entry of the next patient.

Thus, the "play the winner" rule is not practical for most trials. The procedure can be modified, however, to be performed in stages. For example, if the target sample size is 200 patients, then the trial can be put on hold after each set of 50 patients to assess outcome and redefine the probability of treatment assignment for the patients yet to be recruited, i.e., "play the winner" after every 50 patients instead of every patient.
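A small simulation illustrates how the "play the winner" urn shifts assignments toward the better treatment. The success probabilities are hypothetical inputs chosen for the sketch, outcomes are assumed to be known before the next patient enters, and the function name `play_the_winner` is illustrative.

```python
import random


def play_the_winner(n_patients, p_success, seed=None):
    """Simulate the randomized play-the-winner urn for two arms.

    p_success maps each arm to an assumed (hypothetical) success probability.
    A success adds a ball for the same arm; a failure adds a ball for the
    other arm. Returns the assignments and the final urn contents.
    """
    rng = random.Random(seed)
    urn = {"A": 1, "B": 1}
    assignments = []
    for _ in range(n_patients):
        p_a = urn["A"] / (urn["A"] + urn["B"])
        arm = "A" if rng.random() < p_a else "B"
        other = "B" if arm == "A" else "A"
        success = rng.random() < p_success[arm]
        urn[arm if success else other] += 1
        assignments.append(arm)
    return assignments, urn


if __name__ == "__main__":
    arms, final_urn = play_the_winner(100, {"A": 0.7, "B": 0.4}, seed=12)
    print(arms.count("A"), arms.count("B"), final_urn)
```

The staged variant described above would update the urn only after each batch of patients has been assessed, rather than after every patient.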

8.7 - Administration of the Randomization Process

SAS® Example: Permuted Blocks Randomization Scheme for Equal Allocation to Treatments A and B

The RANUNI function in SAS yields random numbers from the Uniform(0,1) distribution (a randomly selected decimal between 0 and 1). These random numbers can be used to generate a randomization scheme. For example, suppose that the probabilities of assignment to treatments A, B, and C are to be 0.25, 0.25, and 0.5, respectively. Let U denote the random number generated and assign treatment as follows:

A if $0.00 < U \leq 0.25$
B if $0.25 < U \leq 0.50$
C if $0.50 < U \leq 1.00$

This can be adapted for whatever your scheme requires.
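The same thresholding idea is sketched below in Python rather than SAS, with `random.random()` standing in for RANUNI. The function name `assign_from_uniform` is illustrative.

```python
import random


def assign_from_uniform(u):
    """Map a Uniform(0,1) draw to arm A, B, or C with probabilities 0.25, 0.25, 0.5."""
    if u <= 0.25:
        return "A"
    if u <= 0.50:
        return "B"
    return "C"


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed only so the sketch is reproducible
    draws = [assign_from_uniform(rng.random()) for _ in range(10000)]
    print({arm: draws.count(arm) / len(draws) for arm in "ABC"})
```

Note that `random.random()` returns values in [0, 1) while the intervals above are closed on the right; the difference affects only single boundary points and has no practical effect on the assignment probabilities.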

Here is a SAS program that provides a permuted blocks randomization scheme for equal allocation to treatments A and B. In the example, the block size is 6 and the total sample size is 48.

Did you get something like this?

Remember, your output is not likely to be identical to what we got above, but with a block size of 6 the number assigned to each treatment should be the same in each group after every set of 6 patients.

Future treatment assignments in a randomization scheme should not be discoverable by the investigator. Otherwise, the protection against selection bias offered by randomization is lost. The randomization schedule should not be physically available to the investigator. Keeping it inaccessible is rarely a problem in multi-center trials, but it often is one in small single-center trials. Logistical problems can arise in trials with hospitalized patients in which 24-hour access to randomization is necessary. Sometimes, sealed envelopes are used as a means of keeping the randomized treatment assignments confidential until a patient is eligible for entry. However, it is relatively easy for investigators to tamper with the envelope system.

Example randomization plan

Many clinical trials rely on pharmacies to package the drugs so that they are masked to investigators and patients. For example, consider a two-armed trial with a target sample size of 96 randomized subjects (48 within each treatment group). The pharmacist constructs 96 drug packets and randomly assigns numeric codes from 01 to 96 which are printed on the drug packet labels. The pharmacist gives the investigator the masked drug packets (with their numeric codes). When a subject is eligible for randomization, the investigator selects the next drug packet (in numeric order). In this way the investigator is kept from knowing which treatment is assigned to which patient.
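The pharmacist's key linking packet codes to treatments could be produced as in the sketch below. It assumes simple shuffling of 48 As and 48 Bs with no blocking, which the lesson does not specify; the function name `build_packet_key` is illustrative.

```python
import random


def build_packet_key(n_per_arm=48, seed=None):
    """Return a dict mapping two-digit packet codes to treatments A and B.

    The key stays with the pharmacist; the investigator sees only the codes.
    """
    rng = random.Random(seed)
    treatments = ["A"] * n_per_arm + ["B"] * n_per_arm
    rng.shuffle(treatments)  # random link between code order and treatment
    return {f"{code:02d}": arm for code, arm in enumerate(treatments, start=1)}


if __name__ == "__main__":
    key = build_packet_key(seed=96)
    for code in ("01", "02", "03", "96"):
        print(code, key[code])
```

Shuffling the whole list balances the arms only at the end of the trial; combining the codes with permuted blocks would keep the arms balanced throughout enrollment.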

Here is a SAS program that provides ...

8.8 - Unequal Treatment Allocation

To maximize the efficiency (statistical power) of treatment comparisons, investigators typically employ equal allocation of patients to treatment groups (this assumes that the variability in the outcome measure is the same for each treatment).

Unequal allocation may be preferable in some situations. An unequal allocation that favors an experimental therapy over placebo could help recruitment and would increase the experience with the experimental therapy. It also provides the opportunity to perform some subset analyses of interest; e.g., if a subset analysis of elderly patients is planned, the unequal allocation would yield more elderly patients on the experimental therapy.

Another example where unequal allocation may be desirable occurs when one therapy is extremely expensive in comparison to the other therapies in the trial. For budget reasons, you may not be able to assign as many to the expensive therapy.

If it is known that one treatment is more variable (less precise) in the outcome response than the other treatments, then the statistical power for treatment comparisons is maximized with unequal allocation. The allocation ratio should be

$r = n_1/n_2 = \sigma_1 / \sigma_2$

which is the ratio of the known standard deviations. Thus, the treatment that yields less precision (larger standard deviation) should receive more patients, which is an unequal allocation. Because there is more "noise" in that group, a larger sample size helps to cut through it.
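The allocation ratio can be checked numerically. For a fixed total sample size, the variance of the difference in means is $\sigma_1^2/n_1 + \sigma_2^2/n_2$, and the sketch below searches all splits for the one that minimizes it. The function names and the example values (300 patients, standard deviations 2 and 1) are illustrative assumptions.

```python
def variance_of_difference(n1, n2, sigma1, sigma2):
    """Variance of the difference between two independent sample means."""
    return sigma1 ** 2 / n1 + sigma2 ** 2 / n2


def best_split(n_total, sigma1, sigma2):
    """Return the group-1 size that minimizes the variance of the difference."""
    return min(
        range(1, n_total),
        key=lambda n1: variance_of_difference(n1, n_total - n1, sigma1, sigma2),
    )


if __name__ == "__main__":
    n_total, sigma1, sigma2 = 300, 2.0, 1.0
    n1 = best_split(n_total, sigma1, sigma2)
    n2 = n_total - n1
    print(n1, n2, n1 / n2, sigma1 / sigma2)
```

In this example the search selects 200 and 100 patients, a ratio of 2, which matches $\sigma_1/\sigma_2$; with equal standard deviations the same search returns an equal split.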

8.9 - Randomization Prior to Informed Consent

Randomization prior to informed consent can increase the number of trial participants, but it causes some difficulties. This is not recommended practice. Here's why...

One particular scheme with experimental and standard treatments that has received some attention is as follows. Eligible patients are randomized prior to providing consent. If the patient is assigned to the standard therapy, then it is offered to the patient without the need for consent. If the patient is randomized to the experimental therapy, then the patient is asked for consent. If this patient refuses, however, then he/she is offered the standard therapy. An "intent-to-treat" analysis is performed based on the randomized assignment.

This approach can increase trial participation, but patients who are randomized to the experimental treatment and refuse will dilute the treatment difference at the time of data analysis. In addition, the "intent-to-treat" analysis will introduce bias.

There are ethical problems as well because:

subjects are randomized to treatment without having been properly informed and without providing their consent, and
subjects randomized to standard therapy have been denied the chance of receiving the experimental therapy.

For all of these reasons, randomization prior to informed consent is not recommended.

8.10 - Summary

In this lesson, among other things, we learned:
