Hypotheses

The null and alternative hypotheses for the equivalence test are presented below for two different approaches:

Difference of Means (DOM)

One common approach for assessing bio-equivalence involves comparing pharmacokinetic (PK) measures between test and reference products. This is done using the following interval (null) hypothesis:

Null Hypothesis (\(H_0\)): At least one endpoint does not meet the equivalence criteria:

\[H_0: m_T^{(j)} - m_R^{(j)} \le \delta_L ~~ \text{or}~~ m_T^{(j)} - m_R^{(j)} \ge \delta_U \quad \text{for at least one}\;j\]

Alternative Hypothesis (\(H_1\)): All endpoints meet the equivalence criteria:

\[H_1: \delta_L<m_{T}^{(j)}-m_{R}^{(j)} <\delta_U \quad\text{for all}\;j\]

Here, \(m_T\) and \(m_R\) represent the mean endpoints for the test product (the proposed biosimilar) and the reference product, respectively. The equivalence limits, \(\delta_L\) and \(\delta_u\), are typically chosen to be symmetric, such that \(\delta = - \delta_L = \delta_U\).

The null hypothesis (\(H_0\)) is rejected if, and only if, all null hypotheses associated with the \(K\) primary endpoints are rejected at a significance level of \(\alpha\). This ensures that equivalence is established across all endpoints simultaneously.

The DOM test can be implemented in sampleSize() by setting ctype = "DOM" and lognorm = FALSE.

For pharmacokinetic (PK) outcomes, such as the area under the curve (AUC) and maximum concentration (Cmax), log-transformation is commonly applied to achieve normality. To perform this transformation, the logarithm of the geometric mean should be provided to mu_list, while the logarithmic variance can be derived from the coefficient of variation (CV) using the formula:

\[ \text{Logarithmic Variance} = \log\left(1 + {\text{CV}^2}\right) \]

Equivalence limits must also be specified on the log scale to align with the transformed data.

Ratio of Means (ROM)

The equivalence hypotheses can also be expressed as a Ratio of Means (ROM):

Null Hypothesis (\(H_0\)): At least one endpoint does not meet the equivalence criteria:

\[H_0: \frac{\mu_T^{(j)}}{\mu_R^{(j)}} \le E_L ~~ \text{or}~~ \frac{\mu_T^{(j)}}{\mu_R^{(j)}} \ge E_U \quad \text{for at least one}\;j\]

Alternative Hypothesis (\(H_1\)): All endpoints meet the equivalence criteria:

\[H_1: E_L< \frac{\mu_{T}^{(j)}}{\mu_{R}^{(j)}} < E_U \quad\text{for all}\;j\]

Here, \(\mu_T\) and \(\mu_R\) represent the arithmetic mean endpoints for the test product (the proposed biosimilar) and the reference product, respectively.

The ROM test can be implemented in sampleSize() by setting ctype = "ROM" and lognorm = TRUE. Note that the mu_list argument should contain the arithmetic means of the endpoints, while sigma_list should contain their corresponding variances.

The ROM test is converted to a Difference of Means (DOM) tests by log-transforming the data and equivalence limits. The variance on the log scale is calculated using the normalized variance formula:

\[ \text{Logarithmic Variance} = \log\left(1 + \frac{\text{Arithmetic Variance}}{\text{Arithmetic Mean}^2}\right) \]

The logarithmic mean is then calculated as:

\[\text{Logarithmic Mean} = \log(\text{Arithmetic Mean}) - \frac{1}{2}(\text{Logarithmic Variance})\]

Regulatory Requirements

When evaluating bioequivalence, certain statistical and methodological requirements must be adhered to, as outlined in the European Medicines Agency’s bioequivalence guidelines (Committee for Medicinal Products for Human Use (CHMP) 2010). These requirements ensure that the test and reference products meet predefined criteria for equivalence in terms of PK parameters. The key considerations are summarized below:

Hypothesis testing should be based on the ratio of the population geometric means
The 90% confidence interval for the ratio of the test and reference products should be contained within the acceptance interval of 80.00 to 125.00%.
A margin of clinical equivalence (\(\Delta\)) is chosen by defining the largest difference that is clinically acceptable, so that a difference larger than this would matter in practice.
The data should be transformed prior to analysis using a logarithmic transformation and subsequently be analyzed using ANOVA

When conducting a DOM test, The FDA recommends that the equivalence acceptance criterion (EAC) be defined as \(\delta = EAC = 1.5 \sigma_R\), where \(\sigma_R\) represents the variability of the log-transformed endpoint for the reference product.

Testing of multiple endpoints

Assessment of equivalence is often required for more than one primary variable. (Sozu et al. 2015b) For example, EMA recommends showing equivalence both for AUC and Cmax

A decision must be made as to whether it is desirable to

Demonstrate equivalence for all primary endpoints
- This is the most common setting and is often referred to as having multiple co-primary endpoints.
- Equivalence must be demonstrated for all endpoints to conclude overall equivalence.
Demonstrate equivalence for at least one of the primary endpoints
- Known as having multiple primary endpoints.
- Equivalence is required for at least one endpoint to meet the study’s objectives.

Testing multiple co-primary endpoints

When a trial defines multiple co-primary endpoints, equivalence must be demonstrated for all of them to claim overall treatment equivalence. In this setting, each endpoint is tested separately at the usual significance level (\(\alpha\)), and equivalence is established only if all individual tests are statistically significant. Because conclusions require rejecting all null hypotheses, a formal multiplicity adjustment is not needed to control the Type I error rate. (Committee for Propietary Medicinal Products (CPMP) 2002) However, as the number of co-primary endpoints (\(K\)) increases, the likelihood of failing to meet equivalence on at least one endpoint also rises, resulting in a higher Type II error rate (i.e., a greater risk of incorrectly concluding non-equivalence) (Mielke et al. 2018)

This has several important implications:

Reduced Power in Rare Diseases. Previous studies have shown that the sample size required to maintain a given power level increases as the number of endpoints increases. (Mielke et al. 2018) This effect is particularly pronounced when the test statistics are uncorrelated or when a large number of tests are performed. In common conditions, this loss of power can often be compensated by increasing the sample size. In rare diseases, however, patient recruitment is often limited, making it more challenging to achieve equivalence across all endpoints and increasing the risk of an inconclusive result.
Alternative Statistical Approaches. To address power loss from requiring equivalence across multiple endpoints, alternative methods have been proposed. For example, one option is to power the study so that equivalence can be demonstrated for at least \(k\) tests, rather than requiring all endpoints to meet the equivalence criterion. Another approach is hierarchical testing, where endpoints are tested sequentially based on predefined rules, allowing for partial conclusions when equivalence is demonstrated in a subset of endpoints. See Testing multiple primary endpoints for more details.
Regulatory Considerations. Regulatory agencies often require a pre-specified statistical strategy to handle multiple endpoints in equivalence trials. Without proper planning, the risk of failing to establish equivalence in all endpoints may lead to inconclusive results, even if the treatments are meaningfully similar.

Testing multiple primary endpoints

When a trial aims to establish equivalence for at least \(k\) primary endpoints, adjustments are necessary to account for the increased risk of Type I error due to multiple hypothesis testing (Sozu et al. 2015a). Without such adjustments, the likelihood of incorrectly concluding equivalence for at least one endpoint increases as the number of endpoints grows.

For example, if a study includes \(m = 3\) independent primary endpoints and uses a significance level of \(\alpha = 5%\) for each test, the overall probability of falsely concluding equivalence for at least one endpoint is:

\[ 1 – (1-\alpha)^m = 1 - (1-0.05)^3 = 0.1426. \] This means that the overall probability of making any false positive error, also known as the family-wise error rate (FWER), increases to approximately 14%.

To address this issue, adjustments to the significance level are necessary for multiple endpoint comparisons for which various methods have been proposed. In SimTOST, the following approaches are included:

Bonferroni correction

The most common and easiest procedure for multiplicity adjustment to control the FWER is the Bonferroni method. Each hypothesis is tested at level

\[\alpha_{bon}= \alpha/m\]

where \(m\) is the total number of tests. Although simple, this method is highly conservative, particularly when tests are correlated, as it assumes all tests are independent. This conservativeness remains pronounced even for \(k=1\), where only one of the \(m\) hypotheses needs to be rejected. (Mielke et al. 2018)

In the sampleSize() function, the Bonferroni correction can be applied by setting adjust = "bon".

Sidak correction

The Sidak correction is an alternative method for controlling the FWER. Like the Bonferroni correction, it assumes that tests are independent. However, the Sidak correction accounts for the joint probability of all tests being non-significant, making it mathematically less conservative than the Bonferroni method. The adjusted significance level is calculated as:

\[\alpha_{sid}= 1-(1-\alpha)^ {1/m}\]

The Sidak correction can be implemented by specifying adjust = "sid" in the sampleSize() function.

K adjustment

This correction explicitly accounts for the scenario where equivalence is required for only \(k\) out of \(m\) endpoints. Unlike the Bonferroni and Sidak corrections, which assume that all \(m\) tests contribute equally to the overall Type I error rate, the k-adjustment directly incorporates the number of endpoints (\(k\)) required for equivalence into the adjustment. The adjusted significance level is calculated as:

\[\alpha_k= \frac{k*\alpha}{m}\]

where \(k\) is the number of endpoints required for equivalence, and \(m\) is the total number of endpoints evaluated.

Hierarchical testing of multiple endpoints

Hierarchical testing is an approach to multiple endpoint testing where endpoints are tested in a predefined order, typically based on their clinical or regulatory importance. A fallback testing strategy is applied, allowing sequential hypothesis testing. If a hypothesis earlier in the sequence fails to be rejected, testing stops, and subsequent hypotheses are not evaluated. (Chowdhry et al. 2024)

To implement hierarchical testing in simTOST, the user specifies adjust = "seq" in the sampleSize() function and defines primary and secondary endpoints using the type_y vector argument. The significance level (\(\alpha\)) is adjusted separately for each group of endpoints, ensuring strong control of the Family-Wise Error Rate (FWER) while maintaining interpretability.

Evaluate (co-)primary endpoints
- Testing begins with the pre-specified (co-)primary endpoints at the nominal significance level.
- Equivalence must be demonstrated for all (co-)primary endpoints.
- If any primary endpoint fails to meet the equivalence criteria, testing stops, and secondary endpoints are not evaluated.
Proceed to secondary endpoints (if applicable)
- If all primary endpoints meet the equivalence criteria, testing proceeds to the secondary endpoints.
- Equivalence must be demonstrated for at least k secondary endpoints.
- The significance level for secondary endpoints is adjusted using k-adjustment.
Final Decision
- The test is considered successful if all (co-)primary endpoints meet the equivalence criteria and at least k secondary endpoints demonstrate equivalence.

An example of hierarchical testing can be found in this vignette.

Testing of multiple treatments

In certain cases, it may be necessary to compare multiple treatments simultaneously. This can be achieved by specifying multiple comparators in the mu_list and sigma_list parameters. The sampleSize() function can accommodate multiple treatments, allowing for the evaluation of equivalence across different products or formulations.

Although trials with multiple arms are common, there is no clear consensus in the literature as to whether statistical corrections should be applied for testing multiple primary hypotheses in such analyses. In SimTOST, no adjustments are made for trials involving more than two treatment arms.

References

Chowdhry, Amit K., John Park, John Kang, Gukan Sakthivel, and Stephanie Pugh. 2024. “Finding Multiple Signals in the Noise: Handling Multiplicity in Clinical Trials.” International Journal of Radiation Oncology*Biology*Physics 119 (3): 750–55. https://doi.org/10.1016/j.ijrobp.2023.12.007.

Committee for Medicinal Products for Human Use (CHMP). 2010. “Guideline on the Investigation of Bioequivalence.” CPMP/EWP/QWP/1401/98. London, UK: European Medicines Agency. https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-investigation-bioequivalence-rev1_en.pdf.

Committee for Propietary Medicinal Products (CPMP). 2002. “Points to Consider on Multiplicity Issues in Clinical Trials.” Scientific Guideline CPMP/EWP/908/99. London, UK: The European Agency for the Evaluation of Medicinal Products.

Mielke, Johanna, Byron Jones, Bernd Jilma, and Franz König. 2018. “Sample Size for Multiple Hypothesis Testing in Biosimilar Development.” Statistics in Biopharmaceutical Research 10 (1): 39–49. https://doi.org/10.1080/19466315.2017.1371071.

Sozu, Takashi, Tomoyuki Sugimoto, Toshimitsu Hamasaki, and Scott R. Evans. 2015a. “Continuous Primary Endpoints.” In Sample Size Determination in Clinical Trials with Multiple Endpoints, 59–68. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-22005-5_5.

———. 2015b. Sample Size Determination in Clinical Trials with Multiple Endpoints. SpringerBriefs in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-22005-5.

Introduction