Sample size calculation for small sample single-arm trials for time-to-event data: Logrank test with normal approximation or test statistic based on exact chi-square distribution?

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract

Background

Sample size calculations are critical to the planning of a clinical trial. For single-arm trials with a time-to-event endpoint, standard software provides only limited options. The most popular option is the log-rank test. A second option, assuming exponentially distributed survival times, is available through some online calculators. Both approaches rely on asymptotic normality of the test statistic and perform well for moderate-to-large sample sizes.

Methods

As many new treatments in the field of oncology are cost-prohibitive and have slow accrual rates, researchers are often faced with the restriction of conducting single-arm trials with potentially small-to-moderate sample sizes. As a practical solution, therefore, we consider the option of performing the sample size calculations using an exact parametric test whose test statistic follows a chi-square distribution. Analytic sample size calculations from the two methods with Weibull-distributed survival times are briefly compared using an example of a clinical trial on cholangiocarcinoma and are verified through simulations.

Results

Our simulations suggest that, for small sample phase II studies, the exact test can offer practical benefits affecting the feasibility, timeliness, financial support, and ‘clinical novelty’ factor of a study. The exact test is a good option for designing small-to-moderate sample trials when accrual and follow-up times are adequate.

Conclusions

Based on our simulations for small sample studies, we conclude that a statistician should assess the sensitivity of calculations obtained through different methods before recommending a sample size to collaborators.

Keywords: Clinical trial, Exact test, Single-arm, Survival, Weibull

1. Introduction

Two-arm randomized clinical trials are the gold standard in biomedical research as they allow performance assessment of a new experimental treatment relative to a standard control. However, there are situations where conducting a two-arm trial is not possible and a single-arm trial may be the preferred choice. For single-arm trials with a time-to-event endpoint, surprisingly few options for sample size calculation are available in the literature or in standard software. The most popular option is the log-rank test [1] and its weighted versions. It has been used for sample size calculations by Finkelstein et al. [2], Kwak and Jung [3], Jung [4], Sun et al. [5] and, more recently, by Wu [6]. Likewise, sample size calculations for exponentially distributed survival times have been proposed by Lawless [7] (available as online calculators; see SWOG [8]). Both approaches rely on asymptotic normality of the test statistic and perform well for moderate-to-large sample sizes. As many new treatments in the field of oncology are cost-prohibitive and have slow accrual rates, researchers are often restricted to conducting single-arm trials with small-to-moderate sample sizes.

The sample size formula proposed by Wu [6] is based on the exact variance of the test statistic and hence is an improvement on earlier versions of the log-rank test. Wu [6] mentions in his concluding remarks that his one-sample log-rank test is conservative when dealing with small samples and that the correctness of its use depends on the correct specification of the underlying distribution of the standard population. In this context, we bring to the reader's attention that a parametric method of calculating sample size for exponentially distributed times was first published by Epstein and Sobel [9]. This method uses a test statistic that follows a chi-square distribution. Later, Narula and Li [10] showed how to extend the calculations to the gamma, Weibull, and Laplace distributions in the uncensored case. One important point to note is that, with their approach, an iterative search algorithm may be needed to calculate the sample size given the values of the other fixed parameters; to avoid this, Narula and Li [10] also mention five different closed-form solutions based on a normal approximation. Surprisingly, popular statistics software does not offer such calculations, though PASS [11] has incorporated the log-rank calculations of Wu [6].

2. Methods

The Weibull distribution is a two-parameter distribution with its pdf given by:

f(t) = \frac{\beta}{\theta^{\beta}}\, t^{\beta - 1} e^{-(t/\theta)^{\beta}}, \qquad \theta, \beta > 0, \; t > 0

Here θ is a scale parameter and β is a shape parameter that determines the shape of the hazard function: β > 1 gives a hazard that increases over time, β < 1 gives a hazard that decreases over time, and β = 1 is the special case of the exponential distribution with constant hazard.
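As a quick illustrative check of these hazard shapes (not part of the authors' program; the parameter values below are arbitrary), one can evaluate the Weibull hazard h(t) = (β/θ)(t/θ)^(β−1) directly:

import numpy as np

def hazard(t, theta, beta):
    # Weibull hazard: h(t) = (beta/theta) * (t/theta)^(beta - 1)
    return (beta / theta) * (t / theta) ** (beta - 1)

t = np.array([1.0, 2.0, 4.0])
for b in (0.5, 1.0, 1.5):
    print(b, hazard(t, theta=3.0, beta=b))
# beta = 0.5: decreasing; beta = 1.0: constant (exponential); beta = 1.5: increasing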

With modern computational tools, we can write an efficient SAS program for an iterative approach using the formula given in Narula and Li [10], accounting for administrative censoring. That is, following Narula and Li [10], the problem of calculating the sample size n (without censoring) in the Weibull case to test the hypothesis H₀: θ = θ₀ against the alternative Hₐ: θ = θ₁ (θ₁ < θ₀), at significance level α and type II error probability γ, reduces to solving for δ using

\delta = \chi^{2}_{1-\gamma}(\nu) \,/\, \chi^{2}_{\alpha}(\nu) \qquad (1)

with δ = θ₀/θ₁ and ν = 2n. Our program then adjusts their method for administrative censoring, accounting for study-specific accrual and follow-up times, in the following way:

Assuming a uniform accrual, the censoring distribution function G(t), i.e., the probability that a subject is still under observation at time t, is given by

G(t) = \begin{cases} 1, & 0 \le t \le f \\ (a + f - t)/a, & f < t \le a + f \\ 0, & t > a + f \end{cases} \qquad (2)

where a and f are the accrual and follow-up time, respectively. Then the probability that a subject experiences a failure during the study is given by

d = \int_{0}^{\infty} G(t)\, f_{1}(t)\, dt \qquad (3)

where f₁(t) is f(t) with θ = θ₁. Dividing the number of events by d gives the sample size adjusted for administrative censoring. Alternatively, d can be approximated using Simpson's rule as

d = 1 - \tfrac{1}{6}\left[ S_{1}(f) + 4\, S_{1}(0.5a + f) + S_{1}(a + f) \right] \qquad (4)

where S₁(t) is the survival function of the Weibull with θ = θ₁.
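To make the above concrete, the following is a minimal Python sketch of the full pipeline. The authors describe a SAS program; this is an independent re-implementation under the parameterization of equations (1)-(4), with function names of our choosing:

import math
from scipy.stats import chi2
from scipy.integrate import quad

def events_exact(delta, alpha=0.05, gamma=0.20, n_max=1000):
    # Smallest number of events n with chi2_{1-gamma}(2n)/chi2_{alpha}(2n) <= delta,
    # i.e., equation (1); the quantile ratio decreases toward 1 as n grows.
    for n in range(2, n_max):
        if chi2.ppf(1 - gamma, 2 * n) / chi2.ppf(alpha, 2 * n) <= delta:
            return n
    raise ValueError("delta too close to 1; no solution below n_max")

def G(t, a, f):
    # Equation (2): probability of still being under observation at time t,
    # with uniform accrual over (0, a) and additional follow-up f.
    if t <= f:
        return 1.0
    return (a + f - t) / a if t <= a + f else 0.0

def prob_event(theta1, beta, a, f):
    # Equation (3): integral of G(t) * f1(t) over (0, inf), where f1 is the
    # Weibull density under the alternative (scale theta1, shape beta).
    f1 = lambda t: (beta / theta1 ** beta) * t ** (beta - 1) * math.exp(-(t / theta1) ** beta)
    d, _ = quad(lambda t: G(t, a, f) * f1(t), 0, a + f)  # G(t) = 0 beyond a + f
    return d

def prob_event_simpson(theta1, beta, a, f):
    # Equation (4): Simpson's-rule shortcut using the Weibull survival function S1.
    S1 = lambda t: math.exp(-(t / theta1) ** beta)
    return 1 - (S1(f) + 4 * S1(0.5 * a + f) + S1(a + f)) / 6

def sample_size(delta, theta1, beta, a, f, alpha=0.05, gamma=0.20):
    # Events from equation (1), inflated for administrative censoring by d.
    n_events = events_exact(delta, alpha, gamma)
    d = prob_event(theta1, beta, a, f)
    # The small tolerance keeps ceil() from adding a subject when d is ~1.
    return n_events, math.ceil(n_events / d - 1e-9)

The quad-based integral (3) and the Simpson's-rule shortcut (4) can be cross-checked against each other for any given design.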

For the Weibull, this allows comparison with Wu [6], and for the special case of the exponential, this allows comparison with Lawless [7]. To do so, we consider a real-life example of designing a phase II clinical trial for treating patients suffering from chemotherapy-refractory advanced metastatic biliary cholangiocarcinoma, a “rare” but aggressive neoplasm. Such patients have metastatic disease and undergo an initial treatment followed by a second-line treatment, which has a progression-free survival (PFS) rate of 5–10% at 1 year. Oncologists are therefore working towards improving PFS by using new combination therapies. Published literature reports a historical median PFS of 2.5 months with an IQR of around 2–5 months. Due to the dismal survival rates, the researchers consider an improvement in the 25th, 50th, and 75th percentiles of PFS by a factor of 1.5 as clinically meaningful and holding promise for future large sample studies. The rarity of the disease poses recruitment problems, with typical accrual rates of approximately 12–15 patients/year. Based on financial and administrative limitations, the researchers envision a study with an accrual time of 2 years and a follow-up time of 3 years. Loss to follow-up is anticipated to be 15–20%. It should be noted that, as the researchers hypothesize a consistent improvement, by a factor of 1.5, in all quantiles of the PFS curve of the historical controls, the Weibull distribution is a good choice for the sample size calculations, as is evident from the definition of δ in (1). Following Wu [6], the shape parameter β for the Weibull is estimated from the historical controls as 1.25 (increasing hazard).
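As a hedged illustration, the sketch above can be applied to this design (a = 24 months, f = 36 months, β = 1.25, median PFS 2.5 months improved 1.5-fold). Because the trial hypothesizes an improvement (θ₁ = 1.5 θ₀) while equation (1) is written for a decrease, we interchange the roles of the α and γ quantiles and measure δ on the (t/θ)^β scale; both choices are our reading, not a prescription from the paper:

import math
from scipy.stats import chi2

alpha, gamma, beta = 0.05, 0.20, 1.25
theta0 = 2.5 / math.log(2) ** (1 / beta)  # Weibull scale giving a 2.5-month median
theta1 = 1.5 * theta0                     # all quantiles improved by a factor of 1.5
delta = 1.5 ** beta                       # improvement ratio on the (t/theta)^beta scale

# One-sided 'improvement' search: quantile roles swapped relative to equation (1).
n_events = next(n for n in range(2, 1000)
                if chi2.ppf(1 - alpha, 2 * n) / chi2.ppf(gamma, 2 * n) <= delta)
d = prob_event(theta1, beta, a=24, f=36)  # from the sketch above; d is close to 1 here
print(n_events, math.ceil(n_events / d - 1e-9))

Under this reading the search stops at 24 events, in line with the sample size of 24 reported in Section 3 for large a and f.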

3. Results

For the study-specific design features in this example, Table 1 shows comparisons between the two methods for values of the shape parameter β ranging from 0.5 to 1.5. The conservative nature of the log-rank test can be studied by observing in Table 1 how the sample sizes vary as a function of accrual and follow-up time (keeping the other design parameters fixed) for the different values of the shape parameter. When β = 0.5, Wu's log-rank test gives smaller sample sizes than the exact method only when both a and f are small in magnitude. On the other hand, as either a or f increases, the exact test yields smaller sample sizes. This general pattern becomes more accentuated as β increases from 0.5 to 1.5. In fact, for β = 1.5, only a ≤ 3 and f ≤ 3 allow the log-rank test to have smaller sample sizes than the exact method. Since the researchers in the cholangiocarcinoma example hypothesize an improvement in median PFS by a factor of 1.5, small values of a and f are impractical: given the accrual rate, very few patients could participate in the study.

Table 1

Number of events/sample size for the exact method vs Wu's method (administrative censoring adjustment by equation (4)) for different values of the Weibull shape parameter β, accrual time a, and follow-up time f.

a = accrual time in months, f = follow-up time in months.

‘Exact’ refers to the exact calculations done using the chi-square distribution.

Note: The calculations given by Lawless [7] were done using an online calculator from SWOG [8] and show only the total sample size, not the number of events.

For this example, where β = 1.25 is chosen, the exact method gives a sample size of 24 when a ≥ 15 and f ≥ 12, whereas the log-rank test cannot go below 29 events no matter how large a and f are chosen. That is, even with the flexibility to follow patients for a hypothetically long time and thereby observe all events, Wu's method does not go below a threshold of 29 events. Simulations (10,000 replicates) from the Weibull distribution show that, with large follow-up times, 80% power is achieved with as few as 24 subjects, and the exact method analytically yields this sample size of 24. Adopting the popular ad-hoc method of inflating the sample size to accommodate drop-outs (conservatively assuming they provide no extra information), the adjusted sample size is 24/0.8 = 30. That is, if the researchers' ‘optimistic’ estimate of accruing 15 patients/year holds, it appears likely that this study can be completed within the stipulated timeframe. A similar ad-hoc adjustment for Wu's method would require 37 patients to be enrolled, which is outside the practical timeframe of the study. However, by assuming that drop-outs occur randomly over the study period (uniformly distributed), for an anticipated drop-out rate of 20%, our simulations gave a sample size of n = 28 with 80.6% power. Thus, analytical calculations using the exact method, aided by further simulations, can enable a statistician to design a small sample trial with adequate power. If additional information from similar studies is available, a statistician can also incorporate other drop-out mechanisms (such as exponentially distributed drop-out times with a specific mean).
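The drop-out reasoning above can be probed with a small Monte Carlo sketch. The version below estimates only d, the probability that an enrolled subject yields an observed event, under uniform entry on (0, a) and, as one possible reading of "drop-outs occur randomly over the study period", drop-out times uniform on (0, a + f). It does not re-run the exact test, so it is a rough plausibility check rather than a replication of the authors' power simulations (which gave n = 28):

import math
import numpy as np

rng = np.random.default_rng(2023)               # seed chosen arbitrarily
a, f, beta = 24.0, 36.0, 1.25
theta0 = 2.5 / math.log(2) ** (1 / beta)        # scale giving a 2.5-month median
theta1 = 1.5 * theta0                           # scale under the alternative
n_events_needed, dropout_rate, reps = 24, 0.20, 100_000

entry = rng.uniform(0, a, reps)
admin = a + f - entry                           # administrative censoring times
event = theta1 * rng.weibull(beta, reps)        # event times under the alternative
drop = np.where(rng.uniform(size=reps) < dropout_rate,
                rng.uniform(0, a + f, reps), np.inf)
d_hat = np.mean(event <= np.minimum(admin, drop))
print(d_hat, math.ceil(n_events_needed / d_hat))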

Similar comparisons can be performed for other values of β, such as β = 0.75 (decreasing hazard) and β = 1 (constant hazard, i.e., the exponential distribution). In the case of β = 1, the normal approximation proposed by Lawless [7] gives smaller n than the exact method for small-to-moderate values of follow-up time. However, with large values of follow-up time this is no longer the case, and the normal approximation cannot yield sample sizes below n = 41. Simulations (10,000 replicates) from the exponential distribution show that, with large follow-up times, 80% power is achieved with as few as 37 subjects; the exact method analytically yields 37, whereas the normal approximation method and the log-rank test yield sample sizes of 40 and 43, respectively.
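For the uncensored exponential case this claim is easy to verify directly, because under H₀ the statistic T = 2Σtᵢ/θ₀ follows a χ²(2n) distribution exactly. A minimal sketch, assuming the same 2.5-month median, a 1.5-fold improvement, and rejection for large T (the improvement direction is our assumption):

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2023)   # seed chosen arbitrarily
n, reps, alpha = 37, 10_000, 0.05
theta0 = 2.5 / np.log(2)            # exponential scale giving a 2.5-month median
theta1 = 1.5 * theta0               # median improved to 3.75 months

t = rng.exponential(theta1, size=(reps, n))
T = 2 * t.sum(axis=1) / theta0      # ~ chi-square(2n) when theta = theta0
power = np.mean(T > chi2.ppf(1 - alpha, 2 * n))
print(power)                        # should land just above the 80% target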

Though the exact method yields smaller sample sizes in many situations, it is necessary to assess whether the empirical type I error rate and empirical power are close to their nominal values. To do this, a simulation study (with 10,000 simulations) was conducted with the study-specific design parameters of the cholangiocarcinoma study. For varying values of a and f, with β ranging from 0.5 to 1.5, time-to-event data were simulated with the sample sizes calculated by the exact method, and the results are displayed in Table 2. From this table it can be seen that for almost all scenarios the empirical type I error rates were close to the nominal 5% alpha level. Likewise, empirical power always slightly exceeded the target 80% power. Except for seemingly impractical values such as a = 0 (all subjects are available for recruitment at the start) and f = 1 (a very short follow-up time for the study under consideration), there was not much difference in the magnitude by which empirical power exceeded the target power.

As the results displayed in Table 2 are in the context of a specific example with a fixed effect size, a similar evaluation of empirical type I error and power was conducted for the hypothetical example discussed in Wu [6], wherein a = 3 and f = 1 were kept fixed but the effect size was varied from small to large, for β = 0.1, 0.25, 0.5, 1, 2, and 5. The results of these simulations are shown in Table 3, with target power at 90% and median time under the null hypothesis fixed at 1. Here too, for all scenarios, empirical type I error was close to the nominal level, and likewise, empirical power always slightly exceeded the 90% target level. For β ≤ 1, the exact method yielded somewhat higher empirical power compared to Wu's method, while the converse was true for β > 1. Though at first glance Table 3 suggests that the exact method yields smaller sample sizes than the log-rank test (with both methods having comparable empirical type I error and power) only for β ≥ 1, it should be noted that the combination of a = 3 and f = 1 represents values that are quite small compared to the magnitude of the hypothesized improvement in median lifetime under the alternative hypothesis. For example, the combination of β = 0.25 and δ = 1.6 indicates that the median under the alternative hypothesis is 6.55 times the median under the null (which is fixed at 1). Likewise, the combination of β = 0.10 and δ = 2 indicates that the median under the alternative hypothesis is 1024 times the median under the null. Thus, in such scenarios, it would be impractical to choose small values of a and f. As the values of a and f increase, compared to the log-rank test, the exact method gives smaller sample sizes for β ≥ 0.5 and the same sample sizes for β = 0.25 (see results in Table 4). Though the log-rank test gives smaller sample sizes for β = 0.1, such a small value of the shape parameter would require very strong justification in a real-life clinical trial.
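For reference, the median arithmetic in the preceding paragraph follows from the Weibull quantile function under a proportional-hazards-type alternative S₁(t) = S₀(t)^{1/δ}, which is our reading of the δ used in Wu's example:

S_{1}(t) = S_{0}(t)^{1/\delta} \;\Longrightarrow\; \theta_{1} = \theta_{0}\, \delta^{1/\beta} \;\Longrightarrow\; \frac{t_{\mathrm{med},1}}{t_{\mathrm{med},0}} = \delta^{1/\beta}, \qquad 1.6^{1/0.25} = 1.6^{4} \approx 6.55, \qquad 2^{1/0.10} = 2^{10} = 1024.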

Table 2

Evaluation of empirical type I error and empirical power using the exact method for the cholangiocarcinoma study, with H₀: t_med ≤ 2.5 months and effect size equal to an improvement in median time by a factor of 1.5, using 10,000 simulations (nominal type I error 5%, target power 80%).