This article, by Fredric Cohen, discusses "Cancer's Cruel Economics," the article posted last night, for those of you with an interest in biotech and cancer drug development.
Evaluating cancer drugs at FDA
In the June 2nd print issue of BusinessWeek (published online 5/21), the article “Cancer’s Cruel Economics” by Catherine Arnst provides a high-level look at the difficulty some small companies are facing in getting their cancer therapies approved in the US.
The focus is on cancer immunotherapies, particularly Antigenics’ Oncophage. I last discussed Oncophage in 2006, after the first report of its Phase 3 results. I have also devoted space in this forum to other cancer therapies mentioned in the BusinessWeek article including Dendreon’s Provenge and Genitope’s MyVax.
I was intrigued by a quote attributed to Richard Pazdur, head of CDER’s Oncology review division:
“[Post hoc subgroup analysis for differential treatment effect] is like shooting an arrow and then painting the bull’s-eye around it,” says Pazdur. “You cannot use subset analysis to salvage a failed trial.”
Pazdur’s concern regarding treatment effect inferences derived from post hoc subgroup (subset) analyses rests on firm grounds, but the quote suggests a black and white attitude towards their utility, without any room for compromise. That’s too bad, because the rule of thumb Pazdur is apparently using to reject subgroup evidence of efficacy is imperfect, undoubtedly resulting in the rejection of some effective therapies.
I’m not going to write a manuscript-length post describing the many risks inherent in inferential subgroup analyses; there are many published reviews that do that. Suffice it to say that the risks of both false-negative and false-positive inferences are inflated in subgroup analyses relative to the main analysis (the primary hypothesis test), whether the analyses are pre hoc (defined before the trial results accrue) or post hoc (sometimes called retrospective). Pre hoc analyses are less susceptible than post hoc analyses to Pazdur’s target drawing, especially when the specifics of the subgroup are rigorously pre-defined, and so they are preferred by regulators.
What I’ve found to be less well represented in the literature is how to judge when the weight of evidence presented in a subgroup analysis is sufficient to, in Pazdur’s words, “salvage a failed trial.” I’ll not focus specifically on Provenge or Oncophage, but the example from the literature I’ll cite is relevant to both.
To determine whether a subgroup analysis provides evidence sufficient to warrant drug approval, it is necessary to first know the expected false-positive rate of a subgroup analysis, given a false-positive rate of 5% in the overall (main) analysis. A 5% rate is chosen because that rate is usually considered acceptable by clinical practitioners and drug regulators. Thus, the null hypothesis is rejected falsely in 1 in 20 trials. FDA usually requires two independent experiments (trials) as evidence of efficacy, resulting in an overall false-positive rate of 0.25% (0.05 × 0.05 = 0.0025), though one statistically significant experiment with corroborating evidence from others is sometimes sufficient, particularly for accelerated approvals.
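As a quick sanity check on the arithmetic, the combined false-positive rate for two independent trials is simply the product of the individual rates:

```python
# Probability that two independent trials are BOTH falsely positive
# when each uses a two-sided significance level of 0.05.
alpha_single = 0.05
alpha_two_trials = alpha_single ** 2
print(round(alpha_two_trials, 4))  # 0.0025, i.e. 0.25%, or 1 in 400
```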
In an important study published in 2001 by the UK’s NHS, Brookes et al used simulations of 100,000 clinical trials each to determine the false-positive (and false-negative) rates of subgroup analyses for different types of study designs. They simulated two subgroup analyses (ignoring the effects of multiple analyses, which inflate Type 1 error) and tested a variety of relative treatment effect sizes and subgroup sizes.
The simulations showed that when there was in reality no main or subgroup effect of treatment, and the overall (main) analysis of treatment was falsely positive (i.e., the null hypothesis was rejected at the nominal p<0.05), the chance of falsely declaring one subgroup as demonstrating a treatment effect was high. For a survival study, this chance was 61%. In other words, with no real main treatment or subgroup effect, when there was a false-positive main effect, one of the two subgroups analyzed appeared to have a treatment effect well over half the time.
However, under the same set of circumstances, when the main effect is not rejected (i.e. a true negative inference is made), then one subgroup will show evidence of a treatment effect much less often, only 6.5% of the time, approaching the overall effect false-positive rate of 5%. In other words, the probability of falsely rejecting a subgroup-specific null hypothesis in the absence of overall and subgroup effects is reasonably low if the overall effect is correctly negative.
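The qualitative pattern described above is easy to reproduce with a small Monte Carlo experiment. Below is a simplified sketch, not a reproduction of Brookes et al’s survival simulations: continuous outcomes and a two-sample z-test stand in for their survival analyses, and the sample sizes and trial counts are arbitrary choices of mine.

```python
import math
import random

def z_test_p(a, b):
    """Two-sided two-sample z-test p-value (both variances assumed to be 1)."""
    na, nb = len(a), len(b)
    z = (sum(a) / na - sum(b) / nb) / math.sqrt(1 / na + 1 / nb)
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_trials=20000, n_per_arm=100, seed=1):
    """Estimate P(some subgroup 'significant') conditional on the main result.

    No true treatment effect exists anywhere, so every rejection is false.
    Returns (rate given main effect significant, rate given main effect not).
    """
    random.seed(seed)
    sub_pos_given_main_pos = sub_pos_given_main_neg = 0
    main_pos = main_neg = 0
    for _ in range(n_trials):
        treat = [random.gauss(0, 1) for _ in range(n_per_arm)]
        ctrl = [random.gauss(0, 1) for _ in range(n_per_arm)]
        half = n_per_arm // 2
        main_sig = z_test_p(treat, ctrl) < 0.05
        # Two equal-size subgroups (e.g. men vs. women), tested separately.
        sub_sig = (z_test_p(treat[:half], ctrl[:half]) < 0.05 or
                   z_test_p(treat[half:], ctrl[half:]) < 0.05)
        if main_sig:
            main_pos += 1
            sub_pos_given_main_pos += sub_sig
        else:
            main_neg += 1
            sub_pos_given_main_neg += sub_sig
    return (sub_pos_given_main_pos / main_pos,
            sub_pos_given_main_neg / main_neg)

rate_main_pos, rate_main_neg = simulate()
print(f"subgroup FP rate | main positive: {rate_main_pos:.2f}")
print(f"subgroup FP rate | main negative: {rate_main_neg:.2f}")
```

Because the subgroup tests reuse the same data as the main test, a false-positive main result drags the subgroup results with it: the conditional rate given a significant main effect comes out several times higher than the rate given a non-significant one, qualitatively matching the 61% vs. 6.5% contrast reported above.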
Of course, the above simulation findings aren’t by themselves capable of determining whether a subgroup-specific effect is real or not. They simply suggest that the regulator need not reject out-of-hand statistical evidence of a subgroup-differential treatment effect when evidence for an overall effect is absent, as Dr. Pazdur’s quote suggests he is willing to do in some cases.
Evidence that the apparent subgroup effect in a survival study is real will be strengthened by the following factors:
- The main effect does not contradict the purported subgroup effect
- The subgroup-specific analysis was defined a priori
- A significant test of treatment-by-subgroup interaction is in evidence prior to any subgroup-specific test
- The total number of subgroups analyzed is small, and, if not, an inference of treatment effect made on any one subgroup uses an appropriately conservative adjustment of the significance level
- There is strong biological plausibility for the differential subgroup effect
- The size of the subgroup is large relative to the total sample size (i.e. relatively representative of the total population)
- The conduct of the study, particularly the handling of dropouts and non-compliant subjects, creates confidence in the quality of the subgroup data
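The significance-level adjustment mentioned in the list above can be as simple as a Bonferroni correction, in which the overall alpha is divided by the number of subgroup tests performed. A minimal sketch (the function name and the choice of ten subgroups are mine, for illustration):

```python
def bonferroni_alpha(overall_alpha, n_subgroup_tests):
    """Per-subgroup significance threshold under a Bonferroni correction."""
    return overall_alpha / n_subgroup_tests

# With 10 subgroups examined, a subgroup p-value must beat 0.005,
# not 0.05, before it counts as evidence of a differential effect.
print(bonferroni_alpha(0.05, 10))
```

Bonferroni is deliberately conservative; the point is that the more subgroups a sponsor examines, the stronger each subgroup result must be before a regulator should take it seriously.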
Finally, as I’ve argued before in the case of Provenge, when the evidence of efficacy is marginal, regulators have a duty to the public they serve to weigh with utmost care and without bias the risk of introducing an ineffective medicine versus the risk of withholding ready availability of an effective medicine from a gravely ill population without other treatment options.