Radiological devices panel meeting


Clinical Studies

DR. HENDRICK: It is a pleasure to be here. It is also a pleasure to follow the person presenting the physics and to get the clinical results for a change.

My name is Ed Hendrick. I want to do the public-disclosure thing so I want to let you know I own a few shares of G.E., unfortunately not a significant fraction of the company. My wife owns some shares of G.E. as well.

I have had research agreements with G.E. at the University of Colorado Health Sciences Center where I was a professor prior to October and I hope to have a research agreement with G.E. at Northwestern University but that hasn't been executed yet. It might be in the millennium if we can keep it out of the hands of the lawyers from the two institutions.


I want to present the clinical results that have come about from the trial that we have conducted. The goal of this is to establish the noninferiority of digital compared to screen film. Based on the meetings that we had with the FDA and the public meeting in August of 1998, we adopted a noninferiority approach rather than an equivalence approach. That is what I want to discuss.

So we are following the guidelines that were laid out by Dr. Schultz in addressing the PMA approach which is to establish the safety and effectiveness of full-field digital mammography both for screening and diagnosis of breast cancer.


The presentation will first talk about the study cohort, the people that were enrolled, the women that were enrolled in the study. And then I will talk about the results of two reader studies and the results of a side-by-side analysis comparing features on digital to features on film screen side by side and then I will give some conclusions.


For the enrollment of clinical subjects, all of these were consented by IRB and they were enrolled at four institutions consisting of women over the age of 40 attending for diagnostic mammography. The four institutions were my former institution, the University of Colorado Health Sciences Center, the University of Massachusetts Medical Center, Mass General Hospital and the University of Pennsylvania Hospital.


The exclusion criteria for women in the study were women under the age of 40, women who were pregnant or suspected of being pregnant, women with breast implants, women with breasts too large to fit on a 24-by-30-cm image receptor, which is the larger image receptor used for film-screen mammography, women who didn't qualify for diagnostic mammography because they had non-focal or bilateral breast pain, and women who were unable or unwilling to execute the consent form.


The study cohort for the diagnostic study consisted of 641 women enrolled as diagnostic subjects at those four institutions. There were an additional 21 women with cancer, the first 21 cancers found among about 4,000 women who had been screened at that time in a separate screening-based study comparing digital mammography to film screen, which was being conducted at the University of Colorado Health Sciences Center and the University of Massachusetts Medical Center. Those 21 women were additionally consented to have their images included in the reading studies that took place for this PMA.


The patient demographics of the diagnostic study population are given here. The mean age was 55 and the range from 40 to 86. The ethnicity is given in the table here as well. 34 percent reported a history of breast disease and 33 percent reported a history of hormone-replacement therapy.


The imaging techniques that were used were two views of each breast, the standard CC and MLO views, in both film-screen mammography and digital mammography. They were performed on each study volunteer. 59 percent were bilateral exams, 41 percent unilateral diagnostic exams.


The same target, filter, kVp and approximately the same mAs were used on both the full-field digital system as were used on the film-screen system. When the mAs couldn't be matched exactly, the full-field digital used a slightly lower mAs to ensure that we had equal or slightly lower doses in full-field digital compared to screen-film mammography.

The technologist, in most cases, was the same person performing screen film and full-field digital, and they used the same basic X-ray design for the compression and positioning of the breast: the G.E. DMR system for film screen and the prototype digital systems based on the G.E. DMR, so that positioning and compression forces were similar in the two modalities.


Just to present one of the more important results in terms of safety, there were no adverse consequences, serious or otherwise, reported among all the study subjects in the study cohort in this PMA study.


Let me talk about the different reading studies and the side-by-side analysis that were done. These were conducted sort of consecutively. The first reading study used an adjudication process where each image was read by two reviewers. So full-field digital was read by two reviewers and screen film was read by the same two reviewers.

One of the designs of this first reading study was that the readers would not be at the institutions in which the images were acquired. That is actually a design of both reader studies. In the first reading study, we had 646 subjects getting both full-field digital and screen-film images. 47 of those were cancers. 599 were non-cancers.

When I say they were read by two or three, it is because of the adjudication process. If the two initial readers agreed on the positivity or negativity on a given modality--say, for screen film--then that was the determination for that modality. But if they disagreed for screen film on whether it was positive or negative, then it went on to an adjudicating reader. That was the third reader who was the tie-breaker and decided, in the adjudicated readings, whether it was positive or negative.
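As an illustration of the adjudication rule just described, here is a minimal Python sketch. The function name and the boolean encoding (True for a positive reading) are assumptions made for the example and are not part of the study protocol.

```python
from typing import Callable


def adjudicated_reading(reader_a: bool, reader_b: bool,
                        tie_breaker: Callable[[], bool]) -> bool:
    """Return the adjudicated call (True = positive) for one case on one modality.

    Hypothetical helper: the two primary readings stand when they agree;
    the third reader is consulted only when they disagree.
    """
    if reader_a == reader_b:
        return reader_a
    return tie_breaker()


# Agreement: the third reader is never consulted.
print(adjudicated_reading(True, True, lambda: False))   # True
# Disagreement: the third reader's call decides.
print(adjudicated_reading(True, False, lambda: False))  # False
```

Passing the third reading as a callable mirrors the process described, in which the adjudicating reader only sees the cases where the two primary readers disagree.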

That design was eliminated in the second study which consisted of 625 subjects getting both digital and screen film. In that study, there were five readers reading every image and each reader read both digital and screen-film images and, in each case, those were spaced out in time to avoid any kind of recall effect.

So there was at least a 30-day period between a reader reading a woman's, say, screen film and reading their digital image. Each reader read half the screen-film images first and half digital images first.

I will talk about the side-by-side reading study a little later.


The differences between the first and second reader studies were that, for the first reader study, each case had two primary readers and, if they differed in positivity or negativity, a secondary reader, who was actually a third reader, adjudicated.

The data were analyzed in two ways based on the primary interpretations and also analyzed based on the adjudicated interpretation. In reader study No. 2, all five readers read each case on each modality. Part of the reason for going on and conducting reader study No. 2 was the analysis of data in reader study No. 1 was made more difficult by the fact that not all readers read all images.

So it eliminated some of the possible statistical methods that could account for multiple readings of the same images or it made it much more difficult to conduct those kinds of analyses.


Also, after conducting the first reader study, we learned some things about digital mammography that helped us do a better job in the second reader study. One of the things we learned was, in doing the side-by-side analysis between the first and second reader study, that in a few images, there were lesion markers on some films on one modality that were not visible in the other modality.

So, in preparation for the second reader study, we eliminated any images where there were different markers on one modality than on the second. Also, we learned a lot about printing digital images in the course of conducting the first study and recognized that the print quality on the digital images wasn't always up to par.

I, personally, reviewed the digital images not looking at the film screen but looking at the quality of printouts on the digital images and had some of those images reprinted prior to the conduct of the second study.

The readers used in the second study were from a single institution and they were tested and selected--we tested nine people and picked five readers out of that group of nine. Then the readers received uniform instructions prior to the study.

One of the things we didn't do in the first reader study that we should have done was tell readers that these images should have been read as a screening exam since there weren't prior films. And we didn't do that, so, in the first study, we learned some readers read these as screening cases, some read them as diagnostic cases, which led to big differences among the performance of the readers in the first study.

In the second study, we instructed the readers to read these as if they were screening cases since there weren't prior films or the presence of a diagnostic workup on those images.


In both reader studies, we asked the readers to provide a BIRADS code, 0, 1, 2, 3, 4 or 5, and the 0 was included because we did want them to read it as a screening study. And, in addition, for anything that had any BIRADS code other than 1 or 2, we asked them to provide a percent probability of that identified lesion or breast as having cancer. That was on an integer scale from 0 to 100 percent.

In the side-by-side reader study, it was a different design with the Likert scale that I will describe in just a minute.


The null hypotheses are the key to this non-inferiority approach. The null hypothesis is sort of the straw man that you set up to see if you can reject that based on the data. The null hypotheses are in three areas in terms of recall rates, or specificity--and just bear in mind, the specificity is 1 minus the recall rate.

In terms of recall rates, the null hypothesis was that digital had a higher recall rate than screen-film mammography by 0.05 or more. One of the concerns that FDA had with this new technology was that digital would have a higher recall rate, prompt the recall of more women, but not find more breast cancer. So the straw man for recall rates is that digital has a higher recall rate by 0.05 or more.

For sensitivity, the null hypothesis is that digital has a lower sensitivity than screen film by 0.1 or more. And, for ROC curve areas, the null hypothesis is that digital has a lower ROC curve area by 0.1 or more compared to screen-film mammography.

We collected data and analyzed data in terms of recall rates, sensitivity and ROC curve areas to test the null hypotheses.
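The three null hypotheses share one structure: digital is assumed worse than screen film by a stated margin or more, and the data are used to try to reject that. The following Python sketch shows this logic with a simple one-sided normal approximation. It is only an illustration: the function name, the standard error and the example difference are hypothetical, and the sketch does not account for the correlation that arises when multiple readers read the same images.

```python
from statistics import NormalDist

# Margins stated in the presentation (difference = digital minus screen film).
RECALL_MARGIN = 0.05        # H0: recall-rate difference >= +0.05
SENSITIVITY_MARGIN = -0.10  # H0: sensitivity difference <= -0.10
ROC_AREA_MARGIN = -0.10     # H0: ROC-area difference <= -0.10


def noninferiority_p_value(diff: float, std_err: float, margin: float,
                           higher_is_worse: bool) -> float:
    """One-sided p-value for H0 'digital is worse than film by the margin or more'.

    Normal approximation; diff is digital minus screen film.
    """
    z = (diff - margin) / std_err
    cdf = NormalDist().cdf(z)
    # Recall rate: higher is worse, so a very negative z is evidence against H0.
    # Sensitivity and ROC area: lower is worse, so a very positive z is evidence against H0.
    return cdf if higher_is_worse else 1.0 - cdf


# Hypothetical inputs: recall rate 2 points lower on digital, standard error 0.015.
p = noninferiority_p_value(-0.02, 0.015, RECALL_MARGIN, higher_is_worse=True)
print(f"one-sided p-value: {p:.2g}")
```

A small p-value here means the straw man (digital worse by the margin or more) is rejected, which is what noninferiority requires. It does not by itself show that digital is better.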


Here is the first set of data from the first reading study. These data are for recall rates. In this table, there are two columns, one for all cases and then for non-cancer cases. In each case, the recall rate for digital was lower than the recall rate for screen film. When we tested the null hypothesis without taking into account the correlation that multiple readers read the same films, we would get a p-value of less than 0.001 in each case.

Also, in the adjudicated readings, which don't have the correlation problem because we get a single determination based on the best two out of three readings, digital also had a lower recall rate and we were able to reject the null hypothesis with a high degree of statistical significance.


This is where the statistics comes in. These terms PROC MIXED and PROC GENMOD are just fancy names for other statistical tests that were conducted to analyze the effect, and take out the effect, of multiple readers reading the same cases. In the case of PROC MIXED, it includes all the cases that were read.

In the PROC GENMOD method, it only includes the cases where the readers disagreed between the two modalities. So it only includes the cases where digital recalled the patient and film screen didn't, or film screen recalled the patient and digital didn't.

It doesn't include the data where the two modalities agree and, because there are fewer data in the disagreement areas, the p-values are somewhat higher. But, in all cases analyzing recall rate, digital had a statistically significantly lower recall rate in that we were able to reject the null hypothesis regardless of the specific statistical tests that were used to account for multiple readers reading the same images.


In terms of sensitivity in reader study No. 1, digital had a sensitivity rate of 78 percent, just combining the primary readings of all the cases, and screen film had a sensitivity rate of 74 percent. In either case, where we didn't take into account the correlation among the readings by the two primary readers of each case or where we did take that into account by the PROC MIXED method, we get a statistically significant rejection of the null hypothesis.

So digital doesn't have a significantly lower sensitivity than film screen.

When we look at the adjudicated readings, the two modalities had exactly the same sensitivity and we were right on the edge of being able to reject the null hypothesis with statistical significance.


One of the concerns the FDA had in the design of this kind of trial was that if there is some hand picking of the cases involved that the study could use larger, easier-to-detect, later-stage cancers that really make no difference in the outcome for the woman.

What we did was took all women coming for diagnostic mammography plus this subset of cancers from the screening study without any selection process along the way. I think that is reflected in the stage distribution and the size distribution of the cancers that were detected in this study.

There were 47 cancers in reader study No. 1 and, of those, a total of 58 percent were stage 0 or I. The AHCPR guidelines recommend that, in a good mammography practice, you should have greater than 50 percent of your cancers be stage 0 or I. The stage distribution of the study group exceeded that AHCPR guideline.

In terms of minimal cancers--that is, stage 0 or I cancers that are less than 1 centimeter in size--it was 39 percent in this study group for reader study No. 1 and AHCPR recommends that, in a good mammography practice, that should exceed 30 percent. So we met those criteria for the kind of stage distribution and size distribution that you would hope to find in a good mammography practice.
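The comparison against the two AHCPR benchmarks quoted above amounts to a pair of threshold checks, sketched here in Python. The constant and function names are made up for the example; the percentages are the ones quoted for reader study No. 1.

```python
# Benchmarks as quoted in the presentation (fractions of all detected cancers).
STAGE_0_OR_I_MIN = 0.50    # more than 50 percent should be stage 0 or I
MINIMAL_CANCER_MIN = 0.30  # more than 30 percent should be minimal cancers


def meets_benchmarks(stage_0_or_i: float, minimal: float) -> bool:
    """Return True when both quoted benchmarks are exceeded."""
    return stage_0_or_i > STAGE_0_OR_I_MIN and minimal > MINIMAL_CANCER_MIN


# Reader study No. 1: 58 percent stage 0 or I, 39 percent minimal cancers.
print(meets_benchmarks(0.58, 0.39))  # True
```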


Looking specifically at the way digital compared in terms of sensitivity for these earlier stage and minimal cancers, digital actually did even better compared to screen film in terms of sensitivity for stage 0 and I cancers based on 27 in those two categories. Digital had an 85 percent sensitivity compared to 74 for screen film. For minimal cancers, digital had a sensitivity of 83 percent compared to 70 for screen film.

So for the cancers that are probably the most critical in terms of making a difference in saving the woman's life, digital did even better than screen film.


Here are the ROC curve areas for digital compared to screen film. Digital had a lower ROC curve area. This is combining all the primary readings in study No. 1. Digital had a lower ROC curve area by 0.01, actually 0.009. So, for all practical purposes, the ROC curve areas were the same.

When we applied the statistical test to reject the null hypothesis that digital had a lower ROC curve area by 0.1 or more, we were able to do that with either all primary readings combined or with the adjudicated readings.


In summary, for reader study No. 1, we were able to show that digital did not have a significantly higher recall rate--in fact, it had a lower recall rate than screen film. We were able to show that it had very similar sensitivity to screen film and somewhat better sensitivity for smaller, earlier-stage cancers, and that digital had a comparable ROC curve area to screen film. We were able to reject the null hypotheses of poorer performance of digital in each case.

For reader study No. 2, we got very similar results to reader study No. 1. Remember, this is based on all the cases being read by five MQSA-qualified radiologists reading both digital and film screen with the separation of at least 30 days between the readings of the two different modalities.

All cases analyzed had digital with a 2 percent lower recall rate than screen film and we were able to reject the null hypothesis again, and, if you looked at just all non-cancer cases, the same 2 percent difference with a strong rejection of the null hypothesis.

When we used these statistical methods that took into account the correlation now among the five different readers in terms of recall rate, we were still getting a 2 percent lower recall rate for digital compared to screen film, with a highly significant rejection of the null hypothesis when we included all cases and a reasonable rejection, in terms of significance, of the null hypothesis when we only included the cases where there was disagreement between the two modalities by a given reader in terms of recall.


In terms of sensitivity in reader study No. 2, digital had a 68 percent sensitivity. Screen film had a 70 percent sensitivity. But we were still able to reject the null hypothesis when we included--did the statistical evaluation without taking into account the correlation among readers or when we used all cases and took into account the correlation among the readers with this PROC MIXED method.


In terms of the cancer distribution in reader study No. 2, 61 percent of the cancers in reader study No. 2 were stage 0 or I so these were even a slightly better distribution toward earlier stage cancers and slightly better toward minimal cancers.

43 percent had cancers less than 1 centimeter in size in stage 0 and I meeting the AHCPR guidelines for this study as well.


In terms of the sensitivity of digital in stage 0 and I cancers, it was about exactly the same for digital and screen film, and for minimal cancers, digital did slightly better. But, again, this is a more limited number of cases on which these numbers are based.


The ROC curves are remarkably similar between digital and screen film in reader study No. 2. This is combining the results of all five readers in a breast-by-breast analysis. Not only are the areas the same within 0.001, the areas under the ROC curves, but the shapes of the ROC curves are virtually identical.

When we use these data to test the null hypothesis that digital has an ROC curve area lower than screen film by 0.1 or more, unadjusted for the correlation between readers, we have a high degree of significance. Even adjusted for multiple readers, we have a high degree of significance in rejecting the null hypothesis that digital has a significantly lower ROC curve area.

So, from these results, we can conclude, also in reader study No. 2, that digital is noninferior in terms of recall rate. It doesn't recall more women than screen film. Digital is noninferior in terms of sensitivity. It has a comparable sensitivity and it has virtually identical ROC curves to screen film as well.

We are able, statistically, to reject the null hypothesis of digital being worse than screen film in this study.


The side-by-side analysis was a different kind of analysis to look at how lesions appeared in digital images compared to screen-film images. We limited the case selection for the side-by-side analysis to the first 40 cancer cases that were collected in the reader studies including some of the screening cancer cases.

So the readers sat with screen film on the viewboxes and the printed digital hard copy images on view boxes, looking at them side-by-side and used the Likert scale, which is a ranking scale with eleven points on it, to assess whether lesion conspicuity was better in one modality than another, whether there was more inclusion of tissue at the chest wall in one modality or another, or whether the visibility of tissue at the skin line was better in one modality or another.

Obviously, the most important of these is lesion conspicuity between the two modalities. But we also wanted to make sure that, in acquiring the digital images, that there was not a loss of tissue at the chest wall because of the digital detector design or some compromise in the appearance of tissue at the skin line because of the image acquisition.


The eleven-point Likert scale is shown here graphically. The five radiologists who did the side-by-side analysis could pick a score from 0 to 10. For example, on lesion conspicuity, if they thought the lesion was equally visible on both film screen and digital, they would give it a score of 5.

If they saw the lesion only in digital and not in film screen, they would give it a score of 0 and, if they saw it only in screen film and not in digital, they would give it a score of 10.

We found, from our results, that the radiologist did use the full range of this scale.


The null hypothesis was that screen film was better than digital in terms of each of these assessment areas by a score of one point or more on the Likert scale. Screen film being better is toward the high end of the scale, so the null hypothesis was that, in each of these areas, screen film, the score would be greater than or equal to 6. We tested against that.
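To make the side-by-side test concrete, the Python sketch below tests the null hypothesis that the mean Likert score is 6 or higher. It uses a one-sided normal approximation and treats the scores as independent, which is a simplification. The function name and the example scores are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence

EQUAL_SCORE = 5     # lesion equally visible on both modalities
NULL_THRESHOLD = 6  # H0: mean score >= 6 (screen film better by one point or more)


def likert_p_value(scores: Sequence[float]) -> float:
    """One-sided p-value for H0: mean score >= NULL_THRESHOLD (normal approximation)."""
    std_err = stdev(scores) / sqrt(len(scores))
    z = (mean(scores) - NULL_THRESHOLD) / std_err
    return NormalDist().cdf(z)


# Hypothetical scores for illustration only; not data from the study.
example_scores = [5, 4, 6, 5, 5, 3, 7, 5, 4, 6, 5, 5]
print(likert_p_value(example_scores) < 0.05)  # True
```

A mean well below 6 with a small p-value rejects the straw man that screen film is better by a point or more. A mean below `EQUAL_SCORE` would additionally suggest that readers favored digital.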


The actual results: in terms of lesion conspicuity, the mean score was 5.17. This range is averaging over the 40 cancers. Actually, two views were scored separately for each of the 40 cancers. This is averaging over the 40 cancers and looking at the range of reviewers averaged over the 40 cancer cases.

The view range is looking at the range averaging over the five reviewers and looking at the range applied over the 40 cancer cases. So the fact that there is a 0 here means that all five radiologists gave it a score of 0, not just one of them, because this is an average over the five radiologists and the maximum score of 9.8 means that, on one case, four radiologists gave it a 10 and one radiologist gave it a 9 which would be strongly in favor of screen film.
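The arithmetic behind those two extremes can be checked in a couple of lines of Python. The variable names are illustrative only.

```python
from statistics import mean

# Four radiologists score a view 10 and one scores it 9.
max_view_scores = [10, 10, 10, 10, 9]
print(mean(max_view_scores))  # 9.8

# Scores cannot go below 0, so a mean of 0 requires every radiologist to score 0.
min_view_scores = [0, 0, 0, 0, 0]
print(mean(min_view_scores))  # 0
```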

So this is just to show that, in these different categories, generally the full range of scores was used. The fact that this number is less than 6 meets the criteria--the fact that each of these numbers is less than 6. One of the results that we were pleased about is the score being significantly below 5.

Even when you look at the range over all the reviewers, each reviewer scored at less than 5 and almost every view was scored less than 5 for the visibility of tissue at the skin line. One of the explanations for that is that the images on digital were thickness equalized.

An algorithm was applied to the digital images that eliminated the thickness differences of the breast and only presented tissue consistency differences in the breast and it made it much easier for the radiologist to see to the skin line compared to screen film, even with hot lighting which was available for any of the images.


We also did a subgroup analysis of the side-by-side results for different types of lesions among the 40 cancers. This is looking at the number of views. Some lesions had both a mass sign and a calcification sign so they may be double counted here, but this shows that where calcifications were present, the scores were similar to the mean score that we got for all lesions.

Really, there was no significant difference in the means for any particular type of lesion. The full range was used across these different types of lesions.


So, in conclusion, from the side-by-side analysis, we were able to show that, in a side-by-side comparison of screen film and hard-copy digital, the readers saw the conspicuity of lesions to be the same. They saw the same amount of tissue at the chest wall and were actually able to see the skin line much more easily with the digital presentation of the images.


So the study conclusions are that in both the reader studies, recall rates demonstrated fewer recalls with digital than with screen film. In both reader studies, the sensitivity of digital was comparable to that of screen film for the detection of breast cancer and, in both reader studies, the ROC analysis gave virtually identical ROC scores for the areas under the curve for digital compared to screen film.

In the side-by-side feature analysis, there were comparable lesion conspicuity and visibility of tissue at the chest wall with digital compared to screen film. Digital actually did significantly better for visibility of tissue at the skin line.


The final conclusions are that product labeling is consistent with the data presented in this PMA and the PMA, we think, presents a strong case for the safety and effectiveness of digital mammography for the detection of breast cancer, both for screening and diagnosis.


Let me just close by presenting a road map of where we go from here. What we have done so far is present the PMA data on the hard copy digital compared to screen film. Hopefully, with approval of hard copy digital, based on the data that have been presented, the next step will then be to go on and seek a soft copy--or perform a PMA supplement study that would validate soft copy digital by comparing soft copy presentation of digital images to hard copy presentation of digital images in a side-by-side comparison similar to the study that I presented here, comparing digital hard copy to film screen.

But this would be done in a side-by-side comparison of digital hard copy with digital soft copy. It would require at least 45 cancers, a total of 100 lesions and would be done by five qualified radiologists performing the side-by-side comparison.

Obviously, the manufacturers want to be able to use either hard copy or soft copy presentation of their digital images to be read by radiologists. So this would close the PMA, the premarket approval, step for soft copy. We have conferred with the FDA about the design of a postmarket study and we would like to at least present some idea of what the postmarket approval study might look like based on those discussions.

The design that has come up is the multiple-reader, multiple-case study, which would use ROC analysis like the multi-reader analysis presented in the reader studies here but would include more readers, somewhere between six and ten readers, and would include more cancers, all of them screening-generated cancers.

I think the concern is how digital will perform in the screening cohort and this postmarket study would collect cases only from a screening cohort, would collect at least 50 cancers and then at least three to four times that number of non-cancers, so somewhere between a total of 200 and 250 images.
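The totals quoted for the postmarket study follow directly from the minimum cancer count and the ratio of non-cancers to cancers, as this short Python sketch shows. The function name is illustrative.

```python
MIN_CANCERS = 50  # minimum number of screening-detected cancers quoted above


def total_cases(cancers: int, noncancer_ratio: int) -> int:
    """Cancers plus noncancer_ratio times as many non-cancer cases."""
    return cancers + cancers * noncancer_ratio


print(total_cases(MIN_CANCERS, 3))  # 200
print(total_cases(MIN_CANCERS, 4))  # 250
```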

And these six to ten readers would read both the digital and screen-film images with a sufficient time separation in between to avoid recall effects. Those results would then be analyzed with multi-reader ROC methods to eliminate the correlation among the readers and compare the ROC results in this multi-reader, multi-case approach.

I think the FDA will be talking more about that in their presentation as well. So I will stop here and thank you very much for your attention.

DR. GARRA: Thank you, Dr. Hendrick.

We are running just slightly ahead of schedule so we could take one or two questions from the panel about Dr. Hendrick's presentation. Dr. Smathers?

DR. SMATHERS: Ed, as I understand the sequence, the film screen was done first and then, using the same radiographic techniques, the digital mammography was taken.

DR. HENDRICK: Yes; that is exactly right.

DR. SMATHERS: Were any of the recalls in film screen due to inadequate exposure of the film, since that would prejudice that cohort to some extent?

DR. HENDRICK: No; that wasn't the reason for recall.

DR. SMATHERS: They were subtracted out or eliminated from the--

DR. HENDRICK: Yes; there was QC done on the quality of the screen-film images prior to the radiologist making the decision about whether it was a positive or a negative case. The recalls were only because they thought the women needed further evaluation to work up the findings.

DR. GARRA: Thank you.

Any other questions at this point?

DR. HARMS: Ed, what was the gold standard? How do you establish that? Is that biopsy, the size of the lesion? How was that determined?

DR. HENDRICK: The gold standard is the presence of cancer and that was determined by biopsy in the cases that got to biopsy through the diagnostic workup. There were, obviously, lots of cases that were read as normal on both modalities that didn't get the biopsy. The only way that we have to determine whether cancer occurs in those is to follow those women for at least a year after the study and see if cancer occurs.

So the study was conducted between October of '97 and January of '98, and follow up continues. But there was intense follow up through May of 1999 when MedTrials, who was monitoring this study, was collecting data and sort of hounding sites on a daily basis about, "Have there been any more cancers in the study group?"

That monitoring will continue but one of the things that we find in this kind of a study is that, because you are doing both modalities, and if one modality shows it to be suspicious, you are going to do something about it, there is better ascertainment of the presence of cancer than in the normal situation of just doing a single modality in these studies.

The ascertainment is not biopsy in every case, is the simple answer to your question, but biopsy plus follow up.

DR. GARRA: Any other questions? Some of us have questions but I think we are going to hold them until after we hear the FDA presentation. We will all have an opportunity to ask additional questions later on.

Thank you.

I think, at this point, what we are going to do is take a fifteen-minute break. It is now 10:15 and we will reconvene at 10:30 in the morning here.


DR. GARRA: Thanks everyone. We are now going to begin with the FDA presentations. The first speaker is going to be Jack Monahan who is the lead reviewer for this PMA.
