ROC Analysis
DR. WAGNER: Good morning, panel, sponsor and guests.
[Slide.]
Our sponsor did a multiple-reader ROC study. I would like to explain to you what these words mean and ask Dr. Toledano to indulge me. Dr. Toledano works at the frontier in this field.
Here is an outline of the presentation I have this morning. I will first talk about the ROC paradigm and the sources of variability that it controls for. And then I will just give you a quick flash of two classic papers on the variability in mammography. These papers explain a lot of the predicament we have been in for the last few years.
Then I will define what is meant by a multiple-reader, multiple-case ROC study; in the jargon of the land, many people just refer to this as a reader study, in the interest of saving three or four words. And, finally, we will get to the sponsor's multiple-reader, multiple-case ROC analysis.
[Slide.]
Here is a one-page ROC tutorial. The ROC paradigm was invented to accommodate the situation where you have two populations, one diseased, which is here, the cancer population, and another population that is non-diseased or the non-cancer population. You would like to be able to separate the two populations with some kind of a diagnostic test.
If you would think in terms of prostate cancer, this decision axis is -- for example, it could be a PSA assay, so that would be the test measurement. In diagnostic imaging, you don't have a nice scale and so people reporting in ROC analyses give a scale which is considered the probability of malignancy, the probability of disease or, in the jargon, the reader's subjective judgment of the probability or the likelihood that the case is a cancer.
So some people call this the probability of cancer, probability of malignancy or what have you. The idea is that you would like to separate the two populations and you would like to have a place where you could totally separate the cancers from the non-cancers. Of course, as in all real-world problems, those two populations overlap quite a bit.
If you put your cut at a certain point, then all of those cancers to the right of your threshold would be true positives but then there will always be some non-cancers that would leak past that threshold, so the people from the non-cancer population that leak past that threshold are the false positives.
If you try to be more aggressive and to catch more cancers, we all know that that means you have to pay with more false positives. That trade-off is just what is described by the ROC curve.
[Slide.]
What we don't see in the curve is the hidden parameter which is where you set the threshold. Where you set it is determined by what is called the reader's mindset or a level of aggressiveness.
So, as the reader gets more aggressive, you just move up the ROC curve and trace a figure something like I have shown here. When people start making measurements of sensitivity and specificity, you realize that it is going to be very expensive to pin down the sensitivity and specificity at every point in the curve.
So, early in the study what people frequently do is they summarize the ROC curve by the area underneath the curve. When you do that, you are essentially giving the sensitivity averaged over all specificities so you are essentially replacing the curve with a line at the level of the area under the curve. So you have just reduced a nice dataset to a simple average number.
A test that is guessing has an area under the curve of 0.5. A perfect test would come up and hug the corner and would have an area of 1.0. So that is the simple paradigm of what ROC is about.
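The tutorial above can be sketched in a few lines of Python. This is a minimal illustration, not anything the sponsor or FDA ran: the two score distributions, the function names `roc_point` and `roc_area`, and all numbers are assumptions chosen only to show how moving the threshold trades true positives against false positives, and how the area summarizes the whole curve.

```python
import random


def roc_point(cancer_scores, noncancer_scores, threshold):
    """Return (false-positive fraction, true-positive fraction) at one threshold."""
    tpf = sum(s >= threshold for s in cancer_scores) / len(cancer_scores)
    fpf = sum(s >= threshold for s in noncancer_scores) / len(noncancer_scores)
    return fpf, tpf


def roc_area(cancer_scores, noncancer_scores):
    """Area under the empirical ROC curve via the Mann-Whitney statistic:
    the fraction of (cancer, non-cancer) pairs in which the cancer case
    scores higher, counting ties as one half."""
    wins = 0.0
    for c in cancer_scores:
        for n in noncancer_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(noncancer_scores))


# Hypothetical, simulated ratings on a roughly 0-to-100 scale;
# these are not data from any study discussed in this transcript.
rng = random.Random(0)
noncancer = [rng.gauss(40.0, 15.0) for _ in range(500)]
cancer = [rng.gauss(60.0, 15.0) for _ in range(500)]

# A more aggressive reader uses a lower threshold: both fractions rise.
for threshold in (70, 50, 30):
    fpf, tpf = roc_point(cancer, noncancer, threshold)
    print(f"threshold={threshold}: TPF={tpf:.2f}, FPF={fpf:.2f}")

print(f"ROC area = {roc_area(cancer, noncancer):.2f}")
```

With identical score distributions the area comes out near 0.5 (guessing), and with fully separated distributions it is 1.0, matching the two limits described above.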
[Slide.]
Now let me move on to two of the classic papers on variability and ROC analysis. The first classic paper is from Joanne Elmore and company who studied ten radiologists not randomly selected. That is the point of this overhead. She reported on the wide range of patient management decisions on this number, nine cancers, and that number of cancers.
If you just look at that range of performance, it looks like it is all over the map. But Carl D'Orsi and John Swets came along and took Joanne's data and showed that, at least for these ten radiologists, their performance straddled a model ROC curve. So what you were seeing in the variability seen in her study was a rather homogeneous range of reader skill level because a low skill is here, a really good skill is up here.
So this is a rather homogeneous level of reader skill. What we are seeing is a difference in the mindset or the level of aggressiveness of those readers. One of the problems for agreement studies is that they would not control for that.
Enough on the Elmore study and its interpretation.
[Slide.]
Craig Beam, Peter Layde and Dan Sullivan went to great effort not just to select some readers but to select over 100 readers randomly from the population across the country. When you keep score in the same way, based on the recommendation for biopsy or not, if you were to look at ROC space, you would have a true positive, false positive. If you look in sensitivity space, you have the complement.
So Craig Beam put his data out in this way. Here we now see that the radiologists' performance is really all over the map. This is one of the most celebrated figures. In fact, I have xeroxed this so many times it has come out as a parallelogram, as you can see. It is a very popular figure.
So, here, we see among these 108 radiologists operating on these samples not only quite a range of variability of their mindset or level of aggressiveness, we also clearly see that there is a range of level of reader skill. I say "clearly;" I did not do this analysis. Craig Beam later came along and showed that this spread of performance is not consistent with the finite sample statistics of one single ROC curve. There really is a range of ROC curves.
[Slide.]
One more wrinkle I have to put you through before we go on to the sponsor's results. If you had one diagnostic test -- we are here, today, to compare two diagnostic tests. If you had one diagnostic test, you might get these two populations, schematically, and you might have another diagnostic test in which they are slightly different in the way the two populations overlap. I failed in doing that nicely.
But, to analyze the comparison of two modalities, you have to put the problem into two dimensions. Now, the new dimple, if you will, is that you now see the correlation between the two tests. The egg shape of the cancer population and the egg shape of the non-cancer population is a measure of the correlation of patients across tests.
If you just squeeze the cancer population into a cigar, in that case, we have what we call 100 percent correlation across modalities. That would mean that, with this as a cigar cloud, you could just go up from the probability of malignancy in the one test and get the probability in the other test.
But we know that, in the real world, as has been discovered in comparing digital to conventional mammography, this correlation is not high, particularly, as Dr. Lewin suggested in the Diagnostic Imaging article, because of repositioning, at least for many of the modalities.
When you take the patient out of the room and into another room, repositioning, there is enough variability there that these clouds are not 100 percent correlated. In fact, you would know from following the literature and the public discussions, at least some of the ones we have had at the National Cancer Institute, that that correlation is less than 0.5.
So that is another problem for agreement studies.
[Slide.]
Now I will define what all those words meant up front. The multiple-reader, multiple-case ROC paradigm means the following: it means every reader reads every case and, where possible, reads every case in both modalities. When you do that, you can actually start to enter a more complicated world than we had up front.
Now you can start to do what is called multivariate ROC analysis. What that means is that you start to account for the variance due to the range of case difficulty in the patients and its finite sampling. You can get some feel for the variance due to the range of reader skills and its finite sampling.
You can get a feel for those egg shapes that I talked about a moment ago, the correlation of the case variance across modality and the correlation of reader variance across modality. You would do that with some other egg figures that look something like what I just showed you.
And then there is something that is called within-reader variability or reader jitter. When you ask the reader to give the probability of malignancy and the reader says, "Well, that is like 70 percent," and you come back a month later, it could be 30 percent. We call this radiologist jitter. I think Dr. Elzeraki or Dr. Destouet last time said that perhaps this could be called a rumble. This is not a subtle effect.
Most models actually involve more parameters than we have here but you can get a good feel from the ones I just mentioned. Now, collecting data in this format, you can use software that is available on the web from the University of Chicago. Dr. Toledano has developed software to solve this problem and our own group, Sergey Beiden, Greg Campbell and myself, have an algorithm and a paper on that. If anyone is interested, we can tell you how we solved this problem.
[Slide.]
Again, before we get to the sponsor's results, I just want to give you a feel for how the various variances play out. We are interested in comparing two modalities and so we will compare the difference in ROC areas between the two modalities.
We saw earlier today that the ROC curves, and I am talking about reader study 2, here, lie right on top of one another essentially for the two modalities, digital and analogue in that study, but there is something called sampling statistics. What is the sampling variability? How uncertain are we about the areas under the ROC curves?
Answering that is actually not a trivial problem. The uncertainty in the difference in the ROC areas between two modalities has three pieces. I have it in the unexpurgated version. The panelists have this in your notes. You can study it on the plane tonight, if you like.
I have translated it and I have written "i.e.," here. Every school child today knows that "i.e." is Latin for "in English." In English, what contributes to your uncertainty and your ability to see the difference between two modalities has three pieces.
It has a piece that is inversely proportional to the number of cases. This is the piece that most people carry around in their gut, but that is not the whole story. There is a second piece that is inversely proportional to the number of readers, as you might expect, if you are going to start to average readers together.
And then there is a third piece that is within-reader jitter or any remaining lack of experimental reproducibility. That scales inversely with the product of the number of cases and the number of readers.
There is something really important for these first two terms, which is the uncorrelated part of the case variance. I am going to put you through a little exercise for a minute to explain what we mean by that.
Picture that you were on a shore and you have a laser and you are trying to measure the height of a mast on a ship. Suppose that ship is in very choppy waters. Well, you might think, at first, it is going to be very difficult to measure the bottom of the mast because it is very noisy and it is going to be very difficult to measure the top of the mast. But those two are 100 percent correlated.
So, actually, with a good enough laser, you can measure the height of that mast perfectly, almost perfectly, until it starts to get choppy and there are other sources of noise. That would enter, then, a random component. So what happens in these models is that you only generate uncertainty from the uncorrelated part. The correlated part is in your favor.
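The mast picture corresponds to a standard identity for the variance of a difference. As a sketch, let $T$ and $B$ be the measured positions of the top and bottom of the mast (symbols introduced here for illustration, not taken from the slides), with standard deviations $\sigma_T$, $\sigma_B$ and correlation $\rho$:

```latex
\operatorname{Var}(T - B) = \sigma_T^2 + \sigma_B^2 - 2\rho\,\sigma_T\sigma_B ,
\qquad\text{so for } \sigma_T = \sigma_B = \sigma:\quad
\operatorname{Var}(T - B) = 2\sigma^2\,(1 - \rho).
```

With $\rho = 1$ the variance of the difference vanishes however large $\sigma$ is; only the uncorrelated fraction $(1 - \rho)$ contributes.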
So what I am trying to say to you is even though the reader variability can be very great, as it is in mammography, if the boats that the readers are on rise together, if the readers' digital and analogue rise together, they are pretty highly correlated and you may not have to pay an awful lot for that term.
That was the reader term. The same thing for the case term. But, remember, this is only of the order of 0.5 and this is the last term that comes in like the product.
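The three pieces described above can be written as a schematic variance budget. The function below is only an illustration of the scaling stated in the talk (one term over the number of cases, one over the number of readers, one over their product, with the first two driven by the uncorrelated part). The function name, argument names, the factor of 2, and any inputs are assumptions; this is not the published model or the Chicago software, which involve more parameters.

```python
def schematic_variance_of_difference(n_cases, n_readers,
                                     case_var, case_corr,
                                     reader_var, reader_corr,
                                     jitter_var):
    """Schematic variance of the difference in ROC areas between two modalities.

    case_corr and reader_corr are the across-modality correlations; only the
    uncorrelated fraction (1 - correlation) of each component contributes.
    The factor 2 reflects a difference of two equally variable quantities
    (an assumption of this sketch).
    """
    case_term = 2.0 * case_var * (1.0 - case_corr) / n_cases
    reader_term = 2.0 * reader_var * (1.0 - reader_corr) / n_readers
    jitter_term = 2.0 * jitter_var / (n_cases * n_readers)
    return case_term + reader_term + jitter_term
```

In this sketch, a large reader variance costs little when the reader correlation across modalities is close to 1, which is the point of the boats-rising-together remark, while the jitter term shrinks only as the product of cases and readers grows.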
[Slide.]
Finally, let's get to the sponsor's study. You heard from Dr. Bushar just a few moments ago about this cohort. This is the reader study No. 2. There were 44 breasts with cancer, no known bilateral cancer. There were five readers, five MQSA-qualified radiologists. All cases were imaged with both modalities.
Here is the essence of what a multiple-reader, multiple-case study is. All readers read all images from both modalities. The readers in the study used what we call the quasi-continuous scale. They used the range from 0 to 100 for the probability of malignancy. That is sort of their test measurement readout scale, sort of analogous to using a diagnostic clinical test.
The readings in digital and analogue were separated by 30 days to minimize the memory effect and there was a balance of the reading that you heard about earlier. Half of the cases were read digital first and half of them analogue first to try to minimize two other learning and memory sources of bias.
Now, I am going to give you the sponsor's results in two pieces; first, the easy piece that you heard about a little while ago. We are thinking now of the individual readers and uncertainties based on just readers one at a time. In a minute, we will put all the readers together. But, one at a time -- let me just say this again.
We are going to average all five readers' ROC areas. When we do this for analogue, that is, film screen, the area was 0.77 on the average. The digital, on the average, was 0.76. It is actually closer than 0.01 there. I am going to ignore that difference for a moment because the readers' uncertainty was of the order of 0.1, ten times that, so you didn't see that on those average curves we showed earlier.
So that is what some of the work going on here is about; do we want to live with a 0.01 uncertainty? If you took the five readers' error bars, they move around 0.1. If you average them together, the mean 95 percent confidence interval about that difference is plus-or-minus 0.11. So that is a question for society, whether this level of uncertainty -- how it strikes us.
When you go to the multiple-reader analysis, now, the idea here is you would think you could just average all these scores together and you ought to be able to get the error bars for the average reader's ROC area. That turns out not to be an easy problem. That is why I went through that exercise.
People worked on that problem for a number of years, including ourselves and one of our panelists. But when you do that and, in this case, when you use the University of Chicago software, and it is available on the web, now the 95 percent confidence interval about the difference has been narrowed. It is down to plus-or-minus 0.064. Now we are starting to zero in on some kind of precision estimate of this difference.
We reproduced the sponsor's study with the Chicago software. We got the same results. We have our own software and I have some information on that in a paper if people are interested in how we do it. Our group has developed an independent algorithm. When we do the problem, we get plus-or-minus 0.068, almost the same result.
Another nice feature of our treatment is that we can tease out all those components of variance that I mentioned before, with some uncertainty, but this is what you do in a pilot study. You can look at that data and say, what do these results and the components of variance say about the size of a larger study that tried to narrow the error bars.
For example, if you wanted to narrow those error bars to plus-or-minus 0.05, here are the combinations that you would need if the patients you are about to sample from look like the patients they studied in the pilot study. You can see that, with 44 cancers and ten readers up to 59 cancers and five readers, with our current estimates, you could get the error bars down to about plus-or-minus 0.05.
Now, suppose people are uncomfortable with that. We are talking post-approval now. Suppose people are uncomfortable with that and they said, "We would really like to get it down to 0.01 or 0.02 or 0.03." If you tried to cut that to 0.03, the numbers go up very quickly. Now you need 78 cancers and 100 readers, or 100 cancers and 20 readers, especially when you realize that you get about five cancers per 1000 screened.
This study, then, is 10,000; is that right? And this study would be 20,000 people screened. So you can see how prohibitive these studies would become. But, perhaps, this study is within reach.
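The screening volumes quoted here follow from the rate of about five cancers per 1000 women screened. A minimal sketch of that arithmetic, using the cancer counts mentioned in the talk (the function name `women_to_screen` is hypothetical):

```python
import math


def women_to_screen(cancers_needed, cancers_per_1000=5.0):
    """Approximate number of women to screen to accrue a given number of cancers."""
    return math.ceil(cancers_needed * 1000.0 / cancers_per_1000)


for cancers in (44, 59, 78, 100):
    print(cancers, "cancers ->", women_to_screen(cancers), "women screened")
# 44 cancers -> 8800 women screened
# 59 cancers -> 11800 women screened
# 78 cancers -> 15600 women screened
# 100 cancers -> 20000 women screened
```

So the plus-or-minus 0.05 designs sit near the 10,000 figure, and the 100-cancer design corresponds to the 20,000 figure.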
[Slide.]
So, in conclusion, the individual reader studies bring the error bars to the neighborhood of 0.1. The multiple-reader study cut that to about 0.06 to 0.07. We showed what you could do to get it down to 0.05 if one would like, and the panelists have the references, and the last two references have a star; one is the Chicago software and the other is a paper written by my colleagues and myself.
Thank you very much.
DR. GARRA: Thank you, Bob. I am glad you provided me with some reading for the flight home tonight. I will have to read that, but I am going to be drinking at the same time, so I don't know.
We would like to go on to the next speaker which is Bill Sacks. He is going to be talking about labeling review and the proposed postmarket study.
Labeling Review and the Proposed Postmarket Study
DR. SACKS: Good morning, everyone.
[Slide.]
I am a radiologist and an ex-physicist, although I don't know if there is such a thing as an ex-physicist, with the Office of Device Evaluation in the Radiology Branch. I will be discussing three items.
[Slide.]
First, I will say a little bit in addition to what Ed Hendrick told you about the side-by-side comparison. Secondly, I will go into the error bars that we feel should be included in the labeling. And, thirdly, I will go into the company's proposal for their post-approval study.
[Slide.]
With regard to the side-by-side feature comparison, as Ed explained, it was based on 40 cases with cancer, biopsy-proven cancer, in which the radiologists had in front of them at the same time the digital mammography and the film-screen mammography on the same woman, and they were asked a series of three questions to judge these with respect to the conspicuity of the cancers which were marked on the films; secondly, the question of inclusion of tissue near the chest wall; and, thirdly, visibility of tissue near the skin line.
[Slide.]
That is a particular selection of features that were compared by the company. They are not the only ones that can be compared. This is just an example of two others. There are many. One might compare the ability to discriminate between benign and malignant calcifications. One might compare the ability to detect fine marginal irregularities of masses which would also relate to the question of whether they were benign or malignant. And there are others that one could come up with.
One of the things about a side-by-side feature analysis, of course, as anybody who has looked at these knows, is that it is impossible to hide which is the analogue and which is the digital mammography. There are certain appearances which indicate to you which is which.
Since this is not a blinded study, a certain amount of subjective bias can come into play. Just bear that in mind as we talk about this.
[Slide.]
This is the Likert scale, the same picture that Ed showed. I just have it filled in with the points here. And bear in mind that when you are looking at this side-by-side pair of mammograms on the same woman, if you feel that the digital is better with respect to the index that you are looking at, you give it a lower number. If you feel that analogue is better, you give it a higher number.
The extremes can either represent not visible at all on the other film or simply much better seen. Clearly, 5, being right smack in the middle, means that, well, as far as I am concerned, each one is as good as the other.
[Slide.]
These are the results in tabular form. I will show them in graphic form in a second with respect to these three indices. The conspicuity of cancer -- these are figures that you have seen before, today. The average was about 5.17, meaning just slightly to the side of analogue or film screen being better.
As far as inclusion near the chest wall, again, it is very close to 5. With visibility near the skin line, it is actually much closer to 0 which is in digital's favor, as you will remember from that scale. I will talk about these ranges, but it is easier on the next slide because this is the same information in pictorial form.
[Slide.]
The range of each of these lines is the range of readers. It has all been averaged over -- each reader had their readings averaged over all 40 cancers -- so that with respect to conspicuity of cancer, there was one reader down here. There was one reader up here. And the other three fell in the middle. That is all that means. And the average came out to be just barely above 5.
Again with inclusion of tissue near the chest wall, there was one reader down at that extreme, another at that extreme, because these are ranges. These are not standard deviations or anything. It is just the range over the five readers.
And then with visibility near the skin line, again, there was one that was way down here, one that was here and, on average, they came out at 2.95, as we saw. It is significant that all of the readers felt that, on average, the films were better on digital with regard to visibility near the skin line which would surprise nobody who has ever looked at a mammogram. They tend to be very dark near the skin line and the dynamic range that you have seen in both the company and the FDA's presentation on the physics shows the tremendous dynamic range that digital has. That is one of its major advantages over analogue.
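The bars on this slide are plain ranges over readers, not standard deviations. A small sketch of that summary follows. The scores are invented placeholders (three cases instead of 40) and the function name is hypothetical; only the procedure follows the description: average each reader over the cases, then report the lowest reader, the mean of the readers, and the highest reader.

```python
def summarize_readers(scores_by_reader):
    """Return (lowest reader mean, mean of reader means, highest reader mean)."""
    means = [sum(scores) / len(scores) for scores in scores_by_reader.values()]
    return min(means), sum(means) / len(means), max(means)


# Hypothetical 0-to-10 Likert scores (lower favors digital); NOT study data.
hypothetical_scores = {
    "reader_1": [2, 3, 4],
    "reader_2": [3, 3, 3],
    "reader_3": [1, 2, 3],
    "reader_4": [4, 4, 4],
    "reader_5": [2, 3, 4],
}

low, average, high = summarize_readers(hypothetical_scores)
print(low, average, high)
```

Because the endpoints are single readers, such a range says how far apart the extreme readers were, not how precisely the average is known.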
[Slide.]
This can be broken down, again data that you have seen, with regard to the particular sign of cancer; that is, whether it is calcifications. Here this is broken down even farther than you saw before. This is whether calcifications were present or whether calcifications were the primary way that this cancer was identified, and so on.
Again, it is striking that all of these are just above 5 but, essentially, right in the middle. The range -- now, this is not a range from one reader to the next but a range from one view to the next averaged over the five readers -- in other words, there was one mammogram, at least, and again this is a range; this is not a plus-or-minus standard deviation or confidence interval.
There was at least one mammogram down here where none of the radiologists could see it on the analogue film at all. The only way you can average 0 is to have every one of them essentially giving you a 0, particularly when you are just using integral -- well, maybe one may have gone as high as 1, but basically, they all thought that the digital was much better and, probably, that represented a case where it wasn't visible on the analogue.
You have got another mammogram, at least one, that was at the high end where they thought the analogue showed the cancer much better, or the calcifications, in this case. Similarly, as you go down masses, the range goes from fairly low to fairly high, again architectural distortion, fairly low to fairly high.
That means that there were some mammograms in which the cancer was much better seen by this sign on the digital and others in which the cancer was much better seen on the other.
Perhaps, this range can be explained by what was found in John Lewin's article in the November Diagnostic Imaging that Dr. Wagner mentioned, in which he pointed out that repositioning accounts for two-thirds of the variation between the way the analogue and the digital look. That suggests that if you were to repeat every woman's analogue mammogram, even if digital didn't exist, you would pick up quite a bit more cancer.
So if everybody came in to get a mammogram once a year, if you did two copies of each view, the sensitivity might go from roughly 80 percent, as is said for mammography, up to maybe 90 percent which is about the same advantage you get if you use a second reader on one set of films, and so on.
There has been a paper in the literature recently -- Dr. Kopans was one of the authors -- modeling what would happen if you did mammograms more frequently, like every six months or every three months, and so on, and the sensitivity goes tremendously close to 100 as you get down towards every three months.
One comment I would make is in answer to something that was said earlier today, that double exposing women in these trials may not be ethical. If you take into account what I just said, there is a definite benefit to that risk of everybody having both an analogue and a digital mammogram at the same time because they are read and they do, in this kind of a trial, determine the woman's follow up and care.
[Slide.]
The second issue that I want to discuss is what error bars should go in the labeling.
[Slide.]
Now I, like Dr. Bushar, am showing only the data from reader study No. 2. There are the three indices that we feel are important here; that is, the area under the ROC curves for FFDM, which means full-field digital mammography, and for screen-film mammography, and then sensitivity and specificity with regard to the dichotomous decision, does this woman need to come back for anything based on these four views, which, as Dr. Hendrick explained, were looked at as though this was a screening mammogram with no other information.
One can deal with sensitivity and specificity at later stages, as we will see in a minute. Almost half of the mammograms were read as BIRADS 0 which means "needs further imaging" to the point where I can't even make a decision whether to assign this a BIRADS 1, 2, 3, 4 or 5. Once you do that further imaging and you make that determination, then you could do sensitivity at other cutpoints; for example, at the BIRADS 3 cutpoint which would be where a woman can either come back next year, there is nothing wrong at all, or, if they are 3 or above, that means, well, there is low probability of malignancy but I want to see her again in six months and repeat the mammogram.
That is one possible cutpoint. The next reasonable cutpoint would be at the BIRADS 4 level which would make the decision between, this woman I am recommending for a biopsy, versus, she doesn't need a biopsy. So, in this particular study, sensitivity and specificity were measured only at the original four-view reading, which was treated as though it were a screening study and, therefore, was separated just into negative and positive with respect to, does anything else need to be done even as minor as having her come back next Tuesday for a repeat, say, spot magnification view.
Now, given that, the ROC area for digital was 0.758 and 0.767 for film screen which gives a very small difference. We have seen these figures before. The error bars on this -- I have highlighted the worst case. This is what we feel needs to go into the labeling. Based on the data here, and the numbers of women involved, the numbers of cancers, the numbers of non-cancers, this point estimate for the difference, which makes it look trivial, actually could be as bad as 0.07 less or it could be as good as 0.05 better for digital.
A wide range like that means that, of course, a point estimate can be very misleading. As a non-statistician, you would like to ignore error bars and the rest of that complication and just look at point estimates, but the fact of the matter is that this information is compatible, at the usual standard of significance, which even here is an arbitrary cutpoint, with an area under the digital ROC curve that is as much as 0.07 below that of the ROC curve for screen film.
With sensitivity, if you look at the point estimates, digital was 68.18, the screen film, 69.55 sensitivity with a difference of 1.3 which looks trivial but, again, because of small numbers -- sensitivity always deals with the cancers and specificity always deals with the non-cancers, in this case disease or non-disease -- the small number of cancers, only 44 cancers in reader study No. 2, gives a fairly wide range.
What that means is that, while these point estimates look close enough to say, "oh, well, that is no problem; they are obviously equivalent," the fact of the matter is they are compatible with a sensitivity for digital that is as much as almost 10 percent lower than that of screen film.
Now, one has only to think about the fact that 25 million women are screened each year in the United States with about 180,000 cancers found in the last few years, anyway, each year, and 10 percent smaller sensitivity can mean a lot of cancers missed.
On the other hand, of course, it is also compatible with digital's possibly being 7 percent higher. So anywhere in between there, the fact is, we just don't know, based on these figures, and bear that in mind as we go on to the next topic which will be the post-market study.
With regard to specificity, again, digital and analogue were very close, differing by 1.89, although, in this case, digital was better. It had a better specificity; that is, it had a lower recall rate for the non-cancers by a small amount. Here, the worst-case scenario is that it could have had a specificity only 0.58 less than that of the analogue. So this is, actually, a better range.
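To see why 44 cancers leave sensitivity so uncertain, a rough binomial calculation helps. This sketch uses the simple normal-approximation (Wald) interval for one reader on one modality. It is an assumption-laden illustration and not the sponsor's or FDA's method: the intervals quoted above are for the paired, multi-reader difference, so the numbers are not directly comparable. The function name is hypothetical.

```python
import math


def wald_half_width(proportion, n, z=1.96):
    """Half-width of an approximate 95 percent interval for a binomial proportion."""
    return z * math.sqrt(proportion * (1.0 - proportion) / n)


# Sensitivity near 68 percent estimated from 44 cancers (single reader, one modality).
print(f"+/- {wald_half_width(0.68, 44):.3f}")       # about +/- 0.138

# Four times as many cancers halves the width.
print(f"+/- {wald_half_width(0.68, 4 * 44):.3f}")   # about +/- 0.069
```

The width shrinks only with the square root of the number of cancers, which is one reason a larger screening-based study is needed to tighten these estimates.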
[Slide.]
Finally, in talking about the post-approval study proposal, the first thing I want to talk about, just to summarize, is why the FDA is requiring a post-approval study on a PMA such as this one.
There are two broad reasons. One is the modest size of the study in this PMA which, as I just showed you, gives fairly broad confidence intervals on the difference between digital and screen film, in particular with respect to ROC area and sensitivity, which are two very important issues. Secondly, and possibly even more important, a study that is performed in part on a diagnostic cohort, and this was primarily a diagnostic cohort, although the cancers were almost equally drawn from a separate screening study and a diagnostic study, introduces a potential bias, let's say, of case mix towards larger, more advanced cancers.
For example, women who come to a diagnostic clinic come for one of two broad reasons, either because they have some symptom, such as a palpable lump which, I think, is most of them, maybe a nipple discharge, something like that, but a large number of them have a palpable lump.
In order for a lump to be palpable, it already has to be about a 1-centimeter size cancer and that already takes it out of the range of the kinds of things that mammography can be the first to detect down at the 1 to 2 millimeter range. So there is a bias towards larger, more advanced cancers.
It may not test digital's ability with respect to the smaller, earlier, more curable cancers, although I will show in the data, in a minute, that it was surprisingly well distributed.
[Slide.]
First of all, let's just talk about the distribution of the cancers with regard to the BIRADS categories -- not the cancers, but all of them. I am just going to base this on the analogue. It would be very similar if I did it just on the digital readings.
This is the BIRADS category, 1, 2, 3, 4, 5, 0. The first column here is the distribution with regards to these BIRADS categories of the analogue mammograms in the PMA by their category. What this means is that 50 percent of them were in the BIRADS 1 and 2 category. 47 percent, almost the other half, were in the BIRADS 0 category which I mentioned a minute ago. And there was a scattering, a small number, in the BIRADS 3, 4 and 5 category.
Just to show the difference between what is partially a diagnostic population and a screening population, and to give a sense of the bias here, I have some figures. There is, perhaps, a wider range than I have given here, but these are fairly representative figures from a couple of papers. In a screening population, you can expect that about 90 to 93 percent, somewhere in that range, will be BIRADS 1s and 2s.
The initial assignment of BIRADS 0s will range, depending on the center -- some go as low as 5 percent, perhaps, some as high as 15 percent -- but somewhere in the 8 to 12 percent range is what you will get in a screening population. But ultimately, once the woman does come back for further imaging, whether it be extra mammographic views and/or ultrasound, every one of them will be redistributed among the 1, 2, 3, 4, 5 categories.
So these numbers actually represent that final distribution, after they have been through the added evaluation. You get figures for BIRADS 3 that are on the order of 3 to 4 percent of the total. For BIRADS 4, about another 3 to 4 percent and for BIRADS 5, maybe 0.3, 0.5 percent, somewhere in there. This just gives you a flavor for the figures.
You can see that the BIRADS 5 category, even in the study as it was, is fairly close to the range that you will get. The BIRADS 1 and 2 category was only about half. The BIRADS 3 and 4 could be considered the more difficult mammographic cases -- these are the subtle ones where, gee, you don't know quite what to do. You are always sitting there thinking, do I need her to come back in six months or should I recommend a biopsy?
For a lot of the women on whom you recommend a biopsy, you know there is a fairly low probability of malignancy, but it is high enough that you really don't want to risk waiting six months. These are the more difficult cases. You can see that they are underrepresented in this partially diagnostic cohort by a factor of maybe 3 or 4.
So this is one of the biases that is introduced by using a partially diagnostic cohort which is why we want to see, in a postmarketing study, a study done in a screening population.
[Slide.]
We actually are able to break down the 44 cancers that were included in reader study No. 2 into those that were derived from the diagnostic cohort which was 24 of them and those that were derived from the screening cohort which was 20 of them with respect to size.
As you can see, the numbers here are small, but two-thirds of the diagnostic cancers were greater than a centimeter in size and only 45 percent of the screening cohort were over 1 centimeter.
In fact, the difference between these numbers is not statistically significant if you do the appropriate tests. The error bars are very large because the numbers are small. It just gives you a flavor of the kind of trend that one might reasonably expect.
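As a rough check on the remark about significance, the comparison can be sketched with a two-sided Fisher exact test. The transcript does not say which test was used, so the choice of test is an assumption. The counts are also assumptions, back-calculated from the quoted fractions: two-thirds of 24 diagnostic cancers is 16, and 45 percent of 20 screening cancers is 9. The function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def pmf(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / total

    observed = pmf(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is at least as unlikely as the observed one.
    return sum(pmf(k) for k in range(k_min, k_max + 1)
               if pmf(k) <= observed * (1 + 1e-9))

# Assumed counts derived from the quoted fractions, not taken from the PMA:
# diagnostic cohort: 16 of 24 cancers over 1 cm; screening cohort: 9 of 20.
p_value = fisher_exact_two_sided(16, 8, 9, 11)
print(f"two-sided p = {p_value:.3f}")
```

With cohorts of 24 and 20, even a gap of roughly 20 percentage points falls well short of the conventional 0.05 threshold, which is consistent with the speaker's point that the error bars are very large.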
[Slide.]
Another way to break these down is by the stage of cancer. Ed Hendrick showed you the figures that actually combined these two. He gave you the sum of the second and third row. If you break it out into the diagnostic cohort and the screening cohort, you can see, again, and I will preface this by saying, again, there is no statistical significance here in the difference between this second row and the third, again because the figures are small, the numbers are small.
But you get, again, a sense of the trend one would reasonably expect and that is if you look at stage III and IV, for example, you have got about 12 percent of the cancers here and, in the screening cohort, you have only got about 5 percent. Again, and I don't want to make too much of this, I am just trying to illustrate the fact that in a diagnostic cohort you might expect that kind of trend, that there would be a shift toward the higher stage cancers away from the lower stage, although if you look at the curable 0 and I stages, they are 58 percent of the diagnostic, 65 percent in the screening cohort, not very different.
As a matter of fact, there is a surprising similarity here. One would expect even more of a bias in a diagnostic cohort, but, again, the numbers are small and, again, this is one reason why we would like to see this in a screening population.
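To put rough numbers on how wide the error bars are for cohorts this small, a 95 percent Wilson score interval can be computed for the stage 0 and I proportions. This is only a sketch: the counts (14 of 24 diagnostic, 13 of 20 screening) are assumptions back-calculated from the quoted 58 and 65 percent and the cohort sizes given earlier, and the interval method is a choice made here, not one stated in the presentation.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumed counts, derived from the quoted percentages rather than the PMA.
diagnostic = wilson_interval(14, 24)   # about 58 percent stage 0 or I
screening = wilson_interval(13, 20)    # 65 percent stage 0 or I

for name, (lo, hi) in [("diagnostic", diagnostic), ("screening", screening)]:
    print(f"{name}: {lo:.2f} to {hi:.2f}")
```

Under these assumed counts each interval spans more than 30 percentage points and the two overlap heavily, so a 7-point difference in point estimates says very little on its own.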
Now, I just want to make one other point. Ed Hendrick showed you the figures. If you just look at the sensitivity -- and I don't have a slide on this because I haven't seen those figures before -- if you just look at the sensitivity on the stage 0 and I's, he showed a sensitivity that was about eleven or twelve points higher for the digital than for the analogue.
Certainly, those are the most important cancers for mammography to find. If you are finding the 0s and Is, you are able to cure. If you are finding the IIIs and IVs, the cure rate is much, much lower. So what we would like to see, and he pointed this out, is what the AHCPR, the Agency for Health Care Policy and Research, recommends: at least 50 percent of your cancers in these two lowest stages in any good screening study.
In fact, this was exceeded in this diagnostic cohort, a little more so in the screening cohort, and that is still to the good. But the point is that the numbers here are very small. To see 12 percentage points higher without seeing the error bars, again, is the question that we have: the error bars may be very broad and we don't know. While that point estimate may be encouraging, point estimates can always be misleading unless you have much tighter error bars.
[Slide.]
Finally, the proposed study design by the company in broad outline involves a screening population, as I have said is necessary. They do propose to double expose every subject to both analogue and digital mammography. This is very important to avoid a selection bias. Some studies have enriched by exposing everybody to analogue, then taking all of the, say, BIRADS 4s and 5s, or even 3s, 4s and 5s, and double exposing those, and only taking a random subselection of the 1s and 2s which, you will remember, were 90 to 93 percent, to sort of match that number.
When you do that, you don't give digital a chance to show that it can pick up the small cancers that the analogue happened to miss and assigned to BIRADS 1 and 2. So, by exposing every subject, you do avoid that kind of selection bias.
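The selection bias described here can be illustrated with a small expected-value calculation. Every number below is hypothetical and chosen only for illustration: 100 ground-truth cancers are split by which modality flags them, the two modalities are given identical true sensitivity, and the enriched design keeps every analogue-flagged case but only a fraction `f` of the cases analogue called BIRADS 1 or 2. The function name and the four-cell simplification are assumptions, not the sponsor's design.

```python
def apparent_sensitivities(both, analog_only, digital_only, neither, f):
    """Expected (analogue, digital) sensitivity among the cancers retained
    when analogue-negative cases are kept with probability f."""
    kept = both + analog_only + f * (digital_only + neither)
    analog = (both + analog_only) / kept
    digital = (both + f * digital_only) / kept
    return analog, digital

# Hypothetical split of 100 cancers; both modalities truly detect 70.
cells = dict(both=60, analog_only=10, digital_only=10, neither=20)

# f = 1.0 is the "double expose every subject" design: nothing is dropped.
full_a, full_d = apparent_sensitivities(**cells, f=1.0)
# f = 0.1 is an enriched design keeping 10 percent of analogue negatives.
enr_a, enr_d = apparent_sensitivities(**cells, f=0.1)

print(f"all subjects: analogue {full_a:.2f}, digital {full_d:.2f}")
print(f"enriched:     analogue {enr_a:.2f}, digital {enr_d:.2f}")
```

In this sketch the enriched design makes analogue look better than digital even though the two were constructed to be equal, because most of the cancers that only digital detects are discarded before digital is ever tried on them.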
In their analysis, they propose to show noninferiority, again, in these three important indices -- that is, ROC area, sensitivity and specificity. Of course, we will have to discuss further with them the question of at which cutpoints.
Now, I have hard copy written down here because that was, in fact, what was proposed in the hard copy of the PMA that we had. The company has already mentioned today, though, and we discussed with them a couple of weeks ago, or last week, I think, the idea that if a side-by-side comparison analysis of hard copy to soft copy is approved ahead of time, then there is no reason not to use soft copy in the postapproval study.
Finally, they propose to analyze all of the cancers that they find. Again, these are ground-truth cancers, cancers based on biopsy or a cancer turning up a year later through a year of follow-up, and only a random selection of the noncancers. Now, that does not introduce the selection bias that I described up here because you have already got the ground truth and you are selecting not on how the analogue looked but on whether the woman really has cancer or doesn't have cancer.
While there are some details yet to be worked out, in broad outline, this is an acceptable study design.
Thank you.
DR. GARRA: Thank you.
It is seven minutes to 12:00. I guess we can entertain -- if there are any clarification points that need to be made from the last several presentations, we can take a couple of questions on those. We will hold questions that deal with the substantial nature of the PMA until the discussion session after lunch.
Okay. Not seeing any panel members that want to ask any questions at this point, then what we will do is break for lunch at this point. We will do an hour for lunch and plan to be back here at about five minutes to 1:00.
Thank you very much.
[Whereupon, at 11:55 a.m., the proceedings were recessed to be resumed at 12:55 p.m.]
A F T E R N O O N S E S S I O N
[1:05 p.m.]
DR. GARRA: I would like to call the meeting back to order. I would remind the observers of the meeting that, while this portion of the meeting is open to public observation, public attendees may not participate unless specifically requested by the Chair.
We will continue the meeting with the panel's discussion of the PMA that will be led by Dr. Destouet. Judy, are you all set?
