<p>Xavablog, by Xavier Robin (<a href="https://xavier.robin.name/fr/contact">contact</a>)</p>
<h1>pROC 1.18.5 (2023-11-02)</h1>
<p>pROC 1.18.5 is now available on CRAN. It's a minor bugfix release:</p>
<ul>
<li>Fixed formula input when given as variable and combined with <code>with</code> (<a href="https://github.com/xrobin/pROC/issues/111">issue #111</a>)</li>
<li>Fixed formula containing variables with spaces (<a href="https://github.com/xrobin/pROC/issues/120">issue #120</a>)</li>
<li>Fixed broken grouping when <code>colour</code> argument was given in <code>ggroc</code> (<a href="https://github.com/xrobin/pROC/issues/121">issue #121</a>)</li>
</ul>
<p>You can update your installation by simply typing:</p>
<pre>install.packages("pROC")</pre>
<h1>Deep Learning of MNIST handwritten digits (2022-06-11)</h1>
<p>In this document I am going to create a video showing the training of the inner-most layer of a Deep Belief Network (DBN) using the MNIST dataset of handwritten digits. I will use our <code>DeepLearning</code> R package, which implements flexible DBN architectures with an object-oriented interface.</p>
<h2>MNIST</h2>
<p>The MNIST dataset is a database of handwritten digits with 60,000 training images and 10,000 testing images. <a href="https://en.wikipedia.org/wiki/MNIST_database" title="Wikipedia: MNIST database">You can learn everything about it on Wikipedia</a>. In short, it is the go-to dataset to train and test handwritten digit recognition machine learning algorithms.</p>
<p>I made an R package for easy access, named <code>mnist</code>. The easiest way to install it is with <code>devtools</code>. If you don't have it already, let's first install <code>devtools</code>:</p>
<pre>
if (!require("devtools")) {install.packages("devtools")}
</pre>
<p>Now we can install <code>mnist</code>:</p>
<pre>
devtools::install_github("xrobin/mnist")
</pre>
<h2>PCA</h2>
<p>In order to see what the dataset looks like, let's use PCA to reduce it to two dimensions.</p>
<pre>
pca <- prcomp(mnist$train$x)
plot.mnist(
  prediction = predict(pca, mnist$test$x),
  reconstruction = tcrossprod(
    predict(pca, mnist$test$x)[, 1:2], pca$rotation[, 1:2]),
  highlight.digits = c(72, 3, 83, 91, 6688, 7860, 92, 1, 180, 13))
</pre>
<p><img src="/files/blog/2022/06/11/pca.png" style="max-width: 100%"></p>
<p>Let's take a minute to describe this plot.
The central scatterplot shows the first two components of the PCA of all digits in the test set.
On the left-hand side, I picked 10 representative digits from the test set to highlight, which are shown as larger circles in the central scatterplot.
On the right are the "reconstructed digits", which were reconstructed from the first two dimensions of the PCA. While we can see some digit-like structures, it is basically impossible to recognize them.
We can see some separation of the digits in the 2D space as well, but it is pretty weak and some pairs cannot be distinguished at all (like 4 and 9).
Of course the reconstructions would look much better had we kept all the PCA dimensions, but so much for dimensionality reduction.</p>
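<p>The reconstruction step can be sketched on toy data: project onto the first two principal components, map back through the rotation matrix, and add back the center that <code>prcomp</code> subtracted. A minimal sketch with random data standing in for the MNIST pixels:</p>

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 20, ncol = 10)  # toy "images": 20 samples, 10 pixels
pca <- prcomp(x)
scores <- predict(pca, x)

# Reconstruction from the first two components only, as in the plot above
recon2 <- sweep(tcrossprod(scores[, 1:2], pca$rotation[, 1:2]), 2, pca$center, "+")

# Keeping all the components recovers the data exactly
recon.all <- sweep(tcrossprod(scores, pca$rotation), 2, pca$center, "+")
stopifnot(isTRUE(all.equal(unname(recon.all), unname(x))))
```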
<h2>Deep Learning</h2>
<p>Now let's see if we can do better with Deep Learning. We'll use a classical Deep Belief Network (DBN), based on Restricted Boltzmann Machines (RBM) similar to what Hinton described back in 2006 (Hinton & Salakhutdinov, 2006). The training happens in two steps: a pre-training step with contrastive divergence stochastic gradient descent brings the network to a reasonable starting point for a more conventional conjugate gradient optimization (hereafter referred to as fine-tuning).</p>
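<p>To make the pre-training step concrete, here is a minimal CD-1 weight update for a binary RBM in plain R. This is an illustrative sketch only (hypothetical function name, no biases, momentum or penalization), not the package's C++ implementation:</p>

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# One CD-1 step: compare data-driven and reconstruction-driven correlations
rbm_cd1_update <- function(W, v0, epsilon = 0.1) {
  h0 <- sigmoid(v0 %*% W)                                   # up pass (probabilities)
  h0.sample <- matrix(rbinom(length(h0), 1, h0), nrow(h0))  # sample hidden states
  v1 <- sigmoid(h0.sample %*% t(W))                         # down pass: reconstruction
  h1 <- sigmoid(v1 %*% W)                                   # second up pass
  W + epsilon * (crossprod(v0, h0) - crossprod(v1, h1)) / nrow(v0)
}

set.seed(42)
W <- matrix(rnorm(6, sd = 0.01), nrow = 3)  # 3 visible, 2 hidden units
v <- matrix(rbinom(12, 1, 0.5), nrow = 4)   # mini-batch of 4 binary visible vectors
W <- rbm_cd1_update(W, v)
dim(W)  # 3 2
```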
<p>I implemented this algorithm with a few modifications in an R package which is available on GitHub. The core of the processing is done in C++ with <a href="https://cran.r-project.org/web/packages/RcppEigen/index.html">RcppEigen</a> (Bates & Eddelbuettel, 2013) for higher speed. Using <code>devtools</code> again:</p>
<pre>
devtools::install_github("xrobin/DeepLearning")
</pre>
<p>We will use this code to train a 5-layer deep network that reduces the digits to an abstract, 2D representation. By looking at this last layer throughout the training process, we can start to understand how the network learns to recognize digits. Let's start by loading the required packages and the MNIST dataset, and creating the DBN.</p>
<pre>
library(DeepLearning)
library(mnist)
data(mnist)
dbn <- DeepBeliefNet(Layers(c(784, 1000, 500, 250, 2),
                            input = "continuous", output = "gaussian"),
                     initialize = "0")
</pre>
<p>We just created the 5-layer DBN, with a continuous, 784-node input layer (the digit image pixels) and a 2-node, gaussian output layer. It is initialized with 0, but we could have left out the <code>initialize</code> argument to start from a random initialization (Bengio <i>et al.</i>, 2007). Before we go on, let's define a few useful variables:</p>
<pre>
output.folder <- "video" # Where to save the output
maxiters.pretrain <- 1e6 # Number of pre-training iterations
maxiters.train <- 10000 # Number of fine-tuning iterations
run.training <- run.images <- TRUE # Turn any of these off
# Which digits to highlight and reconstruct
highlight.digits <- c(72, 3, 83, 91, 6688, 7860, 92, 1, 180, 13)
</pre>
<p>We'll also need the following function to show the elapsed time:</p>
<pre>
format.timediff <- function(start.time) {
  diff <- as.numeric(difftime(Sys.time(), start.time, units = "mins"))
  hr <- diff %/% 60
  min <- floor(diff - hr * 60)
  sec <- round(diff %% 1 * 60, digits = 2)
  return(paste(hr, min, sec, sep = ":"))
}
</pre>
<h2>Pre-training</h2>
<p>Initially, the network is a stack of RBMs that we need to <em>pre-train</em> one by one. Hinton & Salakhutdinov (2006) showed that this step is critical to train deep networks. We will use 1000000 iterations (<code>maxiters.pretrain</code>) of contrastive divergence, which takes a couple of days on a modern CPU. Let's start with the first three RBMs:</p>
<h3>First three RBMs</h3>
<pre>
if (run.training) {
  sprintf.fmt.iter <- sprintf("%%0%dd", nchar(sprintf("%d", maxiters.pretrain)))
  mnist.data.layer <- mnist
  for (i in 1:3) {
</pre>
<p>We define a <code>diag</code> function that will simply print where we are in the training. Because this function will be called a million times (<code>maxiters.pretrain</code>), we can use <code>rate = "accelerate"</code> to slow down the printing over time and save a few CPU cycles.</p>
<pre>
diag <- list(rate = "accelerate", data = NULL,
             f = function(rbm, batch, data, iter, batchsize, maxiters, layer) {
  print(sprintf("%s[%s/%s] in %s", layer, iter, maxiters, format.timediff(start.time)))
})
</pre>
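<p>One way such an accelerating schedule can work is to log only at powers of two, so a million iterations produce only about 20 messages. A minimal sketch for illustration; the package's actual schedule may differ:</p>

```r
# TRUE only for powers of two: a power of two has a single set bit,
# so iter AND (iter - 1) is zero
should_log <- function(iter) {
  iter <- as.integer(iter)
  iter > 0L && bitwAnd(iter, iter - 1L) == 0L
}
which(sapply(1:1000, should_log))  # 1, 2, 4, 8, ..., 512
```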
<p>We can get the current RBM, and we will work on it directly. Let's save it for good measure, as well as the current time for the progress function:</p>
<pre>
rbm <- dbn[[i]]
save(rbm, file = file.path(output.folder, sprintf("rbm-%s-%s.RData", i, "initial")))
start.time <- Sys.time()
</pre>
<p>Now we can start the actual pre-training:</p>
<pre>
rbm <- pretrain(rbm, mnist.data.layer$train$x,
                penalization = "l2", lambda = 0.0002, momentum = c(0.5, 0.9),
                epsilon = c(.1, .1, .1, .001)[i], batchsize = 100,
                maxiters = maxiters.pretrain,
                continue.function = continue.function.always, diag = diag)
</pre>
<p>This can take some time, especially for the first layers which are larger. Once it is done, we predict the data through this RBM for the next layer and save the results:</p>
<pre>
mnist.data.layer$train$x <- predict(rbm, mnist.data.layer$train$x)
mnist.data.layer$test$x <- predict(rbm, mnist.data.layer$test$x)
save(rbm, file = file.path(output.folder, sprintf("rbm-%s-%s.RData", i, "final")))
dbn[[i]] <- rbm
}
</pre>
<h3>Last RBM</h3>
<p>This is very similar to the previous three, but note that we save the RBM within the <code>diag</code> function. We could generate the plot directly, but it is easier to do it later once we have some idea about the final axis we will need. Please note the <code>rate = "accelerate"</code> here. You probably don't want to save a million RBM objects on your hard drive, both for speed and space reasons.</p>
<pre>
rbm <- dbn[[4]]
print(head(rbm$b))
diag <- list(rate = "accelerate", data = NULL,
             f = function(rbm, batch, data, iter, batchsize, maxiters, layer) {
  save(rbm, file = file.path(output.folder, sprintf("rbm-4-%s.RData", sprintf(sprintf.fmt.iter, iter))))
  print(sprintf("%s[%s/%s] in %s", layer, iter, maxiters, format.timediff(start.time)))
})
save(rbm, file = file.path(output.folder, sprintf("rbm-%s-%s.RData", 4, "initial")))
start.time <- Sys.time()
rbm <- pretrain(rbm, mnist.data.layer$train$x, penalization = "l2", lambda=0.0002,
epsilon=.001, batchsize = 100, maxiters=maxiters.pretrain,
continue.function = continue.function.always, diag = diag)
save(rbm, file = file.path(output.folder, sprintf("rbm-4-%s.RData", "final")))
dbn[[4]] <- rbm
</pre>
<iframe src="https://www.youtube.com/embed/3EapaWpDqGQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen style=" width: 100%; aspect-ratio: 16/9;"></iframe>
<p>If we were not querying the last layer, we could have pre-trained the entire network at once with the following call:</p>
<pre>
dbn <- pretrain(dbn, mnist.data.layer$train$x,
penalization = "l2", lambda=0.0002, momentum = c(0.5, 0.9),
epsilon=c(.1, .1, .1, .001), batchsize = 100,
maxiters=maxiters.pretrain,
continue.function = continue.function.always)
</pre>
<h3>Pre-training parameters</h3>
<p>Pre-training RBMs is quite sensitive to the use of proper parameters.
With improper parameters, the network can quickly go crazy and start to generate infinite values. If that happens to you, you should try to tune one of the following parameters:
<ul>
<li><code>penalization</code>: this is the penalty of introducing or increasing the value of a weight. We used L2 regularization, but <code>"l1"</code> is available if a sparser weight matrix is needed.</li>
<li><code>lambda</code>: the regularization rate. In our experience 0.0002 works fine with the MNIST and other datasets of similar sizes such as cellular imaging data. Too small or large values will result in over- or under-fitted networks, respectively.</li>
<li><code>momentum</code>: helps avoid oscillatory behavior, where the network oscillates between iterations. Allowed values range from 0 (no momentum) to 1 (full momentum = no training). Here we used an increasing momentum schedule which starts at 0.5 and grows linearly to 0.9, in order to stabilize the final network without compromising the early training steps.</li>
<li><code>epsilon</code>: the learning rate. Typically, 0.1 works well with binary and continuous output layers, and must be decreased to around 0.001 for gaussian outputs. Too large values will drive the network to generate infinities, while too small ones will slow down the training.</li>
<li><code>batchsize</code>: larger batch sizes will result in smoother but slower training. Small batch sizes will make the training "jumpy", which can be compensated by lower learning rates (epsilon) or increased momentum.</li>
</ul>
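<p>The increasing momentum described above can be pictured as a simple linear ramp over the training iterations. A hypothetical sketch of the interpolation (the package handles this internally when <code>momentum = c(0.5, 0.9)</code> is given):</p>

```r
# Linear interpolation of the momentum between the first and last iteration
momentum_at <- function(iter, maxiters, from = 0.5, to = 0.9) {
  from + (to - from) * (iter - 1) / (maxiters - 1)
}
momentum_at(1, 1e6)    # 0.5 at the start
momentum_at(5e5, 1e6)  # about 0.7 midway
momentum_at(1e6, 1e6)  # 0.9 at the end
```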
<h2>Fine-tuning</h2>
<p>This is where the real training happens. We use conjugate gradients to find the optimal solution. Again, the <code>diag</code> function saves the DBN. This time we use <code>rate = "each"</code> to save every step of the training. First, we have far fewer steps; second, the training itself happens at a much more stable speed than in the pre-training, where things slow down dramatically.
</p>
<pre>
sprintf.fmt.iter <- sprintf("%%0%dd", nchar(sprintf("%d", maxiters.train)))
diag <- list(rate = "each", data = NULL,
             f = function(dbn, batch, data, iter, batchsize, maxiters) {
  save(dbn, file = file.path(output.folder, sprintf("dbn-finetune-%s.RData", sprintf(sprintf.fmt.iter, iter))))
  print(sprintf("[%s/%s] in %s", iter, maxiters, format.timediff(start.time)))
})
save(dbn, file = file.path(output.folder, sprintf("dbn-finetune-%s.RData", "initial")))
start.time <- Sys.time()
dbn <- train(unroll(dbn), mnist$train$x, batchsize = 100, maxiters=maxiters.train,
continue.function = continue.function.always, diag = diag)
save(dbn, file = file.path(output.folder, sprintf("dbn-finetune-%s.RData", "final")))
}
</pre>
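<p>The <code>unroll(dbn)</code> call above mirrors the four encoder layers into a nine-layer autoencoder before training. Conceptually, a sketch of the layer sizes only (hypothetical helper; the real <code>unroll</code> also ties the decoder weights to the encoder):</p>

```r
# Mirror the encoder sizes, dropping the duplicated innermost layer
unroll_sizes <- function(sizes) c(sizes, rev(sizes)[-1])
unroll_sizes(c(784, 1000, 500, 250, 2))
# 784 1000 500 250 2 250 500 1000 784
```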
<iframe src="https://www.youtube.com/embed/wSfoZ_kMMTc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen style=" width: 100%; aspect-ratio: 16/9;"></iframe>
<p>And that's it, our DBN is now fully trained!</p>
<h2>Generating the images</h2>
<p>Now we need to read the saved network states back in, pass the data through the network (<code>predict</code>) and save the result as HD-sized PNG files.</p>
<p>The first three RBMs are simply loaded back into the DBN:</p>
<pre>
if (run.images) {
  for (i in 1:3) {
    load(file.path(output.folder, sprintf("rbm-%d-final.RData", i)))
    dbn[[i]] <- rbm
  }
</pre>
<p>The last RBM is where interesting things happen.</p>
<pre>
for (file in list.files(output.folder, pattern = "rbm-4-.+\\.RData", full.names = TRUE)) {
print(file)
load(file)
dbn[[4]] <- rbm
iter <- stringr::str_match(file, "rbm-4-(.+)\\.RData")[,2]
</pre>
<p>We now predict and reconstruct the data, and calculate the mean reconstruction error:</p>
<pre>
predictions <- predict(dbn, mnist$test$x)
reconstructions <- reconstruct(dbn, mnist$test$x)
iteration.error <- errorSum(dbn, mnist$test$x) / nrow(mnist$test$x)
</pre>
<p>Now the actual plotting. Here I selected <code>xlim</code> and <code>ylim</code> values that worked well for my training run, but your mileage may vary.</p>
<pre>
png(sub(".RData", ".png", file), width = 1280, height = 720) # hd output
plot.mnist(model = dbn, x = mnist$test$x, label = mnist$test$y + 1,
           predictions = predictions, reconstructions = reconstructions,
           ncol = 16, highlight.digits = highlight.digits,
           xlim = c(-12.625948, 8.329168), ylim = c(-10.50657, 13.12654))
par(family="mono")
legend("bottomleft", legend = sprintf("Mean error = %.3f", iteration.error), bty="n", cex=3)
legend("bottomright", legend = sprintf("Iteration = %s", iter), bty="n", cex=3)
dev.off()
}
</pre>
<p><img src="/files/blog/2022/06/11/rbm-4-final.png" style="max-width: 100%"></p>
<p>We do the same with the fine-tuning:</p>
<pre>
for (file in list.files(output.folder, pattern = "dbn-finetune-.+\\.RData", full.names = TRUE)) {
print(file)
load(file)
iter <- stringr::str_match(file, "dbn-finetune-(.+)\\.RData")[,2]
predictions <- predict(dbn, mnist$test$x)
reconstructions <- reconstruct(dbn, mnist$test$x)
iteration.error <- errorSum(dbn, mnist$test$x) / nrow(mnist$test$x)
png(sub(".RData", ".png", file), width = 1280, height = 720) # hd output
plot.mnist(model = dbn, x = mnist$test$x, label = mnist$test$y + 1,
           predictions = predictions, reconstructions = reconstructions,
           ncol = 16, highlight.digits = highlight.digits,
           xlim = c(-22.81098, 27.94829), ylim = c(-17.49874, 33.34688))
par(family="mono")
legend("bottomleft", legend = sprintf("Mean error = %.3f", iteration.error), bty="n", cex=3)
legend("bottomright", legend = sprintf("Iteration = %s", iter), bty="n", cex=3)
dev.off()
}
}
</pre>
<p><img src="/files/blog/2022/06/11/dbn-finetune-final.png" style="max-width: 100%"></p>
<h2>The video</h2>
<p>I simply used <a href="https://ffmpeg.org/">ffmpeg</a> to convert the PNG files to a video:</p>
<pre>
cd video
ffmpeg -pattern_type glob -i "rbm-4-*.png" -b:v 10000000 -y ../rbm-4.mp4
ffmpeg -pattern_type glob -i "dbn-finetune-*.png" -b:v 10000000 -y ../dbn-finetune.mp4
</pre>
<p>And that's it! Notice how the pre-training only brings the network to a state similar to that of the PCA, while the fine-tuning actually does the separation and makes the reconstructions accurate.</p>
<h2>Application</h2>
<p>We used this code to analyze changes in cell morphology upon drug resistance in cancer. With a 27-dimension space, we could describe all of the observed cell morphologies and predict whether a cell was resistant to ErbB-family drugs with an accuracy of 74%. The paper is available in Open Access in Cell Reports, DOI <a href="https://doi.org/10.1016/j.celrep.2020.108657" title="Deep neural networks identify signaling mechanisms of ErbB-family drug resistance from a continuous cell morphology space">10.1016/j.celrep.2020.108657</a>.
<h2>Concluding remarks</h2>
<p>In this document I described how to build and train a DBN with the <code>DeepLearning</code> package. I also showed how to query the internal layer, and use the generative properties to follow the training of the network on handwritten digits.</p>
<p>DBNs have the advantage over Convolutional Networks (CN) that they are fully generative, at least during the pre-training. They are therefore easier to query and interpret as we have demonstrated here.
However, keep in mind that CNs have demonstrated higher accuracies on computer vision tasks, including on the MNIST dataset.</p>
<p>Additional algorithmic details are available in the <code>doc</code> folder of the DeepLearning package.</p>
<h2>References</h2>
<dl>
<dt>Our paper, 2021</dt>
<dd>Longden J., Robin X., Engel M., <i>et al.</i>
<a href="https://doi.org/10.1016/j.celrep.2020.108657">Deep neural networks identify signaling mechanisms of ErbB-family drug resistance from a continuous cell morphology space</a>. <i>Cell Reports</i>, 2021;34(3):108657.</dd>
<dt>Bates & Eddelbuettel, 2013</dt>
<dd>Bates D, Eddelbuettel D. <a href="http://www.jstatsoft.org/v52/i05/">Fast and Elegant Numerical Linear Algebra Using the RcppEigen Package</a>. <i>Journal of Statistical Software</i>, 2013;52(5):1–24.</dd>
<dt>Bengio <i>et al.</i>, 2007</dt>
<dd>Bengio Y, Lamblin P, Popovici D, Larochelle H. <a href="https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf">Greedy layer-wise training of deep networks</a>. <i>Advances in neural information processing systems</i>. 2007;19:153–60.</dd>
<dt>Hinton & Salakhutdinov, 2006</dt>
<dd>Hinton GE, Salakhutdinov RR. <a href="http://dx.doi.org/10.1126/science.1127647">Reducing the Dimensionality of Data with Neural Networks</a>. <i>Science</i>. 2006;313(5786):504–7.</dd>
</dl>
<h2>Downloads</h2>
<ol>
<li><a href="/files/blog/2022/06/11/MNIST_video.tar.gz">Code to generate the video</a></li>
<li><a href="https://github.com/xrobin/DeepLearning">DeepLearning package source code</a></li>
</ol>
<h1>pROC 1.18.0 (2021-09-06)</h1>
<p>pROC version 1.18.0 is now available on CRAN. Only a few changes were implemented in this release:</p>
<ul>
<li>Add <abbr title="Confidence Interval">CI</abbr> of the estimate for <code>roc.test</code> (DeLong, paired only for now) (code contributed by <a href="https://wz-billings.rbind.io/">Zane Billings</a>) (<a href="https://github.com/xrobin/pROC/pull/95">issue #95</a>).</li>
<li>Fix documentation and alternative hypothesis for Venkatraman test (<a href="https://github.com/xrobin/pROC/issues/92">issue #92</a>).</li>
</ul>
<p>You can update your installation by simply typing:</p>
<pre>install.packages("pROC")</pre>
<h1>pROC 1.17.0.1 (2021-01-13)</h1>
<p>pROC version 1.17.0.1 is available on CRAN now. Besides several bug fixes and small changes, it accepts more values in the <code>input</code> argument of <code>coords</code>.</p>
<p>Here is an example:</p>
<pre>
library(pROC)
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)
coords(rocobj, x = seq(0, 1, .1), input="recall", ret="precision")
# precision
# 1 NaN
# 2 1.0000000
# 3 1.0000000
# 4 0.8601399
# 5 0.6721311
# 6 0.6307692
# 7 0.6373057
# 8 0.4803347
# 9 0.4517906
# 10 0.3997833
# 11 0.3628319
</pre>
<h2>Getting the update</h2>
<p>The update is available on CRAN now. You can update your installation by simply typing:</p>
<pre>install.packages("pROC")</pre>
<p>Here is the full changelog:</p>
<p>1.17.0.1 (2021-01-07):</p>
<ul>
<li>Fix CRAN incoming checks as requested by CRAN.</li>
</ul>
<p>1.17.0 (2020-12-29):</p>
<ul>
<li>Accept more values in <code>input</code> of <code>coords</code> (<a href="https://github.com/xrobin/pROC/issues/67">issue #67</a>).</li>
<li>Accept <code>kappa</code> for the <code>power.roc.test</code> of two ROC curves (<a href="https://github.com/xrobin/pROC/issues/82">issue #82</a>).</li>
<li>The <code>input</code> argument to <code>coords</code> for <code>smooth.roc</code> curves no longer has a default.</li>
<li>The <code>x</code> argument to <code>coords</code> for <code>smooth.roc</code> can now be set to <code>all</code> (also the default).</li>
<li>Fix bootstrap <code>roc.test</code> and <code>cov</code> with <code>smooth.roc</code> curves.</li>
<li>The <code>ggroc</code> function can now plot <code>smooth.roc</code> curves (<a href="https://github.com/xrobin/pROC/issues/86">issue #86</a>).</li>
<li>Remove warnings with <code>warnPartialMatchDollar</code> option (<a href="https://github.com/xrobin/pROC/issues/87">issue #87</a>).</li>
<li>Make tests depending on vdiffr conditional (<a href="https://github.com/xrobin/pROC/issues/88">issue #88</a>).</li>
</ul>
<h1>pROC 1.16.1 (2020-01-14)</h1>
<p>pROC version 1.16.1 is a minor release that disables a timing-dependent test based on the microbenchmark package that can sometimes cause random failures on CRAN. This version contains no user-visible changes. Users don't need to install this update.</p>
<h1>pROC 1.16.0 (2020-01-12)</h1>
<p>pROC version 1.16.0 is available on CRAN now. Besides several bug fixes, the main change is the switch of the default value of the <code>transpose</code> argument of the <code>coords</code> function from <code>TRUE</code> to <code>FALSE</code>. As announced earlier, <strong>this is a backward-incompatible change that will break any script that did not previously set the <code>transpose</code> argument</strong>; for now it comes with a warning to make debugging easier. Scripts that set <code>transpose</code> explicitly are unaffected.</p>
<h2>New return values of <code>coords</code> and <code>ci.coords</code></h2>
<p>With <code>transpose = FALSE</code>, the <code>coords</code> returns a tidy <code>data.frame</code> suitable for use in pipelines:</p>
<pre>
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)
coords(rocobj, c(0.05, 0.2, 0.5), transpose = FALSE)
# threshold specificity sensitivity
# 0.05 0.05 0.06944444 0.9756098
# 0.2 0.20 0.80555556 0.6341463
# 0.5 0.50 0.97222222 0.2926829
</pre>
<p>The function doesn't drop dimensions, so the result is always a <code>data.frame</code>, even if it has only one row and/or one column.</p>
<p>If speed is of utmost importance, you can get the results as a non-transposed matrix instead:
<pre>
coords(rocobj, c(0.05, 0.2, 0.5), transpose = FALSE, as.matrix = TRUE)
# threshold specificity sensitivity
# [1,] 0.05 0.06944444 0.9756098
# [2,] 0.20 0.80555556 0.6341463
# [3,] 0.50 0.97222222 0.2926829
</pre>
<p>In some scenarios this can be a tiny bit faster, and is used internally in <code>ci.coords</code>.</p>
<p>Type <code>help(coords_transpose)</code> for additional information.</p>
<h3><code>ci.coords</code></h3>
<p>The <code>ci.coords</code> function now returns a list-like object:</p>
<pre>
ciobj <- ci.coords(rocobj, c(0.05, 0.2, 0.5))
ciobj$accuracy
# 2.5% 50% 97.5%
# 1 0.3628319 0.3982301 0.4424779
# 2 0.6637168 0.7433628 0.8141593
# 3 0.6725664 0.7256637 0.7787611
</pre>
<p>The <code>print</code> function prints a table with all the results, however this table is generated on the fly and not available directly.</p>
<pre>ciobj
# 95% CI (2000 stratified bootstrap replicates):
# threshold sensitivity.low sensitivity.median sensitivity.high
# 0.05 0.05 0.9268 0.9756 1.0000
# 0.2 0.20 0.4878 0.6341 0.7805
# 0.5 0.50 0.1707 0.2927 0.4390
# specificity.low specificity.median specificity.high accuracy.low
# 0.05 0.01389 0.06944 0.1250 0.3628
# 0.2 0.70830 0.80560 0.8889 0.6637
# 0.5 0.93060 0.97220 1.0000 0.6726
# accuracy.median accuracy.high
# 0.05 0.3982 0.4425
# 0.2 0.7434 0.8142
# 0.5 0.7257 0.7788
</pre>
<p>The following code snippet can be used to obtain all the information calculated by the function:</p>
<pre>
for (ret in attr(ciobj, "ret")) {
print(ciobj[[ret]])
}
# 2.5% 50% 97.5%
# 1 0.9268293 0.9756098 1.0000000
# 2 0.4878049 0.6341463 0.7804878
# 3 0.1707317 0.2926829 0.4390244
# 2.5% 50% 97.5%
# 1 0.01388889 0.06944444 0.1250000
# 2 0.70833333 0.80555556 0.8888889
# 3 0.93055556 0.97222222 1.0000000
# 2.5% 50% 97.5%
# 1 0.3628319 0.3982301 0.4424779
# 2 0.6637168 0.7433628 0.8141593
# 3 0.6725664 0.7256637 0.7787611
</pre>
<h2>Getting the update</h2>
<p>The update is available on CRAN now. You can update your installation by simply typing:</p>
<pre>install.packages("pROC")</pre>
<p>Here is the full changelog:</p>
<ul>
<li>BACKWARD INCOMPATIBLE CHANGE: <code>transpose</code> argument to <code>coords</code> switched to <code>FALSE</code> by default (<a href="https://github.com/xrobin/pROC/issues/54">issue #54</a>).</li>
<li>BACKWARD INCOMPATIBLE CHANGE: <code>ci.coords</code> return value is now of list type and easier to use.</li>
<li>Fix one-sided DeLong test for curves with <code>direction=">"</code> (<a href="https://github.com/xrobin/pROC/issues/64">issue #64</a>).</li>
<li>Fix an error in <code>ci.coords</code> due to expected <code>NA</code> values in some coords (like "precision") (<a href="https://github.com/xrobin/pROC/issues/65">issue #65</a>).</li>
<li>Ordered predictors are converted to numeric in a more robust way (<a href="https://github.com/xrobin/pROC/issues/63">issue #63</a>).</li>
<li>Cleaned up <code>power.roc.test</code> code (<a href="https://github.com/xrobin/pROC/issues/50">issue #50</a>).</li>
<li>Fix pairing with <code>roc.formula</code> and warn if <code>na.action</code> is not set to <code>"na.pass"</code> or <code>"na.fail"</code> (<a href="https://github.com/xrobin/pROC/issues/68">issue #68</a>).</li>
<li>Fix <code>ci.coords</code> not working with <code>smooth.roc</code> curves.</li>
</ul>
<h1>pROC 1.15.3 (2019-07-22)</h1>
<p>A new version of pROC, 1.15.3, has been released and is now available on CRAN. It is a minor bugfix release. Versions 1.15.1 and 1.15.2 were rejected from CRAN.</p>
<p>Here is the full changelog:</p>
<ul>
<li>Fix <code>-Inf</code> threshold in coords for curves with <code>direction = ">"</code> (<a href="https://github.com/xrobin/pROC/issues/60">issue 60</a>).</li>
<li>Keep list order in <code>ggroc</code> (<a href="https://github.com/xrobin/pROC/issues/58">issue 58</a>).</li>
<li>Fix erroneous error in <code>ci.coords</code> with <code>ret="threshold"</code> (<a href="https://github.com/xrobin/pROC/issues/57">issue 57</a>).</li>
<li>Restore lazy loading of the data and fix an <code>R CMD check</code> warning "Variables with usage in documentation object 'aSAH' not in code".</li>
<li>Fix vdiffr unit tests with ggplot2 3.2.0 (<a href="https://github.com/xrobin/pROC/issues/53">issue 53</a>).</li>
</ul>
<h1>pROC 1.15.0 (2019-06-01)</h1>
<p>The latest version of pROC, 1.15.0, has just been released. It features significant speed improvements, many bug fixes, new methods for use in dplyr pipelines, increased verbosity, and prepares the way for some backward-incompatible changes upcoming in pROC 1.16.0.</p>
<h2>Verbosity</h2>
<p>Since its initial release, pROC has been detecting the <code>level</code>s of the positive and negative classes (cases and controls), as well as the <code>direction</code> of the comparison, that is, whether values are higher in case or in control observations. Until now it has been doing so silently, but this has led to several issues and misunderstandings in the past. In particular, because of the detection of <code>direction</code>, ROC curves in pROC will nearly always have an AUC higher than 0.5, which can at times hide problems with certain classifiers, or cause bias in resampling operations such as bootstrapping or cross-validation.</p>
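<p>The reason auto-detected ROC curves rarely dip below an AUC of 0.5 is that flipping the direction maps an AUC of a to 1 - a. This can be seen with a plain rank-sum implementation of the AUC, a base-R sketch independent of pROC:</p>

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc <- function(controls, cases) {
  r <- rank(c(controls, cases))
  rc <- r[seq(length(controls) + 1, length(r))]  # ranks of the cases
  (sum(rc) - length(cases) * (length(cases) + 1) / 2) /
    (length(cases) * length(controls))
}

set.seed(3)
controls <- rnorm(50, mean = 1)  # the marker is *higher* in controls here
cases    <- rnorm(50, mean = 0)
auc(controls, cases)      # well below 0.5 with direction "<"
1 - auc(controls, cases)  # what auto-detection reports after flipping to ">"
```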
<p>In order to increase transparency, pROC 1.15.0 now prints a message on the command line when it auto-detects one of these two arguments.</p>
<pre>
> roc(aSAH$outcome, aSAH$ndka)
<span style="color: red">Setting levels: control = Good, case = Poor
Setting direction: controls < cases</span>
Call:
roc.default(response = aSAH$outcome, predictor = aSAH$ndka)
Data: aSAH$ndka in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
Area under the curve: 0.612
</pre>
<p>If you run pROC repeatedly in loops, you may want to turn off these diagnostic messages. The recommended way is to specify them explicitly:</p>
<pre>
roc(aSAH$outcome, aSAH$ndka, levels = c("Good", "Poor"), direction = "<")
</pre>
<p>Alternatively you can pass <code>quiet = TRUE</code> to the <code>roc</code> function to silently suppress them.</p>
<pre>
roc(aSAH$outcome, aSAH$ndka, quiet = TRUE)
</pre>
<p>As mentioned earlier this last option should be avoided when you are resampling, such as in bootstrap or cross-validation, as this could silently hide some biases due to changing directions.</p>
<h2>Speed</h2>
<p>Several bottlenecks have been removed, yielding significant speedups in the <code>roc</code> function with <code>algorithm = 2</code> (see <a href="https://github.com/xrobin/pROC/issues/44">issue 44</a>), as well as in the <code>coords</code> function which is now vectorized much more efficiently (see <a href="https://github.com/xrobin/pROC/issues/52">issue 52</a>) and scales much better with the number of coordinates to calculate. With these improvements pROC is now as fast as other ROC R packages such as ROCR.</p>
<p>With Big Data becoming more and more prevalent, every speed up matters and making pROC faster has very high priority. If you think that a particular computation is abnormally slow, for instance with a particular combination of arguments, feel free to <a href="https://github.com/xrobin/pROC/issues/new?template=Bug_report.md">submit a bug report</a>.</p>
<p>As a consequence, <code>algorithm = 2</code> is now used by default for numeric predictors, and is automatically selected by the new <code>algorithm = 6</code> meta algorithm. <code>algorithm = 3</code> remains slightly faster with very low numbers of thresholds (below 50) and is still the default with ordered factor predictors.</p>
<h2>Pipelines</h2>
<p>The <code>roc</code> function can be used in pipelines, for instance with <a href="https://dplyr.tidyverse.org/">dplyr</a> or <a href="https://magrittr.tidyverse.org/">magrittr</a>. This is still a highly experimental feature and will change significantly in future versions (see <a href="https://github.com/xrobin/pROC/issues/54">issue 54</a> for instance). Here is an example of usage:</p>
<pre>
library(dplyr)
aSAH %>%
filter(gender == "Female") %>%
roc(outcome, s100b)
</pre>
<p>The <code>roc.data.frame</code> method supports both standard and non-standard evaluation (NSE), and the <code>roc_</code> function supports standard evaluation only. By default it returns the <code>roc</code> object, which can then be piped to the <code>coords</code> function to extract coordinates that can be used in further pipelines:</p>
<pre>
aSAH %>%
filter(gender == "Female") %>%
roc(outcome, s100b) %>%
coords(transpose=FALSE) %>%
filter(sensitivity > 0.6,
specificity > 0.6)
</pre>
<p>More details and use cases are available in the <code>?roc</code> help page.</p>
<h2 id="tc">Transposing coordinates</h2>
<p>Since the initial release of pROC, the <code>coords</code> function has been returning a matrix with thresholds in columns, and the coordinate variables in rows.</p>
<pre>
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)
coords(rocobj, c(0.05, 0.2, 0.5))
# 0.05 0.2 0.5
# threshold 0.05000000 0.2000000 0.5000000
# specificity 0.06944444 0.8055556 0.9722222
# sensitivity 0.97560976 0.6341463 0.2926829
</pre>
<p>This format doesn't conform to the grammar of the <a href="https://www.tidyverse.org">tidyverse</a>, outlined by Hadley Wickham in his 2014 <a href="http://dx.doi.org/10.18637/jss.v059.i10">Tidy Data</a> paper, which has become prevalent in modern R. In addition, the dropping of dimensions by default makes it difficult to guess what type of data <code>coords</code> is going to return.</p>
<pre>
coords(rocobj, "best")
# threshold specificity sensitivity
# 0.2050000 0.8055556 0.6341463
# A numeric vector
</pre>
<p>Although it is possible to pass <code>drop = FALSE</code>, the fact that it is not the default makes the behaviour unintuitive. In an upcoming version of pROC, this will be changed and <code>coords</code> will return a <code>data.frame</code> with the thresholds in rows and measurements in columns by default.</p>
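<p>For reference, the existing <code>drop = FALSE</code> workaround looks like this (a minimal sketch reusing the aSAH example from above):</p>
<pre>
library(pROC)
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)

# drop = FALSE keeps the matrix dimensions even when a single
# threshold is requested, so the return type is predictable:
coords(rocobj, "best", drop = FALSE)
</pre>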
<h3>Changes in 1.15</h3>
<ul>
<li>Addition of the <code>transpose</code> argument.</li>
<li>Display a warning if <code>transpose</code> is missing. Pass <code>transpose</code> explicitly to silence the warning.</li>
<li>Deprecation of <code>as.list</code>.</li>
</ul>
<p>With <code>transpose = FALSE</code>, the output is a tidy <code>data.frame</code> suitable for use in pipelines:</p>
<pre>
coords(rocobj, c(0.05, 0.2, 0.5), transpose = FALSE)
# threshold specificity sensitivity
# 0.05 0.05 0.06944444 0.9756098
# 0.2 0.20 0.80555556 0.6341463
# 0.5 0.50 0.97222222 0.2926829
</pre>
<p>It is recommended that new developments set <code>transpose = FALSE</code> explicitly. These changes are currently API-neutral: apart from the warning, existing functionality is unaffected.</p>
<h3>Upcoming backwards incompatible changes in future version (1.16)</h3>
<p>The next version of pROC will change the default <code>transpose</code> to <code>FALSE</code>. <strong>This is a backward incompatible change that will break any script that did not previously set <code>transpose</code></strong> and will initially come with a warning to make debugging easier. Scripts that set <code>transpose</code> explicitly will be unaffected.</p>
<h3>Recommendations</h3>
<p>If you are writing a script that calls the <code>coords</code> function, set <code>transpose = FALSE</code> to silence the warning and make sure your script keeps running smoothly once the default is changed to <code>FALSE</code>. It is also possible to set <code>transpose = TRUE</code> to keep the current behaviour; however, this option is likely to be deprecated in the long term and ultimately dropped.</p>
<h2>New <code>coords</code> return values</h2>
<p>The <code>coords</code> function can now return two new values, <code>"youden"</code> and <code>"closest.topleft"</code>. They can be returned regardless of whether <code>input = "best"</code> and of the value of the <code>best.method</code> argument (where possible they are not re-calculated), and they follow the <code>best.weights</code> argument as expected. See <a href="https://github.com/xrobin/pROC/issues/48">issue 48</a> for more information.</p>
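<p>A minimal sketch of requesting the new values (reusing the aSAH example from above, with <code>transpose = FALSE</code> for a tidy return value):</p>
<pre>
library(pROC)
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)

# Youden's J statistic and the closest-topleft distance,
# evaluated at the best threshold:
coords(rocobj, "best",
       ret = c("threshold", "youden", "closest.topleft"),
       transpose = FALSE)
</pre>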
<h2>Bug fixes</h2>
<p>Several small bugs have been fixed in this version of pROC. Most of them were identified thanks to an increased <a href="https://codecov.io/github/xrobin/pROC">unit test coverage</a>. 65% of the code is now unit tested, up from 46% a year ago. The main weak points remain the testing of all bootstrapping and resampling operations. If you notice any unexpected or wrong behavior in those, or in any other function, feel free to <a href="https://github.com/xrobin/pROC/issues/new?template=Bug_report.md">submit a bug report</a>.</p>
<h2>Getting the update</h2>
<p>The update is available on CRAN now. You can update your installation by simply typing:</p>
<pre>install.packages("pROC")</pre>
<p>Here is the full changelog:</p>
<ul>
<li><code>roc</code> now prints messages when autodetecting <code>levels</code> and <code>direction</code> by default. Turn off with <code>quiet = TRUE</code> or set these values explicitly.</li>
<li>Speedup with <code>algorithm = 2</code> (<a href="https://github.com/xrobin/pROC/issues/44">issue 44</a>) and in <code>coords</code> (<a href="https://github.com/xrobin/pROC/issues/52">issue 52</a>).</li>
<li>New <code>algorithm = 6</code> (used by default) uses <code>algorithm = 2</code> for numeric data, and <code>algorithm = 3</code> for ordered vectors.</li>
<li>New <code>roc.data.frame</code> method and <code>roc_</code> function for use in pipelines.</li>
<li><code>coords</code> can now return <code>"youden"</code> and <code>"closest.topleft"</code> values (<a href="https://github.com/xrobin/pROC/issues/48">issue 48</a>).</li>
<li>New <code>transpose</code> argument for <code>coords</code>, <code>TRUE</code> by default (<a href="https://github.com/xrobin/pROC/issues/54">issue 54</a>).</li>
<li>Use text instead of Tcl/Tk progress bar by default (<a href="https://github.com/xrobin/pROC/issues/51">issue 51</a>).</li>
<li>Fix <code>method = "density"</code> smoothing when called directly from <code>roc</code> (<a href="https://github.com/xrobin/pROC/issues/49">issue 49</a>).</li>
<li>Renamed <code>roc</code> argument <code>n</code> to <code>smooth.n</code>.</li>
<li>Fixed <code>are.paired</code> ignoring smoothing arguments of <code>roc2</code> with <code>return.paired.rocs</code>.</li>
<li>New <code>ret</code> option <code>"all"</code> in <code>coords</code> (<a href="https://github.com/xrobin/pROC/issues/47">issue 47</a>)</li>
<li><code>drop</code> in <code>coords</code> now drops the dimension of <code>ret</code> too (<a href="https://github.com/xrobin/pROC/issues/43">issue 43</a>)</li>
</ul>
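<p>The first changelog item in practice (the message wording shown in the comments is indicative and may differ slightly between versions):</p>
<pre>
library(pROC)
data(aSAH)

# By default, roc() now reports what it auto-detected:
roc(aSAH$outcome, aSAH$s100b)
# Setting levels: control = Good, case = Poor
# Setting direction: controls &lt; cases

# Silence the messages, or set the values explicitly:
roc(aSAH$outcome, aSAH$s100b, quiet = TRUE)
roc(aSAH$outcome, aSAH$s100b, levels = c("Good", "Poor"), direction = "&lt;")
</pre>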
pROC 1.14.0tag:xavier.robin.name,2019-03-13:/blog/2019/03/13/proc-1.14.02019-03-13T10:22:42+01:002019-03-13T10:22:42+01:00<p>pROC 1.14.0 was released with many bug fixes and some new features.</p>
<h2>Multiclass ROC</h2>
<p>The <code>multiclass.roc</code> function can now take a multivariate input with columns corresponding to scores of the different classes. The columns must be named with the corresponding class labels. Thanks Matthias Döring for the contribution.</p>
<p>Let's see how to use it in practice with the iris dataset. Let's first split the dataset into a training and test sets:</p>
<pre>
data(iris)
iris.sample <- sample(1:150)
iris.train <- iris[iris.sample[1:75],]
iris.test <- iris[iris.sample[76:150],]
</pre>
<p>We'll use the <code>nnet</code> package to generate some predictions. We pass <code>type="prob"</code> to the <code>predict</code> function to get class probabilities.</p>
<pre>library("nnet")
mn.net <- nnet::multinom(Species ~ ., iris.train)
iris.predictions <- predict(mn.net, newdata=iris.test, type="prob")
head(iris.predictions)
</pre>
<pre>
setosa versicolor virginica
63 2.877502e-21 1.000000e+00 6.647660e-19
134 1.726936e-27 9.999346e-01 6.543642e-05
150 1.074627e-28 7.914019e-03 9.920860e-01
120 6.687744e-34 9.986586e-01 1.341419e-03
6 1.000000e+00 1.845491e-24 6.590050e-72
129 4.094873e-45 1.779882e-15 1.000000e+00
</pre>
<p>Notice the column names, identical to the class labels. Now we can use the <code>multiclass.roc</code> function directly:</p>
<pre>multiclass.roc(iris.test$Species, iris.predictions)</pre>
<p>Many modelling functions have similar interfaces, where the output of <code>predict</code> can be changed with an extra argument. Check their documentation to find out how to get the required data.</p>
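<p>For instance, assuming the <code>randomForest</code> package (not used elsewhere in this post) is installed, its <code>predict</code> method exposes the same kind of switch and also returns a matrix with class-named columns:</p>
<pre>
library(pROC)
library(randomForest)  # assumed installed
data(iris)
iris.sample <- sample(1:150)
iris.train <- iris[iris.sample[1:75], ]
iris.test <- iris[iris.sample[76:150], ]

rf <- randomForest(Species ~ ., data = iris.train)
# type = "prob" returns a matrix of class probabilities whose
# column names match the class labels, as multiclass.roc expects:
rf.predictions <- predict(rf, newdata = iris.test, type = "prob")
multiclass.roc(iris.test$Species, rf.predictions)
</pre>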
<h2>Multiple aesthetics for <code>ggroc</code></h2>
<p>It is now possible to pass several aesthetics to <code>ggroc</code>. So for instance you can map a curve to both <code>colour</code> and <code>linetype</code>:</p>
<pre>
roc.list <- roc(outcome ~ s100b + ndka + wfns, data = aSAH)
ggroc(roc.list, aes=c("linetype", "color"))
</pre>
<p class="imglegende center" style="max-width:768px"><img src="/files/blog/2019/03/12/ggroc_multiple_aes.png" alt="ROC curves mapped to several aesthetics"> <span>Mapping 3 ROC curves to 2 aesthetics with ggroc.</span></p>
<h2>Getting the update</h2>
<p>The update is available on CRAN now. You can update your installation by simply typing:</p>
<pre>install.packages("pROC")</pre>
<p>Here is the full changelog:</p>
<ul>
<li>The <code>multiclass.roc</code> function now accepts multivariate decision values (code contributed by Matthias Döring).</li>
<li><code>ggroc</code> supports multiple aesthetics.</li>
<li>Make <i>ggplot2</i> dependency optional.</li>
<li>Suggested packages can be installed interactively when required.</li>
<li>Passing both <code>cases</code> and <code>controls</code> or <code>response</code> and <code>predictor</code> arguments is now an error.</li>
<li>Many small bug fixes.</li>
</ul>pROC 1.13.0tag:xavier.robin.name,2018-09-24:/blog/2018/09/24/proc-1.13.02018-09-24T20:09:07+02:002018-09-24T20:10:44+02:00<p>pROC 1.13.0 was just released with bug fixes and a new feature.</p>
<h2>Infinite values in predictor</h2>
<p>Following the release of pROC 1.12, it quickly became clear with <a href="https://github.com/xrobin/pROC/issues/30">issue #30</a> that infinite values were handled differently by the different algorithms of pROC. The problem with these values is that they cannot be thresholded: an <code>Inf</code> compares greater than any finite value. This means that in some cases, it may not be possible to reach 0 or 100% specificity or sensitivity. This also revealed that threshold-agnostic algorithms such as <code>algorithm="2"</code> or the DeLong theta calculations would happily reach 0 or 100% specificity or sensitivity in those cases, although those values are unattainable.</p>
<p>Starting with 1.13.0, when pROC's <code>roc</code> function finds any infinite value in the <code>predictor</code> argument, or in <code>controls</code> or <code>cases</code>, it will return <code>NaN</code> (not a number).</p>
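<p>A minimal sketch of the new behaviour (toy data made up for illustration; depending on the computation, the <code>NaN</code> surfaces in the AUC or the sensitivities/specificities):</p>
<pre>
library(pROC)

response  <- c(0, 0, 0, 1, 1, 1)
predictor <- c(0.1, 0.2, 0.3, 0.4, 0.5, Inf)

# Inf cannot be thresholded meaningfully, so rather than silently
# reporting an unattainable 0% or 100% specificity or sensitivity,
# pROC now returns NaN:
roc(response, predictor)
</pre>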
<h2>Numerical accuracy</h2>
<p>The handling of near ties close to + or - Infinity or 0 has been improved by calculating the threshold (the mean between two consecutive predictor values) differently depending on the magnitude of the values involved. This preserves as much precision as possible close to 0 without overflowing for large absolute values.</p>
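<p>The trade-off can be sketched as follows (an illustration of the principle, not pROC's actual code):</p>
<pre>
# Naive midpoint: accurate near 0, but a + b overflows to Inf
# when a and b are both close to .Machine$double.xmax:
midpoint_naive <- function(a, b) (a + b) / 2

# Halved form: safe for huge values, but a/2 and b/2 can lose
# precision for denormalised values very close to 0:
midpoint_halved <- function(a, b) a / 2 + b / 2

big <- .Machine$double.xmax
midpoint_naive(big, big)   # Inf: the sum overflows
midpoint_halved(big, big)  # finite, equal to big
</pre>
<p>Choosing between the two forms based on the values themselves gives the best of both.</p>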
<h2>New argument for ggroc</h2>
<p><code>ggroc</code> can now take a new value for the <code>aes</code> argument, <code>aes="group"</code>. Consistent with ggplot2, it allows curves with identical aesthetics to be split into different groups. This is especially useful in facetted plots, for instance.</p>
<pre>library(pROC)
data(aSAH)
roc.list <- roc(outcome ~ s100b + ndka + wfns, data = aSAH)
g.list <- ggroc(roc.list)
g.group <- ggroc(roc.list, aes="group")
g.group + facet_grid(.~name)
</pre>
<p class="imglegende center" style="max-width:672px"><img src="/files/blog/2018/09/24/ggroc_facet.png" alt="3 ROC curves in a facetted ggplot2 panel"> <span>Facetting of 3 ROC curves with ggroc.</span></p>
<h2>Getting the update</h2>
<p>The update has just been accepted on CRAN and should be online soon. Once it is out, update your installation by simply typing:</p>
<pre>install.packages("pROC")</pre>
<p>The full changelog is:</p>
<ul>
<li><code>roc</code> now returns <code>NaN</code> when predictor contains infinite values ( <a href="https://github.com/xrobin/pROC/issues/30">issue #30</a>).</li>
<li>Better handling of near-ties near +-Infinity and 0.</li>
<li><code>ggroc</code> supports <code>aes="group"</code> to allow curves with identical aesthetics.</li>
</ul>