pROC 1.15.0

The latest version of pROC, 1.15.0, has just been released. It features significant speed improvements, many bug fixes, new methods for use in dplyr pipelines, and increased verbosity, and it prepares the way for some backwards-incompatible changes coming in pROC 1.16.0.

Verbosity

Since its initial release, pROC has been detecting the levels of the positive and negative classes (cases and controls), as well as the direction of the comparison, that is, whether values are higher in case or in control observations. Until now it has been doing so silently, but this has led to several issues and misunderstandings in the past. In particular, because of the detection of direction, ROC curves in pROC will nearly always have an AUC higher than 0.5, which can at times hide problems with certain classifiers, or cause bias in resampling operations such as bootstrapping or cross-validation.

In order to increase transparency, pROC 1.15.0 now prints a message on the command line when it auto-detects one of these two arguments.

	> roc(aSAH$outcome, aSAH$ndka)
	Setting levels: control = Good, case = Poor
	Setting direction: controls < cases

	Call:
	roc.default(response = aSAH$outcome, predictor = aSAH$ndka)

	Data: aSAH$ndka in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
	Area under the curve: 0.612

If you run pROC repeatedly in loops, you may want to turn off these diagnostic messages. The recommended way is to specify the levels and direction arguments explicitly:

	roc(aSAH$outcome, aSAH$ndka, levels = c("Good", "Poor"), direction = "<")

Alternatively, you can pass quiet = TRUE to the roc function to silence the messages:

	roc(aSAH$outcome, aSAH$ndka, quiet = TRUE)

As mentioned earlier, this last option should be avoided when you are resampling, such as in bootstrap or cross-validation, as it could silently hide biases due to changing directions.
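To make the bias concrete, here is a minimal sketch on simulated data with no real signal, so the true AUC is 0.5. The data, seed and variable names are made up for illustration; only roc, auc and the levels, direction and quiet arguments come from pROC.

```r
library(pROC)

set.seed(42)  # arbitrary seed, for reproducibility only
n_rep <- 200
auc_auto  <- numeric(n_rep)
auc_fixed <- numeric(n_rep)

for (i in seq_len(n_rep)) {
    # Simulated data without any signal: the predictor is pure noise
    response  <- factor(rep(c("Good", "Poor"), each = 20))
    predictor <- rnorm(40)

    # Direction auto-detected on each replicate, messages silenced
    auc_auto[i] <- as.numeric(auc(roc(response, predictor,
                                      levels = c("Good", "Poor"),
                                      quiet = TRUE)))
    # Direction fixed in advance, identical for every replicate
    auc_fixed[i] <- as.numeric(auc(roc(response, predictor,
                                       levels = c("Good", "Poor"),
                                       direction = "<")))
}

mean(auc_auto)   # pulled above 0.5 by the per-replicate direction choice
mean(auc_fixed)  # stays close to 0.5
```

With the direction re-detected on every replicate, each AUC is either the fixed-direction value or its complement, whichever the detection favors, which is why the average drifts upwards.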

Speed

Several bottlenecks have been removed, yielding significant speedups in the roc function with algorithm = 2 (see issue 44), as well as in the coords function which is now vectorized much more efficiently (see issue 52) and scales much better with the number of coordinates to calculate. With these improvements pROC is now as fast as other ROC R packages such as ROCR.

With Big Data becoming more and more prevalent, every speedup matters, and making pROC faster is a very high priority. If you think that a particular computation is abnormally slow, for instance with a particular combination of arguments, feel free to submit a bug report.

As a consequence, algorithm = 2 is now used by default for numeric predictors, and is automatically selected by the new algorithm = 6 meta algorithm. algorithm = 3 remains slightly faster with very low numbers of thresholds (below 50) and is still the default with ordered factor predictors.
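If you want to check how the algorithms compare on your own data, a rough sketch along these lines may help. The simulated data and variable names are assumptions made for illustration, and timings will vary with the machine and the number of thresholds.

```r
library(pROC)

set.seed(1)  # arbitrary seed
n <- 5000
response  <- rbinom(n, 1, 0.5)          # 0 = control, 1 = case
predictor <- rnorm(n, mean = response)  # cases shifted upwards

# Same data, same levels and direction, only the algorithm differs
time_algo2 <- system.time(
    roc2 <- roc(response, predictor, levels = c(0, 1), direction = "<",
                algorithm = 2)
)
time_algo3 <- system.time(
    roc3 <- roc(response, predictor, levels = c(0, 1), direction = "<",
                algorithm = 3)
)

# The algorithms should agree on the result and differ only in speed
all.equal(as.numeric(auc(roc2)), as.numeric(auc(roc3)))
time_algo2["elapsed"]
time_algo3["elapsed"]
```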

Pipelines

The roc function can be used in pipelines, for instance with dplyr or magrittr. This is still a highly experimental feature and will change significantly in future versions (see issue 54 for instance). Here is an example of usage:

library(dplyr)
aSAH %>% 
    filter(gender == "Female") %>% 
    roc(outcome, s100b)

The roc.data.frame method supports both standard and non-standard evaluation (NSE), whereas the roc_ function supports standard evaluation only. By default it returns the roc object, which can then be piped to the coords function to extract coordinates that can be used in further pipelines:

aSAH %>%
    filter(gender == "Female") %>%
    roc(outcome, s100b) %>%
    coords(transpose = FALSE) %>%
    filter(sensitivity > 0.6,
           specificity > 0.6)

More details and use cases are available in the ?roc help page.
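For programmatic use, where column names are held in character variables, the standard-evaluation roc_ function mentioned above can be called as in the following sketch. The variable names response_col and predictor_col are hypothetical, and the sketch assumes that roc_ forwards extra arguments such as levels and direction to roc.

```r
library(pROC)
data(aSAH)

# Column names stored as strings, e.g. read from a configuration file
response_col  <- "outcome"
predictor_col <- "s100b"

# Standard evaluation: columns are given as character strings
rocobj_se <- roc_(aSAH, response_col, predictor_col,
                  levels = c("Good", "Poor"), direction = "<")
auc(rocobj_se)
```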

Transposing coordinates

Since the initial release of pROC, the coords function has been returning a matrix with thresholds in columns, and the coordinate variables in rows.

data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)
coords(rocobj, c(0.05, 0.2, 0.5))
#                   0.05       0.2       0.5
# threshold   0.05000000 0.2000000 0.5000000
# specificity 0.06944444 0.8055556 0.9722222
# sensitivity 0.97560976 0.6341463 0.2926829

This format doesn't conform to the grammar of the tidyverse, outlined by Hadley Wickham in his 2014 Tidy Data paper, which has become prevalent in modern R code. In addition, the dropping of dimensions by default makes it difficult to guess what type of data coords is going to return.

	coords(rocobj, "best")
	#   threshold specificity sensitivity 
	#   0.2050000   0.8055556   0.6341463 
	# A numeric vector

Although it is possible to pass drop = FALSE, the fact that it is not the default makes the behaviour unintuitive. In an upcoming version of pROC, this will be changed and coords will return a data.frame with the thresholds in rows and measurements in columns by default.
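As a small illustration of the dimension dropping, the sketch below compares the two return types. It sets transpose = TRUE explicitly only to pin the matrix layout described above; the variable names are made up for the example.

```r
library(pROC)
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b,
              levels = c("Good", "Poor"), direction = "<")

# Default drop = TRUE: a single threshold collapses to a named vector
best_vec <- coords(rocobj, "best", transpose = TRUE)
# drop = FALSE keeps the matrix structure even for a single threshold
best_mat <- coords(rocobj, "best", transpose = TRUE, drop = FALSE)

is.matrix(best_vec)
is.matrix(best_mat)
dim(best_mat)
```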

Changes in 1.15

With transpose = FALSE, the output is a tidy data.frame suitable for use in pipelines:

coords(rocobj, c(0.05, 0.2, 0.5), transpose = FALSE)
#      threshold specificity sensitivity
# 0.05      0.05  0.06944444   0.9756098
# 0.2       0.20  0.80555556   0.6341463
# 0.5       0.50  0.97222222   0.2926829

It is recommended that new developments set transpose = FALSE explicitly. Currently these changes are neutral to the API and do not affect functionality outside of a warning.
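When migrating an existing script, it can be reassuring to confirm that both layouts carry the same numbers. The following sketch does that by transposing the old matrix; the variable names are illustrative only.

```r
library(pROC)
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b,
              levels = c("Good", "Poor"), direction = "<")
thresholds <- c(0.05, 0.2, 0.5)

# Historical layout: thresholds in columns, coordinates in rows
old_layout <- coords(rocobj, thresholds, transpose = TRUE)
# New layout: thresholds in rows, coordinates in columns
new_layout <- coords(rocobj, thresholds, transpose = FALSE)

# Same values once the old matrix is transposed (names ignored)
all.equal(unname(t(old_layout)), unname(as.matrix(new_layout)))

# The new layout can be subset by column name, like any data.frame
new_layout[new_layout$specificity > 0.5, ]
```

Code that indexed the old matrix by row name, such as old_layout["specificity", ], would become a column access in the new layout.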

Upcoming backwards incompatible changes in future version (1.16)

The next version of pROC will change the default transpose to FALSE. This is a backwards-incompatible change that will break any script that did not previously set transpose; it will initially come with a warning to make debugging easier. Scripts that set transpose explicitly will be unaffected.

Recommendations

If you are writing a script calling the coords function, set transpose = FALSE to silence the warning and make sure your script keeps running smoothly once the default transpose is changed to FALSE. It is also possible to set transpose = TRUE to keep the current behavior; however, this option is likely to be deprecated in the long term, and ultimately dropped.

New coords return values

The coords function can now return two new values, "youden" and "closest.topleft". They can be returned regardless of whether input = "best" and of the value of the best.method argument, although they will not be re-calculated if possible. They follow the best.weights argument as expected. See issue 48 for more information.
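As a short sketch, the new values can be requested through the ret argument like any other coordinate. The example below uses the aSAH data from the earlier examples; the variable names are illustrative.

```r
library(pROC)
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b,
              levels = c("Good", "Poor"), direction = "<")

# Request the new values at the "best" threshold
best <- coords(rocobj, "best",
               ret = c("threshold", "specificity", "sensitivity",
                       "youden", "closest.topleft"),
               transpose = FALSE)
best

# They can also be requested for every threshold of the curve
all_thr <- coords(rocobj, "all",
                  ret = c("threshold", "youden", "closest.topleft"),
                  transpose = FALSE)
head(all_thr)
```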

Bug fixes

Several small bugs have been fixed in this version of pROC. Most of them were identified thanks to increased unit test coverage: 65% of the code is now unit tested, up from 46% a year ago. The main weak point remains the testing of bootstrapping and resampling operations. If you notice any unexpected or wrong behavior in those, or in any other function, feel free to submit a bug report.

Getting the update

The update is available on CRAN now. You can update your installation by simply typing:

install.packages("pROC")

Here is the full changelog:

Xavier Robin
Published on Saturday, June 1, 2019 at 09:33 CEST
Permalink: /blog/2019/06/01/proc-1.15.0
Tags: pROC