Note: this post is a replication of our kernel published on Kaggle—NCI Thesaurus & Naive Bayes (vs RF, GBM, GLM & DL).

Following our previous kernel Preliminary Investigation: NCIt & Binary Encoding, we present another attempt at solving the Personalized Medicine: Redefining Cancer Treatment problem, but this time using naive Bayes.

A quick look at the literature revealed that naive Bayes had long been a popular choice for various text classification problems—see, e.g., the work of Cohen et al. (2005)1 or Hotho et al. (2005)2 and references therein. Although we are not dealing with a pure text classification problem here, there are some obvious similarities; hence our interest in trying out a naive Bayes classifier…

Apparently, the main reason behind the popularity of the naive Bayes method is, to quote Rennie et al. (2003, abstract)3, "because it is fast and easy to implement." That reason is hardly compelling when predictive performance is of critical importance. The authors open their introduction with this sobering statement (page 1):

Naive Bayes has been denigrated as “the punching bag of classifiers” (Lewis, 1998), and has earned the dubious distinction of placing last or near last in numerous head-to-head classification papers (Yang & Liu, 1999; Joachims, 1998; Zhang & Oles, 2001).

One reason behind its reported poor performance seems to lie in the strong assumptions required by the method, which are in most cases not satisfied (typically, conditional independence between features, plus further assumptions about the distribution of those features). It is worth noting, though, that “naive Bayes assumptions can yield an optimal classifier in a variety of situations where the assumptions are widely violated” (Lewis, 1998, page 9)4.
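To make the assumption explicit, here is the standard formulation (a textbook reminder, not taken from any of the cited papers): for a class $c$ and features $x_1, \dots, x_n$, naive Bayes assumes

$$
P(x_1, \dots, x_n \mid c) \;=\; \prod_{i=1}^{n} P(x_i \mid c),
\qquad\text{so that}\qquad
\hat{c} \;=\; \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c).
$$

In a text setting, the $x_i$ are typically term occurrences (or TF-IDF weights, as below), and words within the same document are obviously not independent given the class.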

Given that we do not implement any of the proposed remedies to mitigate the described common drawbacks of the naive Bayes method in text classification problems (Rennie et al., 2003)3, we do not expect a winning outcome here…

The feature tinkering part is very similar to the one in our previous kernel. In particular, the NCI Thesaurus5—NCIt in short—is used to create a specialized dictionary of medical terms, which then limits the terms extracted from the text data.

Feature Tinkering

Processing Corpora

Two corpora will need to be processed in the same way (one to generate a specialized dictionary and the other to extract features from the clinical evidence provided in plain text form).

processCorpus <- function (corpus) {
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stemDocument, language = "english")
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # remove standalone numbers (digits that are not part of a word or a dashed token)
  corpus <- tm_map(corpus, content_transformer(function (x) {
    gsub("\\s*(?<!\\B|-)\\d+(?!\\B|-)\\s*", "", x, perl = TRUE)}))
  corpus <- tm_map(corpus, stripWhitespace)
  return (corpus)
}
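As a quick illustration of what this pipeline does, one can run it on a single made-up sentence (a hypothetical toy example, not part of the competition data):

library(tm)

toy <- Corpus(VectorSource(
  "The BRAF V600E mutation was reported in 2014 to activate the MAPK pathway."))
toy <- processCorpus(toy)

# inspect the processed text: lowercased, punctuation removed, words stemmed,
# English stopwords removed, and standalone numbers stripped
as.character(toy[[1]])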

Get Dictionary from NCI Thesaurus (NCIt)

We get the NCI Thesaurus (the ReadMe file provides information regarding the format).

Bear in mind that the NCI Thesaurus’s license terms and conditions need to be accepted prior to using the file. (As stated in the FAQ: “basically under the license you can do anything you want with the NCI Thesaurus content except to change it while still calling it NCI Thesaurus.”)

getDictionary <- function (filePath) {
  if (is.null(filePath)) return(NULL)
  # parse the tab-separated flat file (see the ReadMe for the column layout)
  ret <- do.call(rbind, strsplit(readLines(filePath), "\t", fixed = TRUE))
  ret <- setNames(data.table(ret, stringsAsFactors = FALSE), 
                  c("code", "concept_name", "parents", 
                    "synonyms", "definition", "display_name",
                    "concept_status", "semantic_type"))
  
  # build a corpus from the pipe-separated synonyms and normalize it
  corpus <- Corpus(VectorSource(gsub("\\|", " ", ret$synonyms)))
  corpus <- processCorpus(corpus)
  
  terms <- DocumentTermMatrix(
    corpus, 
    control = list(
      minWordLength = 3,
      weighting = function(x) weightTfIdf(x, normalize = FALSE)))
  
  # the dictionary is the set of (stemmed) terms extracted from the thesaurus
  return (as.vector(terms$dimnames$Terms))
}
filePath <- file.path(
    ".", 
    "Personalized Medicine-Redefining Cancer Treatment", 
    "data", "Thesaurus.txt", fsep = .Platform$file.sep)
tryCatch({
  if (is.null(filePath) || !file.exists(filePath)) {
    # download and extract the thesaurus flat file if it is not already available locally
    destfile <- file.path(".", "thesaurus.zip", fsep = .Platform$file.sep)
    download.file(
      url = "https://evs.nci.nih.gov/ftp1/NCI_Thesaurus/Thesaurus_17.06d.FLAT.zip",
      destfile = destfile)
    filePath <<- unzip(destfile, exdir = ".")
    file.remove(destfile)
    rm(destfile)
  }
}, error = function(e) {
  print(e)
  filePath <<- NULL
})
medDico <- getDictionary(filePath)
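As an optional quick check (output not shown here), one can inspect the size and content of the resulting dictionary:

# number of distinct (stemmed) medical terms retained from the thesaurus
length(medDico)

# a small sample of the dictionary
head(medDico, 20)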

Extract Features from Text Files

Note that we use the test set in the feature tinkering process. Bear in mind that this is definitely not good practice, for the test set becomes part of the training process (the resulting model might perform well on this particular test set, yet fail miserably to generalize to new, unseen data). A leakage-free alternative is sketched right after the feature-extraction code below.

getFeaturesFromTxt <- function (data, dictionary, sparse = NULL) {
  corpus <- Corpus(VectorSource(data$txt))
  corpus <- processCorpus(corpus)
  if (!is.null(dictionary)) {
    # restrict the extracted terms to the ones present in the dictionary
    terms <- DocumentTermMatrix(
      corpus,
      control = list(
        dictionary = dictionary, 
        weighting = function(x) weightTfIdf(x, normalize = TRUE)))
  } else {
    terms <- DocumentTermMatrix(
      corpus, 
      control = list(
        minWordLength = 3,
        weighting = function(x) weightTfIdf(x, normalize = FALSE)))
  }
  # optionally drop very sparse terms
  if (!is.null(sparse)) {
    terms <- removeSparseTerms(terms, sparse)
  }
  return (as.matrix(terms))
}
extractedFeatures <- getFeaturesFromTxt(
  data = data, 
  dictionary = medDico, 
  sparse = ifelse(length(medDico) > 5000, 0.95, 0.99))
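As mentioned above, a leakage-free variant would build the document-term matrix on the training rows only, and then extract the test features restricted to the very same terms. A minimal sketch, assuming a hypothetical logical vector isTrain flagging the training rows of data (not defined in this kernel):

# fit the term extraction on the training documents only
trainFeatures <- getFeaturesFromTxt(
  data = data[isTrain, ], 
  dictionary = medDico, 
  sparse = ifelse(length(medDico) > 5000, 0.95, 0.99))

# reuse the resulting terms as a fixed dictionary for the test documents
testFeatures <- getFeaturesFromTxt(
  data = data[!isTrain, ], 
  dictionary = colnames(trainFeatures), 
  sparse = NULL)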


[Figure: TF-IDF-based word cloud of document ID 152.]


data <- cbind(data, extractedFeatures)
data[, txt := NULL]

Naive Bayes Classifier

In R, there are several implementations, for example in the packages e1071, klaR, naivebayes or h2o. Here, we use the one from h2o.
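For contrast only (this is not the approach used in this kernel), a fit with e1071 would look roughly as follows, assuming the train data frame has its Class column stored as a factor:

library(e1071)

# naive Bayes with Laplace smoothing; Class is assumed to be a factor
fitE1071 <- naiveBayes(Class ~ ., data = train, laplace = 1)

# per-class probabilities for new observations
probs <- predict(fitE1071, newdata = test, type = "raw")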

The predictors are (truncated): Gene, Variation, onset, pci, percutan, stablhrs, stemi, symptom, fulldos, infarctionst, stabl, success, therapi, thrombolyt, infarctionunstableovhour, unstablhrs, mitral, repair, valv, pericardi, strip, evalu, postcardiac, transplant, noncardiovascular, bpr01blunt, disorient, methotrexateprednisonethioguaninevincristin, mtxpredtgvcr, qtcun, eq5d01mobl, mobil, eq5d0102, selfcar, eq5d0104, vas, hamd101, hamd1depress, r24, ansamycin, antibiot, antineoplast, hamd103, hamd108, ch50, plathct, a1aglp, npap, iohexol, rubriblast (there are 728 features); we ignore the columns ID, Class, txt (toIgnore).
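The exact construction of predictors, response and toIgnore is not shown here; presumably it boils down to something like the following sketch (variable names as used in the rest of the post):

# columns that are not used as predictors (as listed above)
toIgnore <- c("ID", "Class", "txt")

response <- "Class"
predictors <- setdiff(colnames(data), toIgnore)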

Data Split

A validation set is extracted from the train set; we need it to select the best model out of the grid search.

splits <- h2o.splitFrame(
  data = as.h2o(train), 
  ratios = c(0.8),
  seed = 1)

trainH2o <- splits[[1]]
validH2o <- splits[[2]]
testH2o <- as.h2o(test)
hyperParamsNaiveBayes <- list( 
  laplace = c(0, 5),
  min_sdev = c(0.001, 0.004, 0.006),
  # eps_sdev = c(0, 0.002),
  min_prob = c(0.001, 0.006),
  # eps_prob = c(0, 0.002),
  compute_metrics = c(TRUE)) 

h2o.rm("grid_naivebayes")
gridNaiveBayes <- h2o.grid(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  hyper_params = hyperParamsNaiveBayes,
  search_criteria = list(strategy = "Cartesian"),
  algorithm = "naivebayes",
  grid_id = "grid_naivebayes", 
  balance_classes = FALSE,
  seed = 1)
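The sortedGridNb object used below is not defined in the snippets above; presumably it is obtained by sorting the grid by validation accuracy, along these lines:

# retrieve the grid results sorted by accuracy (decreasing), so that the
# first model id corresponds to the best configuration
sortedGridNb <- h2o.getGrid(
  grid_id = "grid_naivebayes", 
  sort_by = "accuracy", 
  decreasing = TRUE)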

The following table summarizes the grid results (sorted by decreasing accuracy):

compute_metrics laplace min_prob min_sdev accuracy
1 true 5.0 0.001 0.006 0.4810126582278481
2 true 0.0 0.001 0.006 0.4810126582278481
3 true 5.0 0.006 0.006 0.4810126582278481
4 true 0.0 0.006 0.006 0.4810126582278481
5 true 5.0 0.001 0.004 0.4683544303797469
6 true 0.0 0.001 0.004 0.4683544303797469
7 true 5.0 0.006 0.004 0.4683544303797469
8 true 0.0 0.006 0.004 0.4683544303797469
9 true 0.0 0.001 0.001 0.45569620253164556
10 true 5.0 0.006 0.001 0.44303797468354433
11 true 5.0 0.001 0.001 0.44303797468354433
12 true 0.0 0.006 0.001 0.44303797468354433

Performance on the Validation Set

bestNbModel <- h2o.getModel(sortedGridNb@model_ids[[1]])
h2o.confusionMatrix(bestNbModel, valid = TRUE)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##         1 2 3  4 5 6  7 8 9  Error      Rate
## 1       6 1 0  2 0 0  3 0 0 0.5000 =  6 / 12
## 2       1 2 0  1 0 0  6 0 0 0.8000 =  8 / 10
## 3       0 1 0  0 0 0  1 0 0 1.0000 =   2 / 2
## 4       6 0 0  7 1 2  3 0 0 0.6316 = 12 / 19
## 5       1 1 0  1 3 0  0 0 0 0.5000 =   3 / 6
## 6       1 1 0  0 0 3  1 0 0 0.5000 =   3 / 6
## 7       2 1 1  2 0 0 17 0 0 0.2609 =  6 / 23
## 8       0 0 0  0 0 0  0 0 0        =   0 / 0
## 9       0 0 0  0 0 0  1 0 0 1.0000 =   1 / 1
## Totals 17 7 1 13 4 5 32 0 0 0.5190 = 41 / 79
predValid <- as.data.table(h2o.predict(bestNbModel, newdata = validH2o))
predValid[, predict := NULL]
# one-hot encode the ground truth, making sure all 9 classes are represented
# (one of them could be missing from the smaller validation set): append the
# nine class labels, build the indicator matrix, then drop the appended rows
y_true <- as.vector(validH2o[[response]])
y_true <- c(y_true, as.character(1:9))
y_true <- model.matrix(~ 0 + ., data.frame(as.character(y_true)))
y_true <- y_true[-((nrow(y_true) - 9 + 1):nrow(y_true)), ]

nbScore <- MultiLogLoss(y_pred = as.matrix(predValid), y_true = y_true)
## [1] 17.92519

Hum… Another example that won’t contradict the unflattering title of “the punching bag of classifiers.” One could argue that the feature extraction from text is overly simplistic here, which certainly (at least partly) explains those poor results. On the other hand, our previous kernel describes a solution using XGBoost with almost the same features and achieves a much better outcome…

Retrain the Best Model Using the Entire Train Set

Here we use k-fold cross-validation to evaluate the final model (trained using the entire train set—i.e., rbind(trainH2o, validH2o)).

nfolds <- ifelse(kIsOnKaggle, 2, 8)
finalModel <- do.call(h2o.naiveBayes,
                      {
                        # reuse the best model's parameters, but train on the
                        # full train set with cross-validation instead of a
                        # separate validation frame
                        p <- bestNbModel@parameters
                        p$model_id <- NULL
                        p$training_frame <- h2o.rbind(trainH2o, validH2o)
                        p$validation_frame <- NULL
                        p$nfolds <- nfolds
                        p
                      })
finalModel@model$cross_validation_metrics_summary
## Cross-Validation Metrics Summary: 
##                               mean          sd cv_1_valid cv_2_valid
## accuracy                  0.377501 3.536949E-5 0.37755102 0.37745097
## err                       0.622499 3.536949E-5   0.622449   0.622549
## err_count                    124.5    1.767767      122.0      127.0
## logloss                  20.946302  0.25053015     20.592  21.300606
## max_per_class_error            1.0         0.0        1.0        1.0
## mean_per_class_accuracy 0.33982143 0.054649327  0.2625356 0.41710725
## mean_per_class_error    0.66017854 0.054649327 0.73746437  0.5828928
## mse                      0.6211964 9.262516E-4 0.61988646  0.6225063
## r2                       0.8857343  0.00105865 0.88723147  0.8842372
## rmse                    0.78815967  5.87604E-4  0.7873287  0.7889907

Test Set & Submission

predTest <- as.data.table(h2o.predict (finalModel, newdata = testH2o))
predTest[, predict := NULL]

We save the result out of the best model.

Note: To avoid timeout, the size of the test set has been significantly reduced; consequently, the saved result is incomplete. Run the script locally by setting the flag kIsOnKaggle to FALSE.

result <- setNames(
  predTest, 
  c(paste0("class", seq(1, 9))))

fwrite(
  data.table(
    ID = as.vector(testH2o[["ID"]]), 
    result), 
  file = file.path(".", paste0("result-naivebayes_", Sys.Date(), ".csv"), 
                   fsep = .Platform$file.sep), 
  row.names = FALSE,  quote = FALSE)
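As a quick, optional sanity check (not part of the original kernel), one can verify that the submission has the expected shape and that each row of class probabilities sums to roughly one:

# 9 probability columns, named class1 ... class9
stopifnot(identical(colnames(result), paste0("class", 1:9)))

# each row should sum to approximately 1
summary(rowSums(as.matrix(result)))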

Comparing Against Alternative Methods

Here we briefly compare the performance—in terms of both score (multi-class log loss) and training time—of popular algorithms. This comparison should be taken loosely, for we do not optimize the various parameters of each algorithm (e.g., by performing a grid search). Consequently, the results reported here should be seen as an easy-to-reach baseline for each method. (Bear in mind, however, that we did a bit of optimization for the naive Bayes one.)
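For reference, the score is the usual multi-class log loss (computed above with MultiLogLoss, presumably from the MLmetrics package): with $N$ observations, $M = 9$ classes, $y_{ij} = 1$ if observation $i$ belongs to class $j$ (and 0 otherwise), and $p_{ij}$ the predicted probability of that event,

$$
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}.
$$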

The Challengers

Random Forest (RF)

rf <- h2o.randomForest(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  nfolds = nfolds,
  ntrees = ifelse(kIsOnKaggle, 100, 500),
  max_depth = 10,
  seed = 1)
predValidRf <- as.data.table(h2o.predict (rf, newdata = validH2o))
predValidRf[, predict := NULL]
rfScore <- MultiLogLoss(y_pred = as.matrix(predValidRf), y_true = y_true)
rfRunTime <- rf@model$run_time

Gradient Boosting Machine (GBM)

gbm <- h2o.gbm(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  ntrees = ifelse(kIsOnKaggle, 50, 500),
  max_depth = 5,
  seed = 1)
predValidGbm <- as.data.table(h2o.predict (gbm, newdata = validH2o))
predValidGbm[, predict := NULL]
gbmScore <- MultiLogLoss(y_pred = as.matrix(predValidGbm), y_true = y_true)
gbmRunTime <- gbm@model$run_time

Generalized Linear Model (GLM)

glm <- h2o.glm(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  family = "multinomial", 
  seed = 1)
predValidGlm <- as.data.table(h2o.predict (glm, newdata = validH2o))
predValidGlm[, predict := NULL]
glmScore <- MultiLogLoss(y_pred = as.matrix(predValidGlm), y_true = y_true)
glmRunTime <- glm@model$run_time

Deep Learning (DL)

dl <- h2o.deeplearning(
  x = predictors,
  y = response,
  training_frame = trainH2o,
  validation_frame = validH2o,
  standardize = TRUE,
  hidden = c(200, 200),
  epochs = ifelse(kIsOnKaggle, 30, 800),
  activation = "Rectifier", 
  categorical_encoding = "Binary",
  seed = 1)
predValidDl <- as.data.table(h2o.predict (dl, newdata = validH2o))
predValidDl[, predict := NULL]
dlScore <- MultiLogLoss(y_pred = as.matrix(predValidDl), y_true = y_true)
dlRunTime <- dl@model$run_time

Verdict

Summary of the results

##  algo     score runtime
##    NB 17.925188   0.340
##    RF  1.294498   6.978
##   GBM  1.974667  24.586
##   GLM  2.067138   2.563
##    DL  4.328637  13.366

Let’s chart those results (inspired by the “First week progress” section of beluga’s kernel, we use the xkcd style).

[Figure: two xkcd-style charts. “mLogLoss Comparison” shows the score (multi-class log loss; the lower, the better) for each algorithm; “Training Time vs. Score” plots the training time in seconds against the score.]

Those results are not really in favor of naive Bayes… It was, admittedly, the quickest to train; yet the other techniques would certainly remain more appealing in many applications…

License and Source Code

© 2017 Loic Merckel, Apache v2 licensed. The source code is available on both Kaggle and GitHub.

  1. Cohen, Aaron M., and William R. Hersh. “A survey of current work in biomedical text mining.” Briefings in bioinformatics 6.1 (2005): 57-71. 

  2. Hotho, Andreas, Andreas Nürnberger, and Gerhard Paaß. “A brief survey of text mining.” Ldv Forum. Vol. 20. No. 1. 2005. 

  3. Rennie, Jason D., et al. “Tackling the poor assumptions of naive Bayes text classifiers.” Proceedings of the 20th International Conference on Machine Learning (ICML-03). 2003. 

  4. Lewis, David D. “Naive (Bayes) at forty: The independence assumption in information retrieval.” European conference on machine learning. Springer, Berlin, Heidelberg, 1998. 

  5. NCI stands for National Cancer Institute.