Note: this post is a replication of our kernel published on Kaggle---Fully Fledged xkcd Theming (for RF, GBM, GLM & DL).

Although this kernel focuses on a topic that is not directly related to data science (at least not in an obvious way), I still think it might be of interest to some folks here.

A way to use the xkcd theme on Kaggle is detailed. In particular, a trick to use any font not available in Kaggle's script environment is described.

The idea first came to me when I skimmed through beluga's nice kernel, whose rather bold opening section, titled First week progress, showed the xkcd style shining at just the right intensity, without dazzling. (But the font was wrong!)

This Taxi competition is (i) ongoing and (ii) a kernel competition; therefore it might be the right place to present a tool to enhance kernels.

Besides, the application section presents a comparison of how well some popular algorithms perform at solving this problem (basically, the suite of tools offered by the H2O framework for regression). Unlike our previous comparison of this sort, NCI Thesaurus & Naive Bayes (vs RF, GBM, GLM & DL), where naive Bayes (NB) appeared a clear loser, here the outcome is less clear-cut. GLM demonstrates lower performance RMSLE-wise than its counterparts; however, it is markedly quicker to train. (NB is absent here because it is geared toward classification problems; there are attempts described in the literature at using NB for regression problems, though…)

The xkcd Theme

How-To

A quick search revealed that there exists an xkcd package for R that does just that. When I tried, however, everything went seemingly smoothly until I realized that the xkcd font was missing from the Kaggle environment…

My rather brief persistence in trying to install it using the extrafont package was in vain, and was soon interrupted by the simpler idea of using the SVG format and the CSS @font-face capability in concert with the svglite package.

First, the font is converted to the WOFF format (we used Font Squirrel for that purpose) and embedded as a Base64-encoded string.

<style type="text/css">
@font-face {
    font-family: 'xkcd';
    src: url(data:application/font-woff;charset=utf-8;base64, ...) format('woff');
    font-weight: normal;
    font-style: normal;
}
</style>
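The Base64 payload itself can be generated in R; below is a minimal sketch, assuming the WOFF file is named xkcd.woff and relying on the base64enc package (the string could of course be produced by other means):

# Sketch: encode the WOFF file as a Base64 string to paste after "base64," above.
# The file name xkcd.woff is an assumption.
b64 <- base64enc::base64encode("xkcd.woff")
cat(substr(b64, 1, 60), "...")  # the full string goes into the CSS data URL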

Second, knitr is configured to generate images as SVG. Besides, we assume that the three packages svglite, xkcd and (optionally) xkcdcolors are installed and loaded.

knitr::opts_chunk$set(
  dev = "svglite",
  fig.ext = "svg"
)
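The loading itself could look as follows (a sketch; ggplot2 and htmltools are included as well, since they are used in the chunks below):

# Packages assumed to be installed on the kernel (see the paragraph above).
library(svglite)
library(xkcd)
library(xkcdcolors)  # optional
library(ggplot2)
library(htmltools)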

Finally, the function svgstring (which takes the dimensions of the image as parameters) is used to inline the SVG in the HTML document. That is to say, a chunk previously written as

```{r echo=TRUE, warning=FALSE, results='show', message=FALSE, fig.width=10, fig.height=5}
ggplot()
```

becomes

```{r echo=TRUE, warning=FALSE, results='show', message=FALSE}
s <- svgstring(width = 10, height = 5)
ggplot()
invisible(dev.off())
htmltools::HTML(s())
```
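For completeness, here is a minimal sketch of a chunk body that actually applies the theme; theme_xkcd comes from the xkcd package, and setting the text family to "xkcd" makes svglite write that family name into the SVG, which the browser then resolves against the @font-face declaration above (the mtcars plot is purely illustrative):

s <- svgstring(width = 10, height = 5)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_xkcd() +                                 # xkcd look and feel
  theme(text = element_text(family = "xkcd"))    # matches the @font-face family
invisible(dev.off())
htmltools::HTML(s())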

When to Use

The xkcd style is not only amusing and a mark of good taste; it also sets the tone of the argument being made (much as a well-chosen emoticon can drastically influence the interpretation of a message).

I found this style quite well suited for the kernel NCI Thesaurus & Naive Bayes (vs RF, GBM, GLM & DL), where a naive Bayes approach is contrasted with other popular methods, illustrating, without much room for ambiguity, the inferiority of the approach1. Such a finding, as objective and rational as it can be, could be perceived as an abject smear campaign by proponents of the method; and it could easily hurt the sensibilities of Thomas Bayes' disciples and/or naive Bayes activists (two groups that are, by my reckoning, both large; even with a substantial intersection between them, their union is larger still, which is to say it is wiser not to mess with them). Given this extra-sensitive context, and not being that temerarious, relying on the xkcd theme to mitigate the exposure to undesirable outcomes appeared adequate.

To generalize, any tendentious claim that has the potential either (i) to bring down the wrath of a large group or (ii) to eventually become embarrassing for the author2 had better use such a theme. That way, should the whole thing go south, the humor card can be played…

Application: Comparing Algorithms

To illustrate the first section, we compare how well some popular methods perform on the Taxi dataset (as we did in the NCI Thesaurus & Naive Bayes (vs RF, GBM, GLM & DL) kernel mentioned above). Furthermore, this reinforces the relevance of this kernel here.

Feature Tinkering

This part is identical to the section of the same name in the previous kernel Autoencoder and Deep Features.

[Figure: trip duration against distance as the crow flies, annotated "The longer the distance, the more time... Smart!"]

The predictors are: vendor_id, passenger_count, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, store_and_fwd_flag, distance_as_the_crow_flies, wday, hour (there are 10 features); we ignore the columns id, trip_duration, pickup_datetime (toIgnore).
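Since the feature tinkering itself is not repeated here, below is a hedged sketch (not the kernel's exact code) of how the engineered columns and the column roles above could be obtained, assuming the raw data sits in a data.frame named train:

# Hedged sketch: derive distance_as_the_crow_flies, wday and hour, then
# declare the column roles listed above. Variable names (train, toIgnore,
# predictors, response) are assumptions based on the text.
haversineKm <- function(lon1, lat1, lon2, lat2) {
  toRad <- pi / 180
  dLon <- (lon2 - lon1) * toRad
  dLat <- (lat2 - lat1) * toRad
  a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * 6371 * asin(sqrt(pmin(1, a)))  # Earth radius of ~6371 km
}

train$distance_as_the_crow_flies <- haversineKm(
  train$pickup_longitude, train$pickup_latitude,
  train$dropoff_longitude, train$dropoff_latitude)
train$wday <- as.factor(weekdays(as.POSIXct(train$pickup_datetime, tz = "UTC")))
train$hour <- as.integer(format(as.POSIXct(train$pickup_datetime, tz = "UTC"), "%H"))

toIgnore <- c("id", "trip_duration", "pickup_datetime")
response <- "trip_duration"
predictors <- setdiff(colnames(train), toIgnore)  # the 10 features listed above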

The Algorithms

Here we briefly compare the performance of popular algorithms, in terms of both score (RMSLE) and time required for training. This comparison has to be taken loosely, for we do not optimize the various parameters of each algorithm (e.g., by performing a grid search). Consequently, we contend that the results yielded here are an easy-to-achieve minimum for each method.
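The kernel's scoring code is not reproduced here; the following is a hedged sketch of one way such numbers could be obtained (the helper name rmsle is an assumption; predictions are taken on the validH2o frame used by the models below):

# Hedged sketch: RMSLE of a fitted H2O model on a given frame.
rmsle <- function(model, frame, response) {
  pred <- as.data.frame(h2o.predict(model, frame))[["predict"]]
  actual <- as.data.frame(frame[, response])[[1]]
  pred <- pmax(pred, 0)  # guard against negative predictions before taking logs
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}

# Training time can be captured by wrapping each call, e.g.:
# rfTime <- system.time(rf <- h2o.randomForest(...))[["elapsed"]]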

Random Forest (RF)

rf <- h2o.randomForest(
  x = predictors, y = response, 
  training_frame = trainH2o, validation_frame = validH2o,
  ntrees = ifelse(kIsOnKaggle, 100, 500), max_depth = 10, seed = 1)

Gradient Boosting Machine (GBM)

gbm <- h2o.gbm(
  x = predictors, y = response, 
  training_frame = trainH2o, validation_frame = validH2o,
  ntrees = ifelse(kIsOnKaggle, 50, 500), max_depth = 5, seed = 1)

Generalized Linear Model (GLM)

glm <- h2o.glm(
  x = predictors, y = response, 
  training_frame = trainH2o, validation_frame = validH2o,
  family = "gaussian", seed = 1)

Deep Learning (DL)

dl <- h2o.deeplearning(
  x = predictors, y = response, 
  training_frame = trainH2o, validation_frame = validH2o,
  standardize = TRUE, hidden = c(10, 10), epochs = 70, 
  activation = "Rectifier", seed = 1)

Results

[Figure: two panels. Left, "RMSLE Comparison": score (the less, the better) for each algorithm (GBM, DL, RF, GLM). Right, "Training Time vs. Score": score against training time in seconds for GLM, RF, DL and GBM, annotated "The winners...".]
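As a closing tie-in with the first section, here is a hedged sketch of how the left panel above could be drawn with the xkcd machinery set up earlier (the rmsle helper and the model objects are those from the previous subsections; this is not the kernel's plotting code):

scores <- data.frame(
  algorithm = c("RF", "GBM", "GLM", "DL"),
  score = c(rmsle(rf, validH2o, response), rmsle(gbm, validH2o, response),
            rmsle(glm, validH2o, response), rmsle(dl, validH2o, response)))

s <- svgstring(width = 10, height = 5)
ggplot(scores, aes(x = algorithm, y = score)) +
  geom_col() +
  labs(x = "Algorithm", y = "Score (the less, the better)",
       title = "RMSLE Comparison") +
  theme_xkcd() +
  theme(text = element_text(family = "xkcd"))
invisible(dev.off())
htmltools::HTML(s())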

License and Source Code

© 2017 Loic Merckel, Apache v2 licensed. The source code is available on both Kaggle and GitHub.

  1. Given the specific dataset and the quite shallow feature tinkering. 

  2. As a case in point, imagine a situation where beluga ends up lagging behind in the leaderboard…