A personal blog about my fumblings with statistics, finance and anything R


Built using jekyll and hyde

A data.table Tutorial

This post is more for my learning but hopefully helps someone get acquainted with data.table() as well.

I’ll use the PimaIndiansDiabetes dataset from the mlbench package. By the way I really like this package. It has a lot of databases from the UC Irvine Machine Learning Database and is an excellent source of data for doing some data analysis and implementing some machine learning algorithms.

I’ll structure this tutorial using assignment style Questions and Answers.

library(data.table)
library(mlbench)
library(ggplot2)
library(magrittr)

Basic structure

Here’s how I understand the basic setup for using an object of type data.table.

# mat is of type data.table
mat[Filter, Select, Group By]

Let’s see some examples

Creating a data table

data("PimaIndiansDiabetes")
ds <- data.table(PimaIndiansDiabetes)

# Since a data table supports all functions that be
# used on a dataframe we can do this
str(ds)
## Classes 'data.table' and 'data.frame':	768 obs. of  9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>

Filtering or Subsetting

One thing I always struggle with is subsetting data quickly and efficiently. data.table is awesome at that.

Q: How many people above the age of 50 had diabetes?

subset <- ds[age > 50 & diabetes == "pos", ]
nrow(subset)
## [1] 38

Q: How many people with a plasma glucose level range of [120 - 150] had diabetes?

subset <- ds[glucose %in% 120:150 & diabetes == "pos"]
nrow(subset)
## [1] 93

Setting a Key

data.table can utilize binary search to filter rows if the data.table object is sorted. This is done by setting a key.

# Check for a key
haskey(ds)
## [1] FALSE
# Set a key
setkey(ds, diabetes)

# Note that after setting a key, the entire dataset is
# sorted using that key (in this case glucose)

head(ds)
##    pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1: 1 85 66 29 0 26.6 0.351 31 neg
## 2: 1 89 66 23 94 28.1 0.167 21 neg
## 3: 5 116 74 0 0 25.6 0.201 30 neg
## 4: 10 115 0 0 0 35.3 0.134 29 neg
## 5: 4 110 92 0 0 37.6 0.191 30 neg
## 6: 10 139 80 0 0 27.1 1.441 57 neg

Setting a key will also allow for the .() operator to be used.

Using the dot operator to subset

setkey(ds, age, diabetes)
subset <- ds[.(50, "pos"),]
head(subset)
##    pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1: 6 148 72 35 0 33.6 0.627 50 pos
## 2: 11 138 74 26 144 36.1 0.557 50 pos
## 3: 9 112 82 24 0 28.2 1.282 50 pos
## 4: 6 147 80 0 0 29.5 0.178 50 pos
## 5: 6 162 62 0 0 24.3 0.178 50 pos
nrow(subset)
## [1] 5

Selecting columns

We’ll now move on to the second argument in the data.table() framework.

Q: Whats the range of Plasma Glucose Concentration of people who had diabetees and were 21 years old?

# Note that we have set a key
key(ds)
## [1] "age"      "diabetes"
# We can use the key to answer this question
# Note the use of the pipe operator
ds[.(21, "pos"), glucose] %>% range # super clean and easy
## [1] 113 177

This kind of subsetting and column selection can come in handy when trying to make some exploratory charts. Say we want to make a simple scatterplot to compare glucose vs pressure for people in the age bracket of 25 to 30.

ds[.(25:30), .(glucose, pressure, diabetes)] %>% 
reshape2::melt() %>%
ggplot(aes(variable, value, fill = variable)) +
geom_boxplot() +
facet_grid(~ diabetes) +
ggtitle("Glucose vs Pressure")

center

Easy isn’t it?

Q: What’s the average value oftriceps skin fold thickness for people in the age bracket of [20, 30] who have diabetes?

# Note that this returns a locical vector of TRUE / FALSE values
# since we can only subset based on rows and not columns
ds[.(20:30, "pos"), triceps] %>% mean(na.rm = T)
## [1] 25.22222

Grouping

Grouping the result of filtering rows and selecting columns by specific variables and being able to do computations on is very useful.

Q: What was the average body mass index of people in the age group of [21, 30], had diabetes grouped by the variable pregnant?

# Note the use of the by argument
ds[.(21:30, "pos"), mean(mass), by = pregnant]
##     pregnant       V1
## 1: 0 40.60000
## 2: 1 41.29091
## 3: 2 34.99286
## 4: 4 34.92500
## 5: 3 32.10000
## 6: 8 28.40000
## 7: 5 39.43333
## 8: 9 29.00000
## 9: 6 31.50000
## 10: 7 26.50000
## 11: 10 0.00000

We can use this to create some charts as well.

ds$AgeInterval <- cut(ds$age, breaks = seq(21, 81, by = 10), include.lowest = T)

cols <- c("#00cc66", "#ff884d")

ds[,.(mean(glucose), mean(pressure)), by = .(AgeInterval, diabetes), nomatch = 0L] %>%
reshape2::melt() %>%
ggplot(aes(AgeInterval, value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ diabetes) +
scale_fill_manual(values = cols, labels = c("Glucose", "Pressure")) +
ggtitle("Avg. Glucose and Pressure grouped by age group")

center

For more details visit the data.table vignette.