# R²

A personal blog about my fumblings with statistics, finance and anything R

# A data.table Tutorial

This post is mostly for my own learning, but hopefully it helps someone else get acquainted with the data.table package as well.

I’ll use the PimaIndiansDiabetes dataset from the mlbench package. By the way, I really like this package. It bundles a lot of datasets from the UCI Machine Learning Repository and is an excellent source of data for exploratory analysis and for trying out machine learning algorithms.
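Here is a minimal sketch of loading the data, assuming both data.table and mlbench are installed. The object name `pima` is my own choice for illustration.

```r
library(data.table)
library(mlbench)

# Load the data frame shipped with mlbench and convert it to a data.table
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# Inspect the columns: pregnant, glucose, pressure, triceps,
# insulin, mass, pedigree, age and diabetes (a factor: "neg"/"pos")
str(pima)
```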

I’ll structure this tutorial as assignment-style questions and answers.

### Basic structure

Here’s how I understand the basic setup for using an object of type data.table.

• Filter (i): Select specific rows
• Select (j): Select specific columns, or compute on them
• Group By (by): Return the result of filtering and selecting, grouped by some categorical variable
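The three pieces above map onto the three arguments inside the square brackets. A small sketch, where `pima` and the age cutoff of 40 are just illustrative choices:

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# General form: DT[i, j, by]
#   i  : filter rows
#   j  : select or compute on columns
#   by : group the result
res <- pima[age > 40,                           # i: rows where age > 40
            .(mean_glucose = mean(glucose)),    # j: one computed column
            by = diabetes]                      # by: one row per diabetes level
print(res)
```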

Let’s see some examples.

### Filtering or Subsetting

One thing I always struggle with is subsetting data quickly and efficiently. data.table is awesome at that.

Q: How many people above the age of 50 had diabetes?
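One way to answer this, as a sketch: filter in `i` and count the matching rows with the special symbol `.N` in `j`. I am assuming "had diabetes" means `diabetes == "pos"`.

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# Rows where age is above 50 and the diabetes test was positive; .N counts them
n_over_50 <- pima[age > 50 & diabetes == "pos", .N]
print(n_over_50)
```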

Q: How many people with a plasma glucose level range of [120 - 150] had diabetes?
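A possible answer, assuming the range is meant to be inclusive at both ends. data.table's `%between%` operator is inclusive by default, so it fits that reading.

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# glucose in [120, 150] (both ends included) and a positive diabetes test
n_glucose <- pima[glucose %between% c(120, 150) & diabetes == "pos", .N]
print(n_glucose)
```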

### Setting a Key

data.table can utilize binary search to filter rows if the data.table object is sorted. This is done by setting a key.

Setting a key also lets you pass key values directly in the first argument using .(), so rows are looked up by key instead of by a logical condition.
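A short sketch of keying on the `diabetes` column and then subsetting by key value. Choosing `diabetes` as the key is my own assumption for illustration.

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# setkey() sorts the table by the key column (by reference, no copy)
setkey(pima, diabetes)
print(key(pima))

# With a key set, .() in i looks up rows by key value via binary search
pos <- pima[.("pos")]
print(nrow(pos))
```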

### Selecting columns

We’ll now move on to the second argument inside the square brackets, j, which selects columns or computes on them.

Q: What’s the range of plasma glucose concentration for people who had diabetes and were 21 years old?
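A sketch of one answer: filter in `i`, then call `range()` on the column in `j`. Column names can be used directly inside the brackets.

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# Minimum and maximum glucose among 21-year-olds with a positive test
glucose_range <- pima[age == 21 & diabetes == "pos", range(glucose)]
print(glucose_range)
```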

This kind of subsetting and column selection can come in handy when trying to make some exploratory charts. Say we want to make a simple scatterplot to compare glucose vs pressure for people in the age bracket of 25 to 30.
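Something along these lines should work, using base graphics and assuming the 25 to 30 bracket is inclusive. The axis labels and title are my own.

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# Keep only the two columns we want to plot, for ages 25 to 30
sub <- pima[age %between% c(25, 30), .(glucose, pressure)]

plot(sub$glucose, sub$pressure,
     xlab = "Plasma glucose concentration",
     ylab = "Diastolic blood pressure",
     main = "Glucose vs pressure, ages 25 to 30")
```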

Easy, isn’t it?

Q: What’s the average triceps skin fold thickness for people in the age bracket [20, 30] who have diabetes?
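One possible answer, again treating the bracket as inclusive and "have diabetes" as `diabetes == "pos"`:

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# Mean of the triceps column over the filtered rows
mean_triceps <- pima[age %between% c(20, 30) & diabetes == "pos", mean(triceps)]
print(mean_triceps)
```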

### Grouping

Being able to filter rows, select columns, and then run a computation separately for each group of some variable is very useful.

Q: What was the average body mass index of people in the age group [21, 30] who had diabetes, grouped by the variable pregnant?
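A sketch that uses all three arguments together. The result column name `mean_mass` is my own, and the trailing `[order(pregnant)]` is only there to sort the output.

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# i: ages 21 to 30 with a positive test
# j: mean body mass index (the mass column)
# by: one row per value of pregnant
by_preg <- pima[age %between% c(21, 30) & diabetes == "pos",
                .(mean_mass = mean(mass)),
                by = pregnant][order(pregnant)]
print(by_preg)
```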

We can use this to create some charts as well.
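For instance, a grouped result can be passed straight to a base R bar chart. This is just one way to do it, with labels of my own choosing.

```r
library(data.table)
library(mlbench)
data(PimaIndiansDiabetes)
pima <- as.data.table(PimaIndiansDiabetes)

# Average body mass index per number of pregnancies, ages 21 to 30, positive test
by_preg <- pima[age %between% c(21, 30) & diabetes == "pos",
                .(mean_mass = mean(mass)),
                by = pregnant][order(pregnant)]

barplot(by_preg$mean_mass,
        names.arg = by_preg$pregnant,
        xlab = "Number of pregnancies",
        ylab = "Average body mass index")
```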

For more details visit the data.table vignette.