When I had some free time, I went straight to the UC Irvine Machine Learning Repository to see if I could find some relatively simple, real-world data to try the technique on. I found a data file on cardiac arrhythmia which seemed about right for a test: not too big, a fair amount of missing data, mostly numeric, and possibly low rank. I know nothing about electrocardiogram procedures, but thanks to Wikipedia I learned that they involve recording data through twelve leads. The data file itself has 279 attributes recorded for 452 individuals. In principle, all 279 attributes should be generated by some mixture of the signals from the twelve leads, so this data should be a low rank matrix, and it should be possible to reconstruct it by one method or another.
First I needed to do some data munging on the file. Starting from code I found at:
http://www.togaware.com/datamining/survivor/Cardiac_Arrhythmia.html
I wrote some R code to label all the variables, remove the non-numeric columns, split the sample into training and test sets, and save the results into RData files.
Setup.R
cardiac <- read.csv("arrhythmia.data", header=F, na.strings="?")
colnames(cardiac)[1:280] <- c("Age","Gender_Nom","Height","Weight","QRS_Dur",
"P-R_Int","Q-T_Int","T_Int","P_Int","QRS","T","P","QRST","J","Heart_Rate",
"Q_Wave","R_Wave","S_Wave","R_Prime","S_Prime","Int_Def","Rag_R_Nom",
"Diph_R_Nom","Rag_P_Nom","Diph_P_Nom","Rag_T_Nom","Diph_T_Nom",
"DII00","DII01","DII02","DII03","DII04","DII05","DII06","DII07","DII08","DII09","DII10","DII11",
"DIII00","DIII01","DIII02","DIII03","DIII04","DIII05","DIII06","DIII07","DIII08","DIII09","DIII10","DIII11",
"AVR00","AVR01","AVR02","AVR03","AVR04","AVR05","AVR06","AVR07","AVR08","AVR09","AVR10","AVR11",
"AVL00","AVL01","AVL02","AVL03","AVL04","AVL05","AVL06","AVL07","AVL08","AVL09","AVL10","AVL11",
"AVF00","AVF01","AVF02","AVF03","AVF04","AVF05","AVF06","AVF07","AVF08","AVF09","AVF10","AVF11",
"V100","V101","V102","V103","V104","V105","V106","V107","V108","V109","V110","V111",
"V200","V201","V202","V203","V204","V205","V206","V207","V208","V209","V210","V211",
"V300","V301","V302","V303","V304","V305","V306","V307","V308","V309","V310","V311",
"V400","V401","V402","V403","V404","V405","V406","V407","V408","V409","V410","V411",
"V500","V501","V502","V503","V504","V505","V506","V507","V508","V509","V510","V511",
"V600","V601","V602","V603","V604","V605","V606","V607","V608","V609","V610","V611",
"JJ_Wave","Q_Wave","R_Wave","S_Wave","R_Prime_Wave","S_Prime_Wave","P_Wave","T_Wave",
"QRSA","QRSTA",
"DII170","DII171","DII172","DII173","DII174","DII175","DII176","DII177","DII178","DII179",
"DIII180","DIII181","DIII182","DIII183","DIII184","DIII185","DIII186","DIII187","DIII188","DIII189",
"AVR190","AVR191","AVR192","AVR193","AVR194","AVR195","AVR196","AVR197","AVR198","AVR199",
"AVL200","AVL201","AVL202","AVL203","AVL204","AVL205","AVL206","AVL207","AVL208","AVL209",
"AVF210","AVF211","AVF212","AVF213","AVF214","AVF215","AVF216","AVF217","AVF218","AVF219",
"V1220","V1221","V1222","V1223","V1224","V1225","V1226","V1227","V1228","V1229",
"V2230","V2231","V2232","V2233","V2234","V2235","V2236","V2237","V2238","V2239",
"V3240","V3241","V3242","V3243","V3244","V3245","V3246","V3247","V3248","V3249",
"V4250","V4251","V4252","V4253","V4254","V4255","V4256","V4257","V4258","V4259",
"V5260","V5261","V5262","V5263","V5264","V5265","V5266","V5267","V5268","V5269",
"V6270","V6271","V6272","V6273","V6274","V6275","V6276","V6277","V6278","V6279",
"Class_Nom"
)
# Keep only the numeric columns (all the nominal columns end in "_Nom")
cardiac_num <- cardiac[, grep("_Nom$", colnames(cardiac), invert=T)]
# Some columns have max and min of zero, so are probably useless.
# It is possible that some groups of columns are actually time plots of
# data that pass through zero. Need to plot.
#cardiac_num <- cardiac_num[, apply(cardiac_num, 2, function(x) ! all(x == 0))]
# Clean out NA's from numeric data
# This reduces from 452 to 68 cases, so must find a way to handle missing
# data.
#cardiac_num <- na.omit(cardiac_num)
# We need to separate training and test sets. Would like the same sample
# each time, so set a seed.
set.seed(1234)
index <- 1:nrow(cardiac)
trainindex <- sample(index, trunc(2 * length(index)/3))
cardiac_train <- cardiac[trainindex, ]
cardiac_test <- cardiac[-trainindex, ]
# Save unmodified, train, and test sets
write.table(cardiac, "cardiac.csv", sep=",", row.names=F)
save(cardiac, file="cardiac.RData", compress=TRUE)
write.table(cardiac_train, "cardiac_train.csv", sep=",", row.names=F)
write.table(cardiac_test, "cardiac_test.csv", sep=",", row.names=F)
# Separate train and test numeric data
cardiac_num_train <- cardiac_num[trainindex, ]
cardiac_num_test <- cardiac_num[-trainindex, ]
# Next we construct a matrix of 0's and 1's, with 1's at points where
# we have data and 0's where we have NA's.
# For the method in the Coursera Machine Learning course, both the
# train and test sets need a matrix like this.
Rtrain <- matrix(1, nrow=nrow(cardiac_num_train), ncol=ncol(cardiac_num_train))
Rtest <- matrix(1, nrow=nrow(cardiac_num_test), ncol=ncol(cardiac_num_test))
# Replace 1's with 0's where there are NA's in the cardiac data
Rtrain[is.na(cardiac_num_train)] <- 0
Rtest[is.na(cardiac_num_test)] <- 0
# Replace NA's with 0 in the numeric data
cardiac_num_train[is.na(cardiac_num_train)] <- 0
cardiac_num_test[is.na(cardiac_num_test)] <- 0
# Write numeric tables
write.table(cardiac_num_train, "cardiac_num_train.csv", sep=",", row.names=F)
write.table(cardiac_num_test, "cardiac_num_test.csv", sep=",", row.names=F)
# Save work in R format
save(cardiac_num_train, cardiac_num_test, Rtrain, Rtest, file="cardiacSetup.rda")
# Clear everything except the training data from the environment
rm(cardiac, cardiac_num, cardiac_test, cardiac_num_test, trainindex, index)
Eyeballing things, it seems as though 10 to 20 features would be a reasonable reduced basis, at least judging from the complete-data cases. This is encouraging, since it does not obviously conflict with the twelve leads used by the apparatus, which is the reason for settling on roughly that number of parameters. It seems we should be able to apply the collaborative filtering method of fitting a small set of parameters to this data to recover the NA values.
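One way to do this sort of eyeballing is to take the complete cases, drop the constant columns, and look at how quickly the singular values fall off. A minimal sketch, assuming the cardiac.RData file written by Setup.R above is in the working directory:

# Scree-style look at the singular values of the complete cases
load("cardiac.RData")
cardiac_num <- cardiac[, grep("_Nom$", colnames(cardiac), invert=TRUE)]
complete <- as.matrix(na.omit(cardiac_num))         # about 68 of the 452 cases
complete <- complete[, apply(complete, 2, sd) > 0]  # drop constant columns
s <- svd(scale(complete))                           # standardize columns first
# Cumulative fraction of variance captured by the leading components
head(cumsum(s$d^2) / sum(s$d^2), 20)
plot(s$d, type="b", xlab="Component", ylab="Singular value")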
So, on to the code which should let us try this in R: first, a file which specifies the cost and gradient functions for optim(), then a script to run them and see whether this approach recovers missing values that we have zeroed out ourselves as a test.
CollaborativeFiltering.R
# Cost and gradient functions for optim(). These are a port of the Coursera
# Machine Learning course topic Collaborative Filtering from MATLAB to R.
# Cost function
coFilterCost <- function(params, npatients, nleads, nreadings, Y, R, lambda)
{
  # params: X and Theta flattened to a single vector. The values in X and
  #   Theta are what we are optimizing, so they must be flattened into one
  #   vector of parameters for optim().
  # npatients, nleads, nreadings: number of patients, leads, and numeric
  #   fields in the data.
  # X: estimated lead contributions to each recorded reading,
  #   nreadings x (nleads + 1); the extra column is the intercept (lead 0).
  # Theta: estimated cardiograph lead readings per patient,
  #   npatients x (nleads + 1).
  # Y: actual patient data, nreadings x npatients.
  # R: matrix with 1's in the positions of Y that hold patient data and
  #   0's elsewhere, nreadings x npatients.
  # lambda: regularization parameter.
  # Add the l[0] term for the y-intercept
  nleads <- nleads + 1
  sx <- nreadings * nleads
  st <- npatients * nleads
  X <- matrix(params[1:sx], nreadings, nleads)
  Theta <- matrix(params[(sx + 1):(sx + st)], npatients, nleads)
  # Must transpose either R and Y or X %*% t(Theta)
  S <- R * (X %*% t(Theta) - Y)
  J <- sum(S^2)/2.0 # Half the sum of squared errors over observed entries
  # Regularization term keeps the parameters from growing without bound
  J <- J + (lambda/2.0)*(sum(Theta^2) + sum(X^2))
  return(J)
}
coFilterGrad <- function(params, npatients, nleads, nreadings, Y, R, lambda) {
  # Add the l[0] term for the y-intercept
  nleads <- nleads + 1
  sx <- nreadings * nleads
  st <- npatients * nleads
  X <- matrix(params[1:sx], nreadings, nleads)
  Theta <- matrix(params[(sx + 1):(sx + st)], npatients, nleads)
  S <- R * (X %*% t(Theta) - Y)
  X_grad <- S %*% Theta + (lambda * X)
  Theta_grad <- t(S) %*% X + (lambda * Theta)
  # Flatten everything to a vector as in octave. May be wrong idea.
  # Note: this must be Theta_grad, not Theta; returning Theta here would
  # hand optim() a bogus gradient for the second matrix.
  grad <- c(as.vector(X_grad), as.vector(Theta_grad))
  return(grad)
}
normalizeY <- function(Y, R) {
  # Averaging a quantity changing through time is doubtful. Some readings
  # might possibly be voltages at intervals or something of the sort.
  # Dividing by max(row value) seemed to cause optim() to zero all values in
  # the Theta matrix. This might be due to linear dependence, since it is
  # multiplying all values by a constant, but I am not sure. Subtracting
  # the average also causes optim() to zero out the Theta matrix as returned
  # in $par, but I should normalize the Y matrix somehow; not quite sure
  # what will work.
  rc <- dim(Y)
  Ynorm <- matrix(0, rc[1], rc[2])
  nrm <- vector(length=rc[1])
  # Need the average of each feature. Y is nreadings x npatients,
  # so average across rows.
  for (i in 1:rc[1]) {
    idx <- which(R[i,] == 1)
    if (length(idx) == 0) {
      nrm[i] <- 0
    }
    else {
      nrm[i] <- mean(Y[i, idx])
    }
    Ynorm[i, idx] <- Y[i, idx] - nrm[i]
  }
  yl <- list(Ynorm, nrm)
  return(yl)
}
restoreY <- function(Ynorm, nrm) {
  rc <- dim(Ynorm)
  Y <- matrix(0, rc[1], rc[2])
  for (i in 1:rc[1]) {
    Y[i,] <- Ynorm[i,] + nrm[i]
  }
  return(Y)
}
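As a quick sanity check that the two helpers above at least agree with each other, normalizeY() followed by restoreY() should reproduce every observed entry of Y exactly, while the unobserved entries come back as the row means. A minimal sketch on a toy matrix:

# Round-trip check: restoreY(normalizeY(Y, R)) should match Y wherever R == 1
source("CollaborativeFiltering.R")
Y <- matrix(c(1, 2, NA, 4, 5, NA), nrow=2)   # two readings, three patients
R <- matrix(as.numeric(!is.na(Y)), nrow=2)   # 1 where observed
Y[is.na(Y)] <- 0
yl <- normalizeY(Y, R)
Yback <- restoreY(yl[[1]], yl[[2]])
all.equal(Yback[R == 1], Y[R == 1])          # should print TRUE

And here is the script which runs the optimization: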
# Try setting up a smaller test set
# Variables to set how large a square of data to test
npatients <- 4
nreadings <- 8
# We are still not converging within 500 iterations. Perhaps we have too many unknowns.
#nleads <- 10
nleads <- 12
load("cardiacSetup.rda")
# Chop a human friendly size chunk out of the data
Y <- as.matrix(cardiac_num_train[1:npatients, 1:nreadings])
#R <- Rtrain[1:npatients, 1:nreadings]
# The machine learning course set Y up as data x source of data (i.e.,
# movie ratings x users). I thought this orientation might be the reason
# optim() was zeroing the parameters in the second matrix Theta, but it
# is probably zeroing linearly dependent columns, maybe.
Y <- t(Y)
# Zero out the same data each time
set.seed(4321)
# There may be a better way to do this, more like MATLAB
#y <- as.vector(Ynorm)
y <- as.vector(Y)
y[runif(length(y)) > 0.5] <- 0
Yn <- matrix(y, nreadings, npatients)
R <- matrix(0, nreadings, npatients)
R[Yn != 0] <- 1
yl <- normalizeY(Yn, R)
# X and Theta are initialized with random data.
# Each reading recorded was created by some mixture of the various leads
X <- matrix(rnorm(nreadings*nleads), nrow=nreadings, ncol=nleads, byrow=T)
# Each patient had a particular set of lead readings
Theta <- matrix(rnorm(npatients*nleads), nrow=npatients, ncol=nleads, byrow=T)
# Must add the y-intercept parameter to X and Theta
X <- cbind(1, X)
Theta <- cbind(1, Theta)
source("CollaborativeFiltering.R")
params <- c(as.vector(X), as.vector(Theta))
# The default method seems to run, but it gives a 1 on convergence rather than a 0.
#p <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, npatients=npatients, nleads=nleads, nreadings=nreadings, Y=Y, R=R, lambda=0)
# Using method="CG"
#p <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, method="CG", npatients=npatients, nleads=nleads, nreadings=nreadings, Y=Y, R=R, lambda=0)
# Using method="BFGS", which seems to converge more reliably.
# pn uses data normalized by subtracting the row mean from each feature.
# This causes optim() to zero out all values in the Theta matrix as returned
# in pn$par, so nothing further can be done with it.
pn <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, method="BFGS", npatients=npatients, nleads=nleads, nreadings=nreadings, Y=yl[[1]], R=R, lambda=0)
p <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, method="BFGS", npatients=npatients, nleads=nleads, nreadings=nreadings, Y=Yn, R=R, lambda=0)
# convergence 1 means that the iteration limit was reached without
# convergence.
sx <- nreadings * (nleads+1)
st <- npatients * (nleads+1)
Xp <- matrix(p$par[1:sx], nreadings, (nleads+1))
Thetap <- matrix(p$par[(sx + 1):(sx + st)], npatients, (nleads+1))
Yp <- Xp %*% t(Thetap)
cat("Convergence:\n")
cat(p$convergence)
cat("\n")
cat("MSE:\n")
# Mean squared error over the full chunk (nreadings x npatients entries)
cat(sum((Yp - Y)^2)/(nreadings*npatients))
cat("\n")
Regrettably, none of this worked quite as I had hoped. First, I struggled with optim() just to understand how to use it with two entire matrices as parameters; I decided to flatten everything into a single vector, just as I had in Octave. Then, much more confusingly, I found that optim() was zeroing out the entire second matrix. Since the method requires multiplying X %*% t(Theta) to regenerate the data matrix, this would not work at all. When I fed the data in without any normalization (which the course lectures called for), I got back two usable matrices; but whenever I touched the data in any way, either by dividing each feature by its maximum or by subtracting its average, the second matrix came back filled with zeros. This is quite useless for a method that requires multiplying one matrix by the other.
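One way to track down this sort of failure is to compare the analytic gradient against a finite-difference approximation of the cost before handing both to optim(). The sketch below is my own diagnostic, not part of the course code; checkGrad() is a hypothetical helper, and the tiny random problem it runs on is arbitrary:

# Compare coFilterGrad() against a central-difference approximation of
# coFilterCost(); a large discrepancy points at a bug in the gradient.
source("CollaborativeFiltering.R")
checkGrad <- function(params, eps=1e-4, ...) {
  numgrad <- vapply(seq_along(params), function(i) {
    e <- rep(0, length(params))
    e[i] <- eps
    (coFilterCost(params + e, ...) - coFilterCost(params - e, ...)) / (2 * eps)
  }, numeric(1))
  max(abs(numgrad - coFilterGrad(params, ...)))  # should be near zero
}
set.seed(1)
np <- 3; nl <- 2; nr <- 4
Y <- matrix(rnorm(nr * np), nr, np)
R <- matrix(rbinom(nr * np, 1, 0.7), nr, np)
params <- rnorm(nr * (nl + 1) + np * (nl + 1))
checkGrad(params, npatients=np, nleads=nl, nreadings=nr, Y=Y, R=R, lambda=1)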
Since I wasn't having success in R, I decided to return to the original Octave code and see whether the method worked on this data there. Unfortunately, it did not work any better. The code ran and produced a matrix of the proper dimensions (cardiac readings by patients), but at many points in the regenerated matrix the numbers were not even of the same order of magnitude as the originals.
Some things which occur to me:
The movie score matrix was very simple: integers between 1 and 5, with many missing data points. More importantly, the problem being solved wasn't actually restoring data; it was yet another classification problem. A movie in that setting can belong to one of two classes, interesting or uninteresting, so we never need to reconstruct an unknown score accurately, only to sort each movie into one class or the other. I hadn't been aware that I was actually solving a classification problem while I was working through the code for class. In fact, judging by the course material, classification seems to be nearly the only sort of problem the Machine Learning course concerns itself with, which is also something I had not noticed until I started working on this data set. Possibly this method could be improved by adding another parameter, like the lambda, to push down on the squared-error part of the cost, but it would probably be best to try another approach. I have read of a method that reconstructs a matrix by minimizing the sum of the singular values returned by svd() (the nuclear norm). That seems more promising; I might try it, as well as a neural net and K-means on this data set, if I get some more free time.
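For reference, a simple approximate version of that nuclear norm idea is an iterative soft-thresholded SVD: fill in the missing entries with the current estimate, shrink the singular values toward zero, put the observed entries back, and repeat. This is only a rough sketch of the idea, not a tested implementation; softImputeSVD() is my own name, and the threshold tau and iteration count are arbitrary, untuned choices:

# Matrix completion by iteratively soft-thresholding the singular values,
# which approximately minimizes the nuclear norm (sum of singular values)
# subject to fitting the observed entries.
# Y: data with missing entries set to 0; R: 1 where observed, 0 elsewhere.
softImputeSVD <- function(Y, R, tau=1, iters=100) {
  Z <- Y * R                         # start with missing entries at zero
  for (k in 1:iters) {
    s <- svd(Z)
    d <- pmax(s$d - tau, 0)          # shrink the singular values toward zero
    Z <- s$u %*% diag(d) %*% t(s$v)  # low nuclear norm reconstruction
    Z[R == 1] <- Y[R == 1]           # put the observed data back
  }
  Z
}
# For example, on the training chunk saved by Setup.R:
load("cardiacSetup.rda")
Yhat <- softImputeSVD(as.matrix(cardiac_num_train), Rtrain, tau=5, iters=50)

Whether a single threshold makes sense across features with such different scales is doubtful, so the columns would probably need to be standardized first, which runs into the same normalization questions as before.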