Tuesday, June 24, 2014

Trying out some techniques from Coursera Machine Learning Class

I recently finished Andrew Ng's Machine Learning course on Coursera. The last topic he covered was recommender systems. He described a technique that multiplies a matrix of user preferences by a matrix of movie features to predict scores for movies a user has not rated, which lets the system recommend movies the user has not seen.  It immediately occurred to me that this should be a general method for filling in missing data in many sorts of situations, and I decided to investigate the technique further after the class ended: find some relatively simple data file and see how one would go about applying the technique to it.  During a lecture, Professor Ng mentioned that the technique he was describing was known as Low Rank Matrix Factorization.  A bit of web searching turned up more information under "Low Rank Matrix Reconstruction."
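In matrix terms (my paraphrase of the course material), the idea is that the ratings matrix is approximately the product of two much thinner matrices,

$$ Y \approx X\,\Theta^{T}, $$

where \(Y\) is movies by users, \(X\) holds a small number of feature values per movie, and \(\Theta\) holds the corresponding weights per user. Once \(X\) and \(\Theta\) are fit to the entries of \(Y\) that are known, their product fills in the entries that are not.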

When I had some free time, I went straight to the UC Irvine Machine Learning Repository to see if I could find some relatively simple, real-world data to try the technique on.  I found a data file on cardiac arrhythmia which seemed about right for a test: not too big, a fair amount of missing data, mostly numeric, and possibly low rank.  I know nothing about electrocardiogram procedures, but thanks to Wikipedia I learned that they record data through twelve leads.  The data file itself has 279 attributes recorded for 452 individuals.  In principle, all 279 attributes should be generated by some mixture of the signals from the twelve leads, so the matrix should be low rank and therefore a candidate for reconstruction by one method or another.

First I needed to do some data munging on the file. Beginning with code I found at:

http://www.togaware.com/datamining/survivor/Cardiac_Arrhythmia.html


I wrote some code in R to label all the variables, remove the non-numeric columns, split the sample into training and test sets, and save the results as CSV and RData files.

Setup.R

cardiac <- read.csv("arrhythmia.data", header=F, na.strings="?")
colnames(cardiac)[1:280] <- c("Age","Gender_Nom","Height","Weight","QRS_Dur",
"P-R_Int","Q-T_Int","T_Int","P_Int","QRS","T","P","QRST","J","Heart_Rate",
"Q_Wave","R_Wave","S_Wave","R_Prime","S_Prime","Int_Def","Rag_R_Nom",
"Diph_R_Nom","Rag_P_Nom","Diph_P_Nom","Rag_T_Nom","Diph_T_Nom",
"DII00", "DII01","DII02", "DII03", "DII04","DII05","DII06","DII07","DII08","DII09","DII10","DII11",
"DIII00","DIII01","DIII02", "DIII03", "DIII04","DIII05","DIII06","DIII07","DIII08","DIII09","DIII10","DIII11",
"AVR00","AVR01","AVR02","AVR03","AVR04","AVR05","AVR06","AVR07","AVR08","AVR09","AVR10","AVR11",
"AVL00","AVRL1","AVL02","AVL03","AVL04","AVL05","AVL06","AVL07","AVL08","AVL09","AVL10","AVL11",
"AVF00","AVF01","AVF02","AVF03","AVF04","AVF05","AVF06","AVF07","AVF08","AVF09","AVF10","AVF11",
"V100","V101","V102","V103","V104","V105","V106","V107","V108","V109","V110","V111",
"V200","V201","V202","V203","V204","V205","V206","V207","V208","V209","V210","V211",
"V300","V301","V302","V303","V304","V305","V306","V307","V308","V309","V310","V311",
"V400","V401","V402","V403","V404","V405","V406","V407","V408","V409","V410","V411",
"V500","V501","V502","V503","V504","V505","V506","V507","V508","V509","V510","V511",
"V600","V601","V602","V603","V604","V605","V606","V607","V608","V609","V610","V611",
"JJ_Wave","Q_Wave","R_Wave","S_Wave","R_Prime_Wave","S_Prime_Wave","P_Wave","T_Wave",
"QRSA","QRSTA",
"DII170","DII171","DII172","DII173","DII174","DII175","DII176","DII177","DII178","DII179",
"DIII180","DIII181","DIII182","DIII183","DIII184","DIII185","DIII186","DIII187","DIII188","DIII189",
"AVR190","AVR191","AVR192","AVR193","AVR194","AVR195","AVR196","AVR197","AVR198","AVR199",
"AVL200","AVL201","AVL202","AVL203","AVL204","AVL205","AVL206","AVL207","AVL208","AVL209",
"AVF210","AVF211","AVF212","AVF213","AVF214","AVF215","AVF216","AVF217","AVF218","AVF219",
"V1220","V1221","V1222","V1223","V1224","V1225","V1226","V1227","V1228","V1229",
"V2230","V2231","V2232","V2233","V2234","V2235","V2236","V2237","V2238","V2239",
"V3240","V3241","V3242","V3243","V3244","V3245","V3246","V3247","V3248","V3249",
"V4250","V4251","V4252","V4253","V4254","V4255","V4256","V4257","V4258","V4259",
"V5260","V5261","V5262","V5263","V5264","V5265","V5266","V5267","V5268","V5269",
"V6270","V6271","V6272","V6273","V6274","V6275","V6276","V6277","V6278","V6279",
"Class_Nom"
)
cardiac_num <- cardiac[, grep("Nom$", colnames(cardiac), invert=TRUE)]
# Some columns have max and min of zero, so are probably useless.
# It is possible that some groups of columns are actually time plots of
# data that pass through zero. Need to plot.
#cardiac_num <- cardiac_num[, apply(cardiac_num, 2, function(x) ! all(x == 0))]
# Clean out NA's from numeric data
# This reduces from 452 to 68 cases, so must find a way to handle missing
# data.
#cardiac_num <- na.omit(cardiac_num)
# We need to separate training and test sets. Would like the same sample
# so set a seed.
set.seed(1234)
index <- 1:nrow(cardiac)
trainindex <- sample(index, trunc(2 * length(index)/3))
cardiac_train <- cardiac[trainindex, ]
cardiac_test <- cardiac[-trainindex, ]
# Save unmodified, train, and test sets
write.table(cardiac, "cardiac.csv", sep=",", row.names=F)
save(cardiac, file="cardiac.RData", compress=TRUE)
write.table(cardiac_train, "cardiac_train.csv", sep=",", row.names=F)
write.table(cardiac_test, "cardiac_test.csv", sep=",", row.names=F)
# Separate train and test numeric data
cardiac_num_train <- cardiac_num[trainindex, ]
cardiac_num_test <- cardiac_num[-trainindex, ]
# Next we must construct a matrix of 0's and 1's with 1's at points where
# we have data, and 0's where we have NA's
# For the method in Coursera Machine Learning, we will need both
# train and test sets to have a matrix like this.
r <- rep(1, times=(nrow(cardiac_num_train)*ncol(cardiac_num_train)))
Rtrain <- matrix(r, nrow=nrow(cardiac_num_train), ncol=ncol(cardiac_num_train))
r <- rep(1, times=(nrow(cardiac_num_test)*ncol(cardiac_num_test)))
Rtest <- matrix(r, nrow=nrow(cardiac_num_test), ncol=ncol(cardiac_num_test))
# Replace 1's with 0's where there are NA's in cardiac data
Rtrain[is.na(cardiac_num_train)] <- 0
Rtest[is.na(cardiac_num_test)] <- 0
# Replace NA's with 0 for numeric data
cardiac_num_train[is.na(cardiac_num_train)] <- 0
cardiac_num_test[is.na(cardiac_num_test)] <- 0
# Write numeric tables
write.table(cardiac_num_train, "cardiac_num_train.csv", sep=",", row.names=F)
write.table(cardiac_num_test, "cardiac_num_test.csv", sep=",", row.names=F)
# Save work in R format
save(cardiac_num_train, cardiac_num_test, Rtrain, Rtest, file="cardiacSetup.rda")
# Clear objects we no longer need from the environment
rm(cardiac, cardiac_num, cardiac_test, cardiac_num_test, trainindex, index, r)
First, I tried running svd() on the dataset with every row containing an NA removed. That left only 68 complete rows, but it seemed likely to give me some idea of how many features an optimal basis would need.
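Roughly, the check looked like this (a sketch assuming the files written by Setup.R above are in the working directory; these are not the exact commands I ran):

load("cardiac.RData")
cardiac_num <- cardiac[, grep("Nom$", colnames(cardiac), invert=TRUE)]
complete <- as.matrix(na.omit(cardiac_num))   # only the 68 complete rows survive
s <- svd(complete)
# Plot the singular values and eyeball where they fall off
plot(s$d, type="b", xlab="component", ylab="singular value")
# Cumulative share of sum(d^2) captured by the first k components
head(cumsum(s$d^2) / sum(s$d^2), 20)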



Eyeballing the singular values, it seems as though 10 to 20 features would be a reasonable reduced basis, at least judging from the complete cases.  This is encouraging, since it does not obviously conflict with the number of leads used in the apparatus, which is the physical reason for settling on roughly that number of parameters.  It seems that we should be able to fit a small set of parameters to this data and use them to recover the NA values.

So on to the code that lets us try this in R: first, a file which defines the cost and gradient functions for optim(), then a script to run them and check whether the approach recovers values we have deliberately zeroed out as a test.
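For reference, the quantities the functions below compute are the course's regularized collaborative filtering cost and its gradients, written here in matrix form, with \(\circ\) denoting the elementwise product:

$$ J(X,\Theta) = \frac{1}{2}\sum_{(i,j)\,:\,R_{ij}=1}\left((X\Theta^{T})_{ij}-Y_{ij}\right)^{2} + \frac{\lambda}{2}\left(\sum_{i,k}X_{ik}^{2}+\sum_{j,k}\Theta_{jk}^{2}\right) $$

$$ \nabla_{X}J = \left(R\circ(X\Theta^{T}-Y)\right)\Theta+\lambda X, \qquad \nabla_{\Theta}J = \left(R\circ(X\Theta^{T}-Y)\right)^{T}X+\lambda\Theta $$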

CollaborativeFiltering.R


# Cost and gradient functions for optim(). These are a port of the Coursera
# Machine Learning course topic Collaborative Filtering from MATLAB/Octave
# to R.

# Cost function
coFilterCost <- function(params, npatients, nleads, nreadings, Y, R, lambda)
{
    # params: X and Theta flattened to a vector. The values in X and Theta are
    #   what we are optimizing, so they need to be flattened into a single
    #   vector of parameters for optim().
    # npatients, nleads, nreadings: number of patients, leads, and numeric
    #   fields in the data
    # X: estimated lead contributions to each recorded reading,
    #   numeric data fields x (12 leads + lead[0] = 13)
    # Theta: estimated cardiograph lead readings,
    #   number of patients x (12 leads + lead[0] = 13)
    # Y: actual patient data, number of readings x patients
    # R: matrix with 1's in positions of Y holding patient data and 0's
    #   elsewhere, numeric fields x patients
    # lambda: regularization parameter
    # Add the l[0] term for the y-intercept
    nleads <- nleads + 1
    sx <- nreadings * nleads
    st <- npatients * nleads
    X <- matrix(params[1:sx], nreadings, nleads)
    Theta <- matrix(params[(sx + 1):(sx + st)], npatients, nleads)
    # Must transpose either R and Y or X %*% t(Theta)
    S <- R * (X %*% t(Theta) - Y)
    J <- sum(S^2) / 2.0  # half the sum of squared errors over observed entries
    # Regularization keeps the parameters from growing without bound
    J <- J + (lambda / 2.0) * (sum(Theta^2) + sum(X^2))
    return(J)
}
coFilterGrad <- function(params, npatients, nleads, nreadings, Y, R, lambda) {
    # Add the l[0] term for the y-intercept
    nleads <- nleads + 1
    sx <- nreadings * nleads
    st <- npatients * nleads
    X <- matrix(params[1:sx], nreadings, nleads)
    Theta <- matrix(params[(sx + 1):(sx + st)], npatients, nleads)
    # Errors only count at positions where we actually have data
    S <- R * (X %*% t(Theta) - Y)
    X_grad <- S %*% Theta + (lambda * X)
    Theta_grad <- t(S) %*% X + (lambda * Theta)
    # Flatten everything back to a vector, as in Octave. May be the wrong idea.
    grad <- c(as.vector(X_grad), as.vector(Theta_grad))
    return(grad)
}
normalizeY <- function(Y, R) {
    # Averaging a quantity that changes through time is doubtful; some of these
    # readings may be voltages sampled at intervals or something of the sort.
    # Dividing by max(row value) seemed to cause optim() to zero all values in
    # the Theta matrix. This might be due to linear dependence, since it
    # multiplies all values by a constant, but I am not sure. Subtracting the
    # average also causes optim() to zero out the Theta matrix returned in
    # $par. I should normalize the Y matrix somehow, but I am not quite sure
    # what will work.
    rc <- dim(Y)
    Ynorm <- matrix(0, rc[1], rc[2])
    nrm <- vector(length=rc[1])
    # Need the average of each feature. Y is nreadings x npatients,
    # so average across rows.
    for (i in 1:rc[1]) {
        idx <- which(R[i, ] == 1)
        if (length(idx) == 0) {
            nrm[i] <- 0
        } else {
            nrm[i] <- mean(Y[i, idx])
        }
        Ynorm[i, idx] <- Y[i, idx] - nrm[i]
    }
    yl <- list(Ynorm, nrm)
    return(yl)
}
restoreY <- function(Ynorm, nrm) {
    rc <- dim(Ynorm)
    Y <- matrix(0, rc[1], rc[2])
    for (i in 1:rc[1]) {
        Y[i, ] <- Ynorm[i, ] + nrm[i]
    }
    return(Y)
}
Test.R

# Try setting up a smaller test set
# Variables to set how large a square of data to test
npatients <- 4
nreadings <- 8
# We are still not converging within 500 iterations. Perhaps we have too many unknowns.
#nleads <- 10
nleads <- 12
load("cardiacSetup.rda")
# Chop a human-friendly sized chunk out of the data
Y <- as.matrix(cardiac_num_train[1:npatients, 1:nreadings])
#R <- Rtrain[1:npatients, 1:nreadings]
# The machine learning course set Y up as data x source of data, i.e. movie
# ratings x users. I thought this orientation might be why optim() was zeroing
# the parameters in the second matrix, Theta, but more likely it is zeroing
# linearly dependent columns.
Y <- t(Y)
# Zero out same data each time
set.seed(4321)
# May be better way to do this, more like MATLAB
#y <- as.vector(Ynorm)
y <- as.vector(Y)
y[runif(length(y)) > 0.5] <- 0
Yn <- matrix(y, nreadings, npatients)
R <- matrix(0, nreadings, npatients)
R[Yn != 0] <- 1
yl <- normalizeY(Yn, R)
# X and Theta are actually initialized with random data.
# Each reading recorded was created by some mixture of the various leads
X <- matrix(rnorm(nreadings*nleads), nrow=nreadings, ncol=nleads, byrow=T)
# Each patient had a particular set of lead readings
Theta <- matrix(rnorm(npatients*nleads), nrow=npatients, ncol=nleads, byrow=T)
# Must add y intercept parameter to X and Theta
X <- cbind(1, X)
Theta <- cbind(1, Theta)
source("CollaborativeFiltering.R")
params <- c(as.vector(X), as.vector(Theta))
# This seems to run, but it gives me a 1 on convergence rather than a 0.
#p <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, npatients=npatients, nleads=nleads, nreadings=nreadings, Y=Y, R=R, lambda=0)
# Using method="CG"
#p <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, method="CG", npatients=npatients, nleads=nleads, nreadings=nreadings, Y=Y, R=R, lambda=0)
# Using method="BFGS". Seems to converge more reliably.
# The first call below uses Y normalized by subtracting each feature's row
# mean. This causes optim() to zero out all values in the Theta matrix
# returned in pn$par, so nothing further can be done with that fit.
pn <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, method="BFGS", npatients=npatients, nleads=nleads, nreadings=nreadings, Y=yl[[1]], R=R, lambda=0)
p <- optim(par=params, fn=coFilterCost, gr=coFilterGrad, method="BFGS", npatients=npatients, nleads=nleads, nreadings=nreadings, Y=Yn, R=R, lambda=0)
# convergence 1 means that iteration limit has been reached without
# convergence.
sx <- nreadings * (nleads+1)
st <- npatients * (nleads+1)
Xp <- matrix(p$par[1:sx], nreadings, (nleads+1))
Thetap <- matrix(p$par[(sx + 1):(sx + st)], npatients, (nleads+1))
Yp <- Xp %*% t(Thetap)
cat("Convergence:\n")
cat(p$convergence)
cat("\n")
cat("MSE:\n")
cat(sum((Yp - Y)^2)/(nreadings*npatients))
cat("\n")
The functions in CollaborativeFiltering.R are a straightforward translation of what I wrote for the Coursera Machine Learning class from MATLAB/Octave to R.

Regrettably, none of this worked quite as I had hoped. First, I struggled a bit with optim(), just understanding how to use it when the parameters are two entire matrices; I decided to flatten everything into a single vector, as I had in Octave. Then, much more confusingly, I found that optim() was zeroing out the entire second matrix. Since the method requires multiplying X %*% t(Theta) to regenerate the data matrix, this would not work at all. When I fed the data in without any normalization (which the course lectures said was required), I got back two sensible matrices, but whenever I transformed the data in any way, either by dividing each feature by its maximum or by subtracting its average, the second matrix came back filled with zeros. That is useless for a method that depends on multiplying one matrix by the other.
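For anyone puzzled by the same thing, here is a minimal sketch of the flatten/unflatten pattern with optim() on a toy factorization with no missing data; the sizes, names, and lack of regularization are purely illustrative and are not taken from the cardiac code above:

# Toy example: factor a 4 x 3 rank-2 matrix A as X %*% t(Theta) with optim().
set.seed(1)
A <- matrix(rnorm(8), 4, 2) %*% matrix(rnorm(6), 2, 3)
costToy <- function(params, nr, nc, k, A) {
    X <- matrix(params[1:(nr * k)], nr, k)
    Theta <- matrix(params[(nr * k + 1):(nr * k + nc * k)], nc, k)
    sum((X %*% t(Theta) - A)^2) / 2
}
p0 <- rnorm(4 * 2 + 3 * 2)            # X and Theta flattened into one vector
fit <- optim(p0, costToy, nr=4, nc=3, k=2, A=A, method="BFGS")
fit$convergence                       # 0 means optim() reports convergence
# Unflatten the result and look at the reconstruction error
Xhat <- matrix(fit$par[1:8], 4, 2)
Thetahat <- matrix(fit$par[9:14], 3, 2)
max(abs(Xhat %*% t(Thetahat) - A))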

Since I wasn't having success in R, I went back to the original Octave code to see whether the method worked on this data there. Unfortunately, it did not work any better. The code ran and produced a matrix of the proper dimensions, cardiac readings by patients, but at many points the regenerated values were not even of the same order of magnitude as the originals.

Some things which occur to me:

The movie score matrix was very simple: integers between 1 and 5, with many missing entries. More importantly, the problem being solved there wasn't really restoring data; it was yet another classification problem. A movie can belong to one of two classes, interesting or uninteresting, so we don't need to reconstruct an unknown score accurately, we only need to sort the movie into one class or the other. I hadn't been aware that I was solving a classification problem while I was working through the code for class. In fact, classification seems to be the only sort of problem machine learning concerns itself with, at least judging by the material for this course. That, again, is something I had not noticed until I started working on this data set.

Possibly this method could be improved by adding another parameter, like lambda, that pushes down harder on the squared-error term, but it would probably be best to try another approach. I have read of a method that reconstructs a matrix by minimizing the sum of its singular values (the nuclear norm), which seems more promising. I might try that, as well as a neural net and K-means on this data set, if I have some more free time.
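As a rough sketch of what that singular value approach might look like, here is a soft-impute style iteration as I understand it from my reading; the function name, the threshold tau, and the iteration count are illustrative, not something I have tested on this data:

# Matrix completion by soft-thresholding singular values (soft-impute style).
# Y has its unobserved entries set to 0; R marks observed entries with 1.
softImputeSketch <- function(Y, R, tau, iters=100) {
    Z <- Y
    for (k in 1:iters) {
        s <- svd(Z)
        d <- pmax(s$d - tau, 0)                # shrink the singular values
        Zhat <- s$u %*% diag(d) %*% t(s$v)     # low-rank reconstruction
        Z <- R * Y + (1 - R) * Zhat            # keep observed entries as-is
    }
    Z
}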