Tuesday, June 24, 2014

Trying out some techniques from Coursera Machine Learning Class

I recently finished Andrew Ng's Machine Learning course on Coursera. The last topic he covered in the course was recommender systems. There was a technique he described that involved multiplying a matrix of user preferences by a matrix of movie features to fill in ratings for movies which had not been rated, and to allow the system to make recommendations for movies the users had not seen. It immediately occurred to me that this should be a general method one could use to fill in missing data in many sorts of situations. I decided that I would need to investigate the technique further after the class ended: I would try to find some relatively simple data file and see how one would go about applying the technique to it. During a lecture, Professor Ng mentioned that the technique he was describing was known as Low Rank Matrix Factorization. A bit of web searching turned up more information under "Low Rank Matrix Reconstruction."
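The core idea, as I understood it from the lectures, is that the ratings matrix is approximated by the product of two much thinner matrices, and that product also supplies values for the entries that were never observed. A toy illustration in R (the variable names here are mine, not the course code):

# Toy illustration: a rank-2 "ratings" matrix regenerated as X %*% t(Theta)
set.seed(1)
X     <- matrix(rnorm(5 * 2), nrow = 5)   # 5 movies, 2 latent features
Theta <- matrix(rnorm(4 * 2), nrow = 4)   # 4 users, 2 latent features
Y.hat <- X %*% t(Theta)                   # full 5 x 4 matrix of predicted scores
Y.hat[2, 3]                               # an unrated movie/user pair still gets a predicted score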

When I had some free time, I went straight to the UC Irvine Machine Learning Repository to see if I could find some relatively simple, real-world data to try the technique on. I found a data file on cardiac arrhythmia which seemed about right for a test: not too big, a fair amount of missing data, mostly numeric, and possibly low rank. I know nothing about electrocardiogram procedures, but thanks to Wikipedia, I discovered that they involve recording data through twelve leads. The data file itself has 279 attributes recorded for 452 individuals. In principle, all 279 attributes should be generated by some mixture of the signals from the twelve leads, so this data should definitely be a low rank matrix, and it should be possible to reconstruct it by one method or another.

First I needed to do some data munging on the file. Beginning with code I found at:

http://www.togaware.com/datamining/survivor/Cardiac_Arrhythmia.html


I wrote some code in R to label all the variables, remove non-numeric columns, split the sample into training and test sets, and save the results into Rdata files.

Setup.R
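(The script is embedded above rather than reproduced in the text; roughly, it does something along these lines. This is only a sketch of the steps described, assuming the arrhythmia.data file from UCI with "?" marking missing values, and the object names are mine.)

# Rough sketch of the preprocessing in Setup.R, not the embedded script itself
raw <- read.csv("arrhythmia.data", header = FALSE, na.strings = "?")

# Label the columns; the real script assigns descriptive names
names(raw) <- paste0("V", seq_along(raw))

# Keep only the numeric columns
numeric.only <- raw[, sapply(raw, is.numeric)]

# Split into training and test sets and save as Rdata
set.seed(42)
train.idx <- sample(nrow(numeric.only), floor(0.7 * nrow(numeric.only)))
train <- numeric.only[train.idx, ]
test  <- numeric.only[-train.idx, ]
save(train, test, file = "arrhythmia.Rdata")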

First, I tried running svd() on the dataset with all NAs removed. This reduced the dataset to only 68 complete rows, but it seemed likely to give me some idea of how many features an optimal basis would need.
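Something along these lines (a sketch, reusing the hypothetical object names from above):

# SVD on the rows with no missing values, to eyeball the singular values
complete <- numeric.only[complete.cases(numeric.only), ]   # roughly 68 complete rows
keep     <- sapply(complete, sd) > 0                       # drop constant columns before scaling
decomp   <- svd(scale(as.matrix(complete[, keep])))
plot(decomp$d, type = "b", xlab = "component", ylab = "singular value")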



Eyeballing the singular values, it seems as though 10 to 20 features would be a reasonable reduced basis, at least judging from the complete rows. This is encouraging, since it does not obviously conflict with the twelve leads used in the apparatus, which is what suggested a parameter count in that range in the first place. It seems we should be able to apply the factorization method to this data to recover the NA values.

So, on to the code which should let us try this in R: first, a file which specifies the cost and gradient functions for optim(), then a script to run them and see whether this approach recovers missing values that we have zeroed out ourselves as a test.

CollaborativeFiltering.R


Test.R

The functions in CollaborativeFiltering.R are a straightforward translation of what I wrote for the Coursera Machine Learning class, from MATLAB/Octave to R.
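For reference, a stripped-down version of the general shape of that translation might look like this; it is my reconstruction from memory of the course assignment, not the embedded file, and the argument names are mine. Y is the data matrix with missing entries set to zero, and R is a 0/1 mask marking which entries were actually observed.

# Regularized collaborative filtering cost, with X and Theta flattened into one vector
cofi.cost <- function(params, Y, R, num.features, lambda) {
  n.rows <- nrow(Y); n.cols <- ncol(Y)
  X     <- matrix(params[1:(n.rows * num.features)], n.rows, num.features)
  Theta <- matrix(params[-(1:(n.rows * num.features))], n.cols, num.features)
  err   <- X %*% t(Theta) - Y
  err[R == 0] <- 0                                  # unobserved entries do not contribute
  0.5 * sum(err^2) + (lambda / 2) * (sum(Theta^2) + sum(X^2))
}

# Matching gradient, returned as one flattened vector for optim()
cofi.grad <- function(params, Y, R, num.features, lambda) {
  n.rows <- nrow(Y); n.cols <- ncol(Y)
  X     <- matrix(params[1:(n.rows * num.features)], n.rows, num.features)
  Theta <- matrix(params[-(1:(n.rows * num.features))], n.cols, num.features)
  err   <- X %*% t(Theta) - Y
  err[R == 0] <- 0
  c(as.vector(err %*% Theta + lambda * X),          # gradient with respect to X
    as.vector(t(err) %*% X + lambda * Theta))       # gradient with respect to Theta
}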

Regrettably, none of this worked quite as I had thought. First, I had a bit of a struggle just understanding how one should use optim() when the parameters are two entire matrices. I decided to flatten everything into a single vector, just as I had in Octave. Then, much more confusingly, I found that optim() was zeroing out the entire second matrix. Since the process requires multiplying X %*% t(Theta) to regenerate the data matrix, this would not work at all. When I fed the data in without any sort of normalizing (which the course lectures said was required), I got back two usable matrices; but when I touched the data in any way, either by dividing by the max of each feature or by subtracting the average, the second matrix came back filled with zeros. That is quite useless for a method that requires multiplying one matrix by another.
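For what it is worth, the flattening and the mean normalization I was attempting looked roughly like the following. This is only a sketch using the hypothetical functions above, and I have not verified that it avoids the zeroed-out Theta problem.

# Build the mask, normalize the observed entries, and hand a flattened vector to optim()
num.features <- 12                                  # a guess: one feature per ECG lead
Y  <- as.matrix(train)                              # patients by attributes, NAs where data is missing
R  <- ifelse(is.na(Y), 0, 1)                        # 1 where observed, 0 where missing
Y0 <- Y; Y0[is.na(Y0)] <- 0
Ymeans <- rowSums(Y0) / pmax(rowSums(R), 1)         # row means over observed entries only
Ynorm  <- (Y0 - Ymeans) * R                         # normalized; unobserved entries stay at 0

init <- rnorm((nrow(Y) + ncol(Y)) * num.features, sd = 0.1)
fit  <- optim(init, fn = cofi.cost, gr = cofi.grad,
              Y = Ynorm, R = R, num.features = num.features, lambda = 1,
              method = "L-BFGS-B", control = list(maxit = 200))

# Rebuild the filled-in matrix and add the row means back
X.hat     <- matrix(fit$par[1:(nrow(Y) * num.features)], nrow(Y), num.features)
Theta.hat <- matrix(fit$par[-(1:(nrow(Y) * num.features))], ncol(Y), num.features)
Y.hat     <- X.hat %*% t(Theta.hat) + Ymeans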

Since I wasn't having success with R, I decided to return to the original Octave code and see whether the method worked on this data there. Unfortunately, it did not work any better. The code ran and produced a matrix of the proper dimensions, cardiac readings by patients, but at many points the numbers in the regenerated matrix were not even the right order of magnitude.

Some things which occur to me:
The movie score matrix was very simple: it was composed of integers between 1 and 5, with many missing data points. The arrhythmia readings are nothing like that.

More importantly, the problem being solved in the course wasn't really restoring data; it was yet another classification problem. A movie can belong to one of two classes, interesting or uninteresting. We don't need to accurately reconstruct an unknown score, we only need to sort a particular movie into one class or the other. I hadn't been aware that I was actually solving a classification problem while I was working through the code for class. In fact, classification seems to be the only sort of problem machine learning concerns itself with, at least judging by the course material for this class. That, too, is something I had not noticed until I started working on this data set.

Possibly this method could be improved by adding another tuning parameter, like lambda, to push down on the mean squared error portion of the cost, but it would probably be best to try another approach. I read about a method that reconstructs a matrix by minimizing the sum of the singular values returned by svd(). That seems more promising. I might try it, as well as a neural net and K-means on this data set, if I have some more free time.
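As I understand it, minimizing the sum of the singular values is nuclear norm minimization, and singular value thresholding is one standard way to attempt it. A rough sketch, with function and parameter names of my own invention, in case I get around to trying it:

# Rough sketch of singular value thresholding for matrix completion
# Y: matrix with NAs for missing entries; tau: threshold; delta: step size
svt.complete <- function(Y, tau, delta = 1.2, iters = 200) {
  obs  <- !is.na(Y)
  Y0   <- Y; Y0[!obs] <- 0
  Z    <- matrix(0, nrow(Y), ncol(Y))
  Xhat <- Z
  for (i in seq_len(iters)) {
    s    <- svd(Z)
    d    <- pmax(s$d - tau, 0)                      # soft-threshold the singular values
    Xhat <- s$u %*% diag(d, length(d)) %*% t(s$v)
    Z    <- Z + delta * obs * (Y0 - Xhat)           # only observed entries drive the update
  }
  Xhat
}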