Wednesday, November 5, 2014

A Discovery That Could Form The Core of a Workable Quantum Computer

I read something recently that I could tell had some unexpected possibilities beyond what one might immediately think. Researchers at Brown University sent a stream of electrons into a container of liquid helium. There is quantum interference at the surface of the liquid helium, with part of the wave function reflecting from the surface and part entering the liquid helium, where, strangely enough, it forms an actual observable bubble. Here is a link: Can the Wave Function of an Electron be Divided and Trapped?

After I thought about the article for a day or so, I began to see that what they were describing was actually an extremely stable system that was performing a quantum calculation; in short, exactly the sort of system which could form the basis for a quantum computer. But how could the device they describe be turned into some sort of controllable, programmable calculator? Just as it stands, it seems the device could probably be used to do factoring, so it could break cryptography. But it has more possibilities than this.

The injection of the electron and the detection of the electron wave sections at the bottom should be relatively simple electronics, so one does not need a quantum device to program or insert data into one's quantum calculator. The input can be controlled by relatively simple electronics, which are in turn controlled by a standard computer. The velocity with which the electron is injected can doubtless be controlled within some range to alter the quantities the device calculates. One can easily imagine a bank of these devices, say 16 or 1024, running in parallel with their input controlled by a single standard computer, and the output interpreted and fed back into the input for further calculation. The devices could even be clustered so that the electron wave functions of several tubes could interfere with each other, allowing another level of calculation. There is even the possibility that one could put a detector on the input tube to detect electrons heading back up the tube. If the experiment behaved like the double slit experiment, one should see the wave function collapse into one of two possible states, so one should have something like a switching function.

Sunday, September 28, 2014

Basic Tutorial for Puppet on KVM inside CentOS guest

I've been reading a bunch of posts lately on the twitterverse going on about how impossible it is to get any modern, sexy technology to run on RedHat or CentOS. Well, I thought, that probably isn't strictly true. Since even Puppet was on some of these lists, and I had already been planning to figure out how Puppet worked, I thought I might have a go. Why not try installing it on a CentOS 6.5 virtual machine first, just to try it out and figure out how to do some basic operations?

Puppet is a tool that can be used to control the configurations of a number of machines from a central server. This is something that seems quite useful, but just reading odd posts online won't necessarily make things clearer. It seems to be some sort of complex thing that involves ruby doing something to do with computers. Meanwhile, the documentation on puppetlabs.com gives a clear description of how to install, but then wanders off into the details of config files that don't seem to apply to anything too comprehensible. The best documentation on puppetlabs for setting up a basic demonstration is outdated and no longer even runs on a current puppet installation without modification. Even after the simplest fix, it still spits out five or six deprecation warnings. So, let's try to fix some of this and get a basic system of two virtual machines running to try puppet out.

First, you will need to have a virtualization system installed. I'll be working
with libvirt and kvm installed on Debian 7. This is set up so that the guest machines are connected to each other and the outside world through a bridge interface running on the host system. This seems to be the best setup for a system that behaves like a real network, but the setup can be a bit confusing; perhaps I can manage a blog post about it as well at some point.
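For reference, the bridge on the Debian 7 host boils down to installing the bridge-utils package and putting a stanza roughly like this in /etc/network/interfaces (the interface names and addresses here are made up):

auto br0
iface br0 inet static
    address 192.168.0.10
    netmask 255.255.255.0
    gateway 192.168.0.1
    bridge_ports eth0
    bridge_stp off
    bridge_fd 0

The guests then attach their virtual NICs to br0 and appear as ordinary machines on the LAN.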

First we need to create a CentOS guest for the puppet master, that is, the machine that will be used to control the configurations of our other machines. Let's try it using the virt-install command. (You need to be root to use virt-install).

root@veryfine:~# virt-install --connect qemu:///system -n pfennig -r 1024 --vcpus=1 --accelerate \
--disk path=/var/lib/libvirt/images/pfennig.img,size=8 \
--network bridge=br0 --graphics vnc \
--cdrom=/var/lib/libvirt/iso/CentOS-6.5-x86_64-minimal.iso

This command has quite a bit going on, so I will try to explain it a bit. We will be creating a machine named pfennig with 1024MB of RAM and one virtual CPU. It will use a bridge called br0 for its network interface. The console and graphics will be passed out over vnc. We will be installing from an iso image file of CentOS-6.5, and we will be installing to a disk image file with a size of 8GB. This should be run on the host machine, where it will automatically pop up an install screen like:





It is perfectly possible to do this remotely, but we might as well start with the easy way first. Then just work through the steps of the install; there is only one particular point to watch that will simplify things later: be sure to click on the button which allows you to configure the network.





Then set the virtual machine to a fixed IP address in a range that you have reserved from your router's DHCP. This will save you some steps learning or relearning the various files that RedHat uses to configure network interfaces.
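For reference, what the installer writes to /etc/sysconfig/network-scripts/ifcfg-eth0 for a static address ends up looking roughly like this (the addresses are made up to match the hosts file used later, and your HWADDR and UUID lines will differ):

DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.0.161
NETMASK=255.255.255.0
GATEWAY=192.168.0.1
DNS1=192.168.0.1

As noted further down, the installer may still leave ONBOOT=no, in which case it has to be flipped to yes by hand.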

We will need a machine for our puppet server to manage, so let's clone pfennig to speed things up and save another install:

# virt-clone --connect=qemu:///system --original=pfennig --name=farthing \
--file=/var/lib/libvirt/images/farthing.img


This should print out a progress bar and run for a while since it is copying several GB of disk image.

Next we might want to do some things to set up a low bandwidth method of managing the machine using the console command from inside the virsh command shell.

The virsh command is a utility which lets you manage virtual machines from the command line. It must be run as root, but it makes it possible to ssh in remotely and undertake almost all virtual machine management from your favorite coffee shop, even if your ISP isn't very good about giving you upstream bandwidth. In my case, so far I haven't gotten vnc to be useable for much of anything, but ssh and virsh work fine.

When you ssh into your main host, su root and type "virsh", you will see something like:

virsh#


This is the virsh command prompt. Here you can type commands like "start pfennig", for example, to start the machine we created. You can also type "console pfennig", and if pfennig is running you will see a blank screen with a message across the top saying that the escape character is ^]. No login prompt will appear. We need to fix this. The console which virsh is giving us is a serial console, so we will need to set the virtual machine up to start agetty on ttyS0.

 Start the machine from virsh and then get the console which it has exported over vnc by typing


$ vncviewer localhost


in another terminal. This should bring up a vnc window of the virtual machine console. Log in with the password you gave for root during the install. Install setserial to check which serial port has been initialized by the virtual machine (almost certainly ttyS0).



# yum install setserial



Then to check:

# setserial -g /dev/ttyS[0123]
/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
/dev/ttyS1, UART: unknown, Port: 0x02f8, IRQ: 3
/dev/ttyS2, UART: unknown, Port: 0x03e8, IRQ: 4
/dev/ttyS3, UART: unknown, Port: 0x02e8, IRQ: 3


This output shows that ttyS0 is the only working port. Now we need to edit /etc/securetty and create /etc/init/ttyS0.conf: append ttyS0 to the end of /etc/securetty, and create /etc/init/ttyS0.conf so that it looks like:

# /etc/init/ttyS0.conf
start on runlevel [345]
stop on runlevel [S016]

respawn
instance /dev/ttyS0
exec /sbin/agetty ttyS0 19200 vt100-nav
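The /etc/securetty change is just one appended line, so something like this does it:

# echo "ttyS0" >> /etc/securetty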

Next, we need to edit one more file. Though we told the install to create a network connection, and gave it an address, it evidently does not believe it should start the thing on boot. Probably I missed a check box. So now let's edit /etc/sysconfig/network-scripts/ifcfg-eth0. We need to change the line:

ONBOOT=no

To:
ONBOOT=yes

Restart the virtual machine by running /sbin/shutdown -r now from the console you have open and when it restarts try "console pfennig" from the virsh shell. Once again you will see a blank screen, but if you hit return you should get a login.

virsh # console pfennig
Connected to domain pfennig
Escape character is ^]

CentOS release 6.5 (Final)
Kernel 2.6.32-431.29.2.el6.x86_64 on an x86_64

pfennig login: 

Now, we will be working from the installation guide for puppet on PuppetLabs, but skipping a few steps that aren't really needed just to test things.

First, we will need to punch port 8140 through the firewall that CentOS runs by default so that puppet can work.

# iptables -n -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED 

ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

# iptables -I INPUT 2 -p tcp --dport 8140 -m state --state NEW -j ACCEPT

# /sbin/service iptables save

First we list all the firewall rules with iptables -n -L, then we insert a rule to allow new connections on port 8140 as the second rule, and then we save the rules so they will come back after a reboot. We need to do this on every host we want to control using puppet, so we will need to open port 8140 on farthing as well when we get that far.

We will need to set up /etc/hosts on pfennig and on every other host that will be using puppet. The right way to do this would be to set up a DNS server, or use dnsmasq, but we are just trying to get a test system up and running and see a bit about how it works. To make this happen, we add puppet as an alternate name for our virtual machine in /etc/hosts on every host that will use pfennig as its puppet master server.

# /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.161   pfennig.homenet puppet
192.168.0.162   farthing.homenet farthing

After we have completed these preliminaries, we need to install a repo for puppetlabs, and then install puppet on pfennig. We need to install an rpm file to enable the puppetlabs repo. Since we are installing on CentOS 6 the command is:

# rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm

Then we need to install the puppet master server:

# yum install puppet-server

Don't start anything yet; we need to do a few things first. We need to edit /etc/puppet/puppet.conf. Look for the [main] section and set the variable dns_alt_names to the names that should resolve to your virtual guest machine, something like this but corrected for whatever names you have chosen:

dns_alt_names = puppet,pfennig,pfennig.example.net

Then we need to create a certificate on the puppet master server.

# puppet master --verbose --no-daemonize

This will print out a message like "Notice: Starting Puppet master version ...", at which point one can hit Ctrl-C to kill the server. This is now a working test puppet master server. The guide on puppetlabs will insist that you need to install a production web server. Don't do this for a test system like this. The "Puppet with Passenger" install died in the middle of the install, at least on Debian 7 for me, and I had to go through a painful process to remove the apache version that had been installed. We don't need any of this to test puppet on a couple of virtual machines, since puppet includes its own built-in web server that will do fine for checking things out.

Now, we need to start up farthing and see if we can get it running as a puppet agent node. Hit ctrl-] and you will be back at the virsh # prompt. From here simply type "start farthing" to get the guest we cloned earlier running, then wait for 10 or 15 seconds to be sure it has booted.

Since we haven't got our serial terminal set up, we will need to use virt-viewer or vnc to continue with our setup on farthing. Let's try accessing it remotely with ssh just to see how this works. (You will of course need to forward port 22 through your router first. Using ssh with secure key login is also a very good idea.) First we need to truck off to the local coffee shop with our laptop, and then we'll type:

ssh youruser@your.external.host.or.ip.com -L 5901:127.0.0.1:5901

And then in a different terminal on your laptop:

vncviewer localhost:5901

The reason for using port 5901 is that farthing is the second machine we started and so should use the second vnc port. If everything goes as planned, you should see a vnc terminal with a login for farthing. Now, we need to repeat some of the steps we took to install puppet on the puppet master server, and then start things up and get pfennig to sign a certificate from farthing. Login to the vnc window for farthing as root, and install the repo for puppetlabs:

# rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm

# yum install puppet

Now you need to get the console for pfennig as well as farthing so that you can start the puppet master server if it is not already running and then send a certificate request from farthing. A simple secure shell into your host will let you run virsh, or you can forward port 5900 using the command above to start a second vnc tunnel. I kind of like using the console from virsh, since the overhead seems lower.

Start puppet on pfennig by running puppet master as root. Then move to the terminal for farthing and type puppet agent --test. This will send a certificate request to the puppet master server on pfennig. Now, move back to the terminal of pfennig and type puppet cert list. You should see the request from farthing for a certificate. Sign it with puppet cert sign farthing. Now everything should be ready to go: we can start writing configs on pfennig and seeing whether they get pushed out to farthing.
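To recap, the whole exchange boils down to these commands, run as root (depending on your setup the certificate may be listed under the full name farthing.homenet, in which case sign that instead):

On pfennig:
# puppet master

On farthing:
# puppet agent --test

Back on pfennig:
# puppet cert list
# puppet cert sign farthing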

Now, we will build a directory tree on pfennig and create several files that should install setserial and the two config files we changed to get a serial terminal working. We will end up with a directory tree that looks like this:

Tree of /etc/puppet files:

/etc/puppet
 |-manifests
 |    |-site.pp
 |
 |-modules
      |-setserial
           |-files
           |    |-securetty
           |    |-ttyS0.conf
           |
           |-manifests
                |-init.pp

This is a modified version of instructions I found on the puppetlabs website: http://projects.puppetlabs.com/projects/puppet/wiki/Advanced_Puppet_Pattern. Unfortunately, the site.pp file they give there does not run, and even after you fix the main error, it produces a half page of deprecation warnings. So I tried to fix things here as far as I could. First, we need to build the directory tree. The modules and manifests directories should exist from the install, but there will be nothing under them, so we need to create a directory called setserial under the modules directory to hold our files, that is, the securetty file, the ttyS0.conf file, and the init.pp file that will control what is installed.

# mkdir /etc/puppet/modules/setserial
# mkdir /etc/puppet/modules/setserial/files
# mkdir /etc/puppet/modules/setserial/manifests

# cp /etc/securetty /etc/puppet/modules/setserial/files
# cp /etc/init/ttyS0.conf /etc/puppet/modules/setserial/files

# vi /etc/puppet/modules/setserial/manifests/init.pp

We need to create init.pp so that it looks like this:

#/etc/puppet/modules/setserial/manifests/init.pp
class setserial {

  package { 'setserial':
    ensure        => latest,
    allow_virtual => false,
  }

  file { "/etc/securetty":
    owner => 'root',
    group => 'root',
    mode => '0400',
    source => "puppet:///modules/setserial/securetty",
    require => Package["setserial"],
  }
  file { "/etc/init/ttyS0.conf":
    owner => 'root',
    group => 'root',
    mode => '0444',
    source => "puppet:///modules/setserial/securetty",
    require => Package["setserial"],
  }
}

Next, we need to create the file site.pp. They say that this does not need to be a single file, but can be a directory instead. No examples of this seem to be provided, only the somewhat opaque man-page sort of documentation, so I won't begin to try showing how to do this.

# vi /etc/puppet/manifests/site.pp


# File - /etc/puppet/manifests/site.pp

# import node defs so they don't all have to be in site.pp
# Unfortunately, importing nodes is deprecated
#
#import "nodes"
#
# All nodes must be defined here if one defines any nodes in this file

# This pulls a module in and makes it default for all nodes.
# It may be required.
node default {
  include setserial
}
#
# This creates a node and installs some custom utils all from this file.
# This works if you wish to manage everything from this one file.
node 'farthing' {
  include custom_utils
  # This should be handled by the default node
  #include setserial
}

class custom_utils {
  package { ["nmap", "traceroute", "vim-enhanced"]:
    ensure => latest,
    allow_virtual => false,
  }
}
# Try to manage pfennig which is puppet master server
# This also works.
node 'pfennig' {
  include custom_utils
}


Next, we need to test our setup out. On pfennig run:

#  puppet apply --noop /etc/puppet/manifests/site.pp 

This will compile our manifests, but not actually run or install anything. If everything works you can type:

# puppet apply /etc/puppet/manifests/site.pp

In the present case, this should install traceroute, nmap and vim on pfennig. Next, we want to see if we can get setserial and our config files installed on farthing and get a serial console working, so we go to our vnc console for farthing and type:

puppet agent -t

The puppet agent should request updates from the master every 30 minutes by default, but setting all that up would be the next stage if you were really going to manage a cluster of machines. We just want the thing to show us what it can do at the moment. With some luck farthing will now have setserial and the required config files installed. Since ttyS0.conf is an upstart job, we will need to run something like:

# initctl start ttyS0

or reboot farthing to actually use the serial terminal. Now, something that I discovered that shows a bit more of the slickness of puppet. When I first started working on this post, I was thinking mostly of using puppet to install elasticsearch. To install elasticsearch one must install java first. Puppet has a module to install java which can be installed quite simply:

# puppet module install puppetlabs-java

This installs a module from puppetlabs without any further work on your part. Then, to use it, we need to update /etc/puppet/manifests/site.pp:

# Add this class to site.pp since elasticsearch requires jdk and the module installs jre by default.
class { 'java':
        distribution => 'jdk',
}

# Alter the node definition of the host you wish to install java on so that it looks like this
node 'searchhost' {
  #include other_modules
  include java
}

I hope this is of some use to people just trying out puppet, and gives some idea of how it could be used.



Friday, September 5, 2014

Implementing the Bayes Spam Filter from the book "Doing Data Science"

I decided to have a try at implementing the spam filter described in "Doing Data Science" by Cathy O'Neil and Rachel Schutt. The book gives a fairly clear description of Bayes' Theorem, and a bash script which can calculate the probability of individual words being spam, but it does not give any real direction for actually coding a multi-word Bayes filter. They do give the formula that such a filter will be based on, but don't really give any clues on how one would code it, apart from solving it out by taking the log of both sides and then proving that Laplace smoothing is valid. This is not quite as user friendly as one might wish, but it does leave plenty to try out and learn.

First, we will need some data so we might as well fetch the Enron emails that they are using in the book for the bash script.

wget 'http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz'

I am assuming that everyone is running linux here and has the wget command installed. One could do this just as well from inside R. Next we will need to unpack the data.

tar xzvf enron1.tar.gz

This will unpack everything into a directory called enron1, with a directory called spam and another called ham underneath it. All the email messages are stored as text files within each directory, one per message. Now, the formula they give for a multi-message spam filter looks like this:

$$p(x|c)=\prod_j\theta_{jc}^{x_j}(1-\theta_{jc})^{(1-x_j)}$$

You can code the formula directly in R with something like this:

pxGivenC <- function(theta, x) {
  # theta: per-word probabilities for one class; x: 0/1 vector over the dictionary
  not_x <- 1 - x
  not_theta <- 1 - theta
  pxc <- 1
  for(i in seq_along(x)) {
    # multiply in each word's contribution, exactly as in the product above
    pxc <- pxc * theta[i]^x[i] * not_theta[i]^not_x[i]
  }
  return(pxc)
}

But this will not work as well as one might hope, because the products of probabilities become so tiny that they fall off the end of the floating point implementation. If one tries this approach on the Enron dataset, the model guesses that everything is ham, since the ratio of ham to spam is about 0.71 to 0.29. The book recommends taking the log of both sides and then solves out sections of the resulting formula which do not change for individual email messages, but it doesn't really explain where the formula came from. So let me try to explain what I have managed to figure out so far. You might want to download a reference given in the book, Spam Filtering with Naive Bayes - Which Naive Bayes? It is a math-heavy explanation of the various different ways one can implement Bayes' Theorem, but it explained where the equation above came from, which "Doing Data Science" did not.
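A quick way to see the underflow in R: a few hundred factors of 0.1 multiply out to exactly zero in double precision, while the sum of their logs is still a perfectly ordinary number.

prod(rep(0.1, 400))      # 0
sum(log(rep(0.1, 400)))  # -921.034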

In the book they take the log of both sides of the formula, which is apparently necessary to have any useful output. Taking the log turns the product into a sum,

$$\log(p(x|c))=\sum_j\left[x_j\log(\theta_j)+(1-x_j)\log(1-\theta_j)\right]$$

and splitting the second term and regrouping gives:

$$\log(p(x|c))=\sum_jx_j\log(\theta_j/(1-\theta_j))+\sum_j(\log(1-\theta_j))$$

Most of this formula doesn't vary with the individual messages, so it can be extracted and solved once ahead of time:

$$w_j=\log(\theta_j/(1-\theta_j))$$

$$w_0=\sum_j(\log(1-\theta_j))$$

So substituting we get:

$$\log(p(x|c))=\sum_jx_jw_j+w_0$$

Now, what does all this mean, and what have we calculated in relation to Bayes' theorem? First, we haven't calculated what we were actually looking for, which would be p(spam|x) for all the words in an email message taken together; instead we have calculated p(x|spam), that is, the probability of finding this particular collection of words (each appearing at least once) in an email message we know to be spam. This is evidently the hard part according to the book, but in my case at least, I found a few other confusing things.

First, let's go over all the steps we need to take before we can even apply this formula.

  • We need to count all the words that occur in spam and ham email and turn this into a dictionary with the counts of each word. But this is not a count of how many times we found a word in a spam message and how many times we found a word in a ham message. Instead, we are counting the number of spam messages we found this word at least once in, then counting the number of ham messages we found the same word at least once in, and saving these counts. This information is used to construct theta in the formula above. Theta is a long vector of probabilities of finding each and every word in the dictionary in a spam email. According to the book, it is constructed by this formula: (count of spam emails containing word)/(count of all emails containing word). On the other hand, the reference I cited above gives the formula as (count of spam emails containing word)/(count of spam emails). The second form would seem to be the actual probability p(word|spam), while I haven't quite figured out exactly what the first form is. Maybe it allows one to calculate the denominator of Bayes' Theorem more easily, but the authors do not explain this in the book.
  • After we have built a great deal of infrastructure to let us read in all the email messages of both classes and count how many messages of both classes contain each word, we need to calculate theta by either one or the other formula, and then calculate w and w0 for our training set. All these counts and variables are determined by our training data and have no dependence on any testing email we are classifying. As near as I can see, the process of counting the number of emails of our training set in which we find each word is the entire training process.
  • Next, we need to read in a testing email, parse it into individual words, and construct an x vector of zeros which is as long as our dictionary. We will flip the x vector to a one at every index of a word in our dictionary which we find in the email message we are testing.
  • Now we can finally load the values into the formula above and derive the equivalent of log(p(word|spam)) for all the words in our test email message. This produced outputs like -1084.43 for me, as well as many cases of -Inf and NaN. In order to avoid the values of NaN and -Inf, one must ensure that none of the words have a probability of zero with regard to either ham or spam messages. To do this, one must alter the formula used to calculate theta by adding a small alpha and beta to the numerator and denominator of the ratio of counts. According to the book, alpha and beta should be some value < 0.2, while according to the online source alpha = 1 and beta = 2 is the correct choice. The first problem of the relatively large negative number is more perplexing. This is the log of a probability, so to calculate the probability of finding our email message x given that we knew it was spam we will have to take the exponential of -1084.43, and this will be zero according to R. Assuming we have set up everything right on the way to here, we will need to find some way of turning this into p(spam|x) that doesn't just involve running it through exp().
  • Since we can't just take the exponential we will need to keep everything as a log until we solved the denominator of Bayes' Theorem, and then hopefully we will have something that will be a more likely probability. Looking online I find a log identity that seems useful: log(a + b) = log(a) + log(1 + exp(log(b) - log(a))). This can be used to construct the denominator of Bayes' Theorem without taking the anti-log, and so after a few algebra incidents we can see things which look like valid probabilities come out of our little dummy Bayes' Classifier.

This is the code for a little dummy program which implements everything apart from the parsing. The code for the entire project is on github at https://github.com/joelkr/BayesFilter.git. The code is in R and it works, but the section which parses and builds the dictionary is seriously slow. So here is something I wrote to try to figure out why I wasn't getting valid probabilities out of my huge Rube Goldberg parsing contraption. It creates a dummy dictionary out of two-letter tokens and then constructs some email vectors, with spam emails mostly having odd tokens and ham messages mostly having even tokens:
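What follows is a minimal sketch along those lines rather than the exact listing from the repository; the token scheme, priors, and smoothing values are just illustrative choices.

# Dummy dictionary of two-letter tokens; spam favors the odd ones, ham the even ones.
set.seed(42)
dict <- paste0(letters[1:20], letters[1:20])
n_words <- length(dict)
odd_idx  <- seq(1, n_words, by = 2)
even_idx <- seq(2, n_words, by = 2)

# A message is a 0/1 vector over the dictionary: did the word appear at least once?
make_msg <- function(favored, p_fav = 0.7, p_other = 0.1) {
  x <- rbinom(n_words, 1, p_other)
  x[favored] <- rbinom(length(favored), 1, p_fav)
  x
}

n_train <- 200
spam <- t(replicate(n_train, make_msg(odd_idx)))
ham  <- t(replicate(n_train, make_msg(even_idx)))

# theta: per-word probability of appearing in a message of the class, with
# alpha/beta smoothing so no word ever has probability exactly 0 or 1.
alpha <- 1; beta <- 2
theta_spam <- (colSums(spam) + alpha) / (nrow(spam) + beta)
theta_ham  <- (colSums(ham)  + alpha) / (nrow(ham)  + beta)

# Per-class weights for log(p(x|c)) = sum_j x_j * w_j + w0
weights <- function(theta) list(w = log(theta / (1 - theta)), w0 = sum(log(1 - theta)))
ws <- weights(theta_spam)
wh <- weights(theta_ham)
log_prior_spam <- log(0.5)
log_prior_ham  <- log(0.5)

# p(spam|x), computed entirely in log space using
# log(a + b) = log(a) + log(1 + exp(log(b) - log(a)))
p_spam_given_x <- function(x) {
  ls <- sum(x * ws$w) + ws$w0 + log_prior_spam  # log p(x|spam) + log p(spam)
  lh <- sum(x * wh$w) + wh$w0 + log_prior_ham   # log p(x|ham)  + log p(ham)
  exp(ls - (ls + log(1 + exp(lh - ls))))        # exp(log numerator - log denominator)
}

p_spam_given_x(make_msg(odd_idx))   # close to 1
p_spam_given_x(make_msg(even_idx))  # close to 0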

Tuesday, July 22, 2014

Using Python to Model a Levy Path

I have been on a bit of a project to broaden my mind and revisit things which I had studied years ago using Coursera. As part of this, I took a course on Statistical Mechanics offered by Professor Werner Krauth at the École Normale Supérieure in Paris. (Taught in English, fortunately for me, as I wouldn't have even thought of trying to deal with it in French; my high school French is completely lost to me.) Even in this case I was struggling a bit, and only got to the middle of the course before I decided that I was too far behind to catch up. I saved all the homework exercises and lecture videos, and resolved that I would come back and work through them as soon as I had the chance.

The course involves describing physical systems by modeling them as systems of statistical behavior, rather than the deterministic models one might think of constructing using freshman physics. Strangely, the results produced by these models are identical in many circumstances to those produced by deterministic models. The course was taught using python as an environment to construct the models, which was a good choice for producing simple code which one could run and modify on the spot, but it put me in a little bit of pain moving back and forth between R, octave (an opensource MATLAB clone), and python with the various courses I was taking. I found myself continually wondering: now what was the function called that did that again? And what were the arguments?...evidently not those...hm. I very much wanted to settle on one particular language and not jump back and forth, though I still haven't figured out which one of them that would be.

The course began with an introduction to Markov chain methods and simulations, something which I had vaguely heard of but never really known what people were talking about, and then moved into modelling simplified physical systems using Markov methods. I got to the first section of Quantum Statistical Mechanics and then decided that I was not going to be able to keep up with the material, so now I have been working through it more gradually. The section I am working on now involves Levy paths. These are something else I vaguely remember hearing about without having any idea what they were. They evidently appear in models of phenomena of all sorts, from quantum mechanics to financial analysis.


This course described them as being a simpler way of obtaining essentially the same output as the Feynman path integral, and, at least in the demo program, it really isn't different from a random walk with each change of direction being chosen from a normal distribution. In fact, a Levy path can be constructed this way, though one must correct back to the final x end position and move each x slice back proportionally as well.



# Add a random normal with mean of x-previous.
Upsilon.append(random.gauss(Upsilon[-1], sigma))


 # Shift each slice back proportionally so the corrected path ends at xend
 x = [0.0] + [Upsilon[k] + (xend - Upsilon[-1]) * k / float(N) for k in range(1, N + 1)]
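Pulled together into something self-contained, the construction looks roughly like this (beta, N, and the endpoints are just demo values, and sigma = sqrt(dtau) is my assumption for the free-particle step width):

import random
import math

beta, N = 4.0, 100
dtau = beta / N
sigma = math.sqrt(dtau)   # assumed step width for a free particle
xstart, xend = 0.0, 1.0

# Plain Gaussian random walk: each position is a normal centred on the previous one
Upsilon = [xstart]
for k in range(N):
    Upsilon.append(random.gauss(Upsilon[-1], sigma))

# Shift every slice proportionally so the corrected path ends exactly at xend
x = [xstart] + [Upsilon[k] + (xend - Upsilon[-1]) * k / float(N)
                for k in range(1, N + 1)]
print(x[-1])              # equals xend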



One can see that the last position selected becomes the mean for the distribution from which the next position is selected. In the class we were provided with a Markov implementation of the path integral first before learning about Levy paths. The implementation involved generating proposed x positions from a random uniform distribution, calculating new weight and old weight, and then throwing away everything that did not fit into a random normal distribution. This, of course, makes no particular sense as an algorithm, but it is exactly how a Markov method works. One does not know anything about the probabilities of the solution space. One simply walks randomly around the solution space, and calculates the values of randomly chosen points. The wrapping algorithm that I was supposed to implement for this exercise caused me some confusion as well. One was supposed to implement some sort of list wrap like:


L = L[3:] + L[:3]


The first time I worked through the exercise, I did not understand that the previously generated value was being used to generate the current value, so things were just being scattered on a normal distribution, rather than gradually tending toward the value of xend. I also did not see how large versus small beta affected the calculation until I did a graph of the internal variables of levy_harmonic_path(). I still am not sure I am gaining anything by wrapping the list. I had thought it would be necessary for the sort of Markov method implementation that would be needed to simulate a Bose-Einstein condensation. One would just have the program grind round a million times and it would never be able to leave the list, but I am not really sure things work that way with this particular function - when you call it, it generates a new list every time, and overhead seems quite high for no particular gain. On the other hand, maybe I am just not quite understanding what they have in mind. Now, let's have a look at the inside of the function that is supposed to construct a Levy path and see if any sense can be made of it. The function is called like:

levy_harmonic_path(xstart, xend, dtau, N)

xstart and xend are the initial and final positions of the particle of course. I found myself considerably confused by the lecture's insistence that the particle begin and end at the same position. The code given did not have any such requirement, but it is probably needed for a Markov simulation, which always seems to run inside a boundary-less box with particles leaving from one side reappearing on the other. Dtau is beta divided by N, so it represents the temperature loss with each slice of the simulation. N is the number of slices we have divided the simulation into. In order to calculate a possible x position at a given slice we need to have the previous x position and the final x position. These will be used to calculate the mean and standard deviation of a normal distribution from which a random selection will be made for the new x position on the current slice.

We are only moving in one dimension here, but we have a second dimension because we are moving from high temperature, or low beta, to low temperature, or high beta. The position of the particle becomes more and more uncertain as beta gets larger, but it has a definite position at xstart and xend, at least as far as this code is written. (This might be another reason why the code must be altered to wrap, particularly if one were to observe a Bose-Einstein condensation.) This particular function models the behavior of a particle in a potential well, which means that a force is acting on it to keep it near x = 0. Now, when considered statistically, a force means that one is more likely to be found in one region than another, rather than some sort of causal factor producing a particular trajectory. This may seem obvious, but I had to think a bit before it occurred to me.

Ups1 = 1.0 / np.tanh(dtau) + 1.0 / np.tanh(dtau_prime)

Ups2 = x[k - 1] / np.sinh(dtau) + xend / np.sinh(dtau_prime)

mean = Ups2/Ups1

sd = 1.0/np.sqrt(Ups1)

This is how the function calculates the mean and standard deviation of the distribution from which to make a random selection for the x position of the current slice.

Here dtau_prime = (N-k) * dtau, so it is the total tau for all the slices remaining until we reach the final value of beta (each slice changes by dtau). Now, the first thing which occurred to me was that I had no real idea what the functions in the denominator looked like, so we can have a look at hyperbolic sine and hyperbolic tangent:

Hyperbolic sine increases exponentially, and so flattens out everything the larger dtau and dtau_prime are, while hyperbolic tangent approaches an asymptote of 1, so as dtau becomes larger 1/tanh(dtau) tends toward 1. dtau depends entirely on how large beta is, while dtau_prime decreases with each slice, so the formula has the effect of pulling us along a track toward xend, at least with small beta. When beta is large, each 1/tanh term in Ups1 tends toward 1 (so Ups1 tends toward 2) and Ups2 tends toward 0, so in the middle of the path we are making a random selection of x from something close to a standard normal distribution, with mean near zero and a standard deviation of order one.
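Pulling the fragments above together, the whole function has roughly this shape (my reconstruction from the pieces shown, not a verbatim copy of the course code):

import random
import numpy as np

def levy_harmonic_path(xstart, xend, dtau, N):
    # Sample one path of N slices for a particle in a harmonic well,
    # pinned at xstart on the first slice and xend on the last.
    x = [xstart]
    for k in range(1, N):
        dtau_prime = (N - k) * dtau                  # imaginary time left to the end
        Ups1 = 1.0 / np.tanh(dtau) + 1.0 / np.tanh(dtau_prime)
        Ups2 = x[k - 1] / np.sinh(dtau) + xend / np.sinh(dtau_prime)
        mean = Ups2 / Ups1
        sd = 1.0 / np.sqrt(Ups1)
        x.append(random.gauss(mean, sd))             # position of slice k
    x.append(xend)
    return x

beta, N = 1.0, 100
path = levy_harmonic_path(0.0, 0.0, beta / N, N)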

With small beta (the plot labeled BETA1.0), the course is determined. One starts at the initial position and ends at the final position with only small deviations, but with large beta things are different.

With large beta (the plot labeled BETA20.0), things look different. The mean for the selection, Ups2/Ups1, tends toward zero. The position is very uncertain, and the initial and final positions have little effect on the position of the particle at any given slice.

Tuesday, June 24, 2014

Trying out some techniques from Coursera Machine Learning Class

I finished the Machine Learning course from Andrew Ng on Coursera recently. The last topic he covered in the course was recommender systems. There was a technique he described that involved factoring the matrix of movie ratings into a matrix of movie features multiplied by a matrix of user preferences, which fills in scores for movies that had not been rated and allows the system to make recommendations for movies the users had not seen. It immediately occurred to me that this should be a general method one could use to fill in missing data in many sorts of situations. I decided that I would need to investigate the technique further after the class ended. I would try to find some relatively simple data file and see how one would go about applying the technique to it. During a lecture, Professor Ng mentioned that the technique he was describing was known as Low Rank Matrix Factorization. I did a bit of web searching and found more information under "Low Rank Matrix Reconstruction."

When I had some free time, I went straight to the UC Irvine Machine Learning Repository website to see if I could find some relatively simple, real world data to try the technique on. I found a data file on cardiac arrhythmia which seemed about right for a test: not too big, a fair amount of missing data, mostly numeric, and possibly low rank. I know nothing about electrocardiogram procedures, but thanks to Wikipedia, I discovered that they involve recording data through twelve leads. The data file itself has 279 attributes recorded for 452 individuals. In principle, all 279 attributes should be created by some mixture of the twelve variables of the different leads, so this data should definitely be a low rank matrix, and possible to reconstruct using one method or another.

First I needed to do some data munging on the file. Beginning with code I found at:

http://www.togaware.com/datamining/survivor/Cardiac_Arrhythmia.html


I wrote some code in R to label all the variables, remove non-numeric columns, split the sample into training and test sets, and save the results into Rdata files.

Setup.R

First, I tried running svd() on the dataset with all NA's removed. This reduced the dataset to only 68 complete rows, but it seemed likely to give me some idea of how many features there would be in an optimal basis.
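Roughly, that check looks like this (a sketch; the variable names are made up rather than taken from Setup.R):

# keep only the complete rows, then look at how the singular values fall off
complete <- na.omit(train)                      # hypothetical data frame from Setup.R
s <- svd(scale(as.matrix(complete)))
plot(s$d, type = "b", ylab = "singular value")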



Eyeballing things, it seems as though 10 to 20 features would be a reasonable reduced basis, at least from the complete data examples.  This is good, since it does not obviously conflict with the number of leads used in the apparatus, which is the reason to settle on this number of parameters. It seems that we should be able to apply the method of minimizing parameters to this data to recover the NA values.

So on to the code which should let us try this in R. First, a file which specifies the cost and gradient functions for optim(), then a script to run them and see whether this approach recovers missing values which we have zeroed out ourselves for a test.

CollaborativeFiltering.R


Test.R

The functions in CollaborativeFiltering.R are a basic translation of what I wrote in the Coursera Machine Learning class from MATLAB, or octave, to R.
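For anyone who does not want to chase the links, the heart of CollaborativeFiltering.R is a cost function and gradient of roughly this shape (my own reconstruction rather than the file itself; the names, and num_features = 12, are just placeholders). Y holds the data with missing entries set to zero, and R is 1 wherever a value was actually observed:

cofiCost <- function(params, Y, R, num_rows, num_cols, num_features, lambda) {
  # unpack the flattened parameter vector back into the two matrices
  X     <- matrix(params[1:(num_rows * num_features)], num_rows, num_features)
  Theta <- matrix(params[-(1:(num_rows * num_features))], num_cols, num_features)
  E <- (X %*% t(Theta) - Y) * R             # error only where a value was observed
  0.5 * sum(E^2) + (lambda / 2) * (sum(X^2) + sum(Theta^2))
}

cofiGrad <- function(params, Y, R, num_rows, num_cols, num_features, lambda) {
  X     <- matrix(params[1:(num_rows * num_features)], num_rows, num_features)
  Theta <- matrix(params[-(1:(num_rows * num_features))], num_cols, num_features)
  E <- (X %*% t(Theta) - Y) * R
  c(as.vector(E %*% Theta + lambda * X),    # gradient with respect to X
    as.vector(t(E) %*% X + lambda * Theta)) # gradient with respect to Theta
}

# init <- runif((nrow(Y) + ncol(Y)) * 12, -0.1, 0.1)
# fit  <- optim(init, cofiCost, cofiGrad, method = "L-BFGS-B",
#               Y = Y, R = R, num_rows = nrow(Y), num_cols = ncol(Y),
#               num_features = 12, lambda = 1)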

Regrettably, none of this worked quite as I had thought. First, I had a bit of a struggle with optim(), just understanding how one should use it with two entire matrices as a parameter. I decided that I should just flatten everything to a vector, just as I had in octave. Then, much more confusing, I found that optim() was zeroing out the entire second matrix. Since the process requires that one multiply matrix X * Theta' to regenerate the data matrix, this would not work at all. I found that when I fed the data in without any sort of normalizing (which was required, at least by the course lectures), I got back two matrices, but when I touched the data in any way (either by dividing by the max of each feature or subtracting the average), the second matrix came back filled with zeros. This was quite useless for a method that requires multiplying one matrix by another.

Since I wasn't having success with R, I decided to return to the original octave code, and see whether the method worked on this data. Unfortunately, it did not work there any better. The code ran and produced a matrix of the proper dimensions of cardiac readings by patients, but the numbers were not even the same order of magnitude at many points in the regenerated matrix.

Some things which occur to me:
The movie score matrix was very simple. It was composed of integers between 1 and 5 with many missing data points. Next, the problem that was being solved wasn't actually restoring data; it was yet another classification problem. There are two classes which a movie can belong to in this problem, these being interesting and uninteresting. We don't need to accurately reconstruct an unknown score, we only need to sort this particular movie into one class or the other. I hadn't been aware that I was actually solving a classification problem while I was working through the code for class. In fact, classification problems seem to be the only sort of problems Machine Learning concerns itself with, at least judging by the course material for this class. This again is something that I had not noticed until I started working on this data set. Possibly this method could be improved by adding another variable like the lambda to push down on the mean squared error section of the code, but it would probably be best to try another approach. I read of a method that involves reconstructing a matrix by minimizing the sum of the singular values returned by svd() (nuclear norm minimization). This seems more likely to work. I might try this, as well as trying a neural net and K-means on this data set, if I have some more free time.