A more perfect vacuum: September 2014

Sunday, September 28, 2014

Basic Tutorial for Puppet on KVM inside CentOS guest

I've been reading a bunch of posts lately on the twitterverse going on about how impossible it is to get any modern, sexy technology to run on RedHat or CentOS. Well, I thought, that probably isn't strictly true. Since even Puppet was on some of these lists, and I had already been planning to figure out how Puppet worked, I thought I might have a go. Why not try installing it on a CentOS 6.5 virtual machine first, just to try it out and figure out how to do some basic operations?

Puppet is a tool that can be used to control the configurations of a number of machines from a central server. This is something that seems quite useful, but just reading odd posts online won't necessarily make things clearer. It seems to be some sort of complex thing that involves Ruby doing something to do with computers. Meanwhile, the documentation on puppetlabs.com gives a clear description of how to install it, but then wanders off into the details of config files that don't seem to apply to anything too comprehensible. The best documentation on puppetlabs for setting up a basic demonstration is outdated and no longer even runs on a current puppet installation without modification. Even after the simplest fix, it still spits out five or six deprecation warnings. So, let's try to fix some of this and get a basic system of two virtual machines running to try puppet out.

First, you will need to have a virtualization system installed. I'll be working with libvirt and kvm installed on Debian 7. This is set up so that the guest machines are connected to each other and the outside world through a bridge interface running on the host system. This seems to be the best setup for a system that behaves like a real network, but the setup can be a bit confusing. Perhaps I can manage a blog post about this as well at some point.

First we need to create a CentOS guest for the puppet master, that is, the machine that will be used to control the configurations of our other machines. Let's try it using the virt-install command. (You need to be root to use virt-install).

root@veryfine:~# virt-install --connect qemu:///system -n pfennig -r 1024 --vcpus=1 --accelerate \
--disk path=/var/lib/libvirt/images/pfennig.img,size=8 \
--network bridge=br0 --graphics vnc \
--cdrom=/var/lib/libvirt/iso/CentOS-6.5-x86_64-minimal.iso

This command has quite a bit going on so I will try to explain it a bit. We will be creating a machine named pfennig with 1024 MB of RAM. It will use a bridge called br0 for its network interface. The console and graphics will be passed out over vnc. We will be installing from an iso image file of CentOS-6.5 and we will be installing to a disk image file with a size of 8 GB. This should be run on the host machine itself, where it will automatically pop up the installer screen.

It is perfectly possible to do this remotely, but we might as well start with the easy way. Then just work through the steps of the install, with one particular point to watch that will simplify things later: be sure to click on the button which allows you to configure the network.

Then set the virtual machine to a fixed ip address in a range that you have reserved from your router's dhcp. This will save you some steps learning or relearning the various files that RedHat uses to configure network interfaces.

We will need a machine for our puppet server to manage, so let's clone pfennig to speed things up and save another install:

# virt-clone --connect=qemu:///system --original=pfennig --name=farthing \
--file=/var/lib/libvirt/images/farthing.img

This should print out a progress bar and run for a while since it is copying several GB of disk image.

Next we might want to do some things to set up a low bandwidth method of managing the machine using the console command from inside the virsh command shell.

The virsh command is a utility which lets you manage virtual machines from the command line. It must be run as root, but it makes it possible to ssh in remotely and undertake almost all virtual machine management from your favorite coffee shop, even if your ISP isn't very good about giving you upstream bandwidth. In my case, so far I haven't gotten vnc to be useable for much of anything, but ssh and virsh work fine.

When you ssh into your main host, su root and type "virsh", you will see something like:

virsh #

This is the virsh command prompt. Here you can type commands like "start pfennig", for example, to start the machine we created. You can also type "console pfennig", and if pfennig is running you will see a blank screen with a message across the top saying that the escape character is ^]. No login prompt will appear. We need to fix this. The console which virsh gives us is a serial console, so we will need to set the virtual machine up to start agetty on ttyS0.

Start the machine from virsh and then, in another terminal, get the console which it has exported over vnc by typing:

$ vncviewer localhost

This should bring up a vnc window of the virtual machine console. Log in with the password you gave for root during the install. Install setserial to check which serial port has been initialized by the virtual machine (almost certainly ttyS0).



# yum install setserial

Then to check:

# setserial -g /dev/ttyS[0123]
/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
/dev/ttyS1, UART: unknown, Port: 0x02f8, IRQ: 3
/dev/ttyS2, UART: unknown, Port: 0x03e8, IRQ: 4
/dev/ttyS3, UART: unknown, Port: 0x02e8, IRQ: 3

This output shows that ttyS0 is the only working port. Now append ttyS0 to the end of /etc/securetty, so that root is allowed to log in on that terminal, and create /etc/init/ttyS0.conf so that it looks like:

# /etc/init/ttyS0.conf
start on runlevel [345]
stop on runlevel [S016]

respawn
instance /dev/ttyS0
exec /sbin/agetty ttyS0 19200 vt100-nav

Next, we need to edit one more file. Though we told the installer to create a network connection and gave it an address, it evidently does not believe it should start the thing on boot. Probably I missed a check box. So now let's edit /etc/sysconfig/network-scripts/ifcfg-eth0. We need to change the line:

ONBOOT=no

To:

ONBOOT=yes

Restart the virtual machine by running /sbin/shutdown -r now from the console you have open and when it restarts try "console pfennig" from the virsh shell. Once again you will see a blank screen, but if you hit return you should get a login.

virsh # console pfennig
Connected to domain pfennig
Escape character is ^]

CentOS release 6.5 (Final)
Kernel 2.6.32-431.29.2.el6.x86_64 on an x86_64

pfennig login:

Now, we will be working from the installation guide for puppet on PuppetLabs, but skipping a few steps that aren't really needed just to test things.

First, we will need to punch port 8140 through the firewall that CentOS runs by default for puppet to work.

# iptables -n -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited 

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

# iptables -I INPUT 2 -p tcp --dport 8140 -m state --state NEW -j ACCEPT

# /sbin/service iptables save

First we list all the firewall rules with iptables -n -L, then we insert a rule to allow new connections on port 8140 as the second rule, and then we save the rules so they will come back after a reboot. Strictly speaking, only the puppet master has to accept incoming connections on 8140, since the agents are the ones that open the connection. Opening the port on farthing as well when we get that far does no harm, but it should not be required.
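
Once the puppet master is running (it gets started further down), one way to confirm that the new rule took effect is to attempt a TCP connection to port 8140 from another machine. The following is a minimal sketch in Python, which is already present on CentOS because yum depends on it. The function name port_is_open is made up for this illustration, and the host name puppet assumes the /etc/hosts alias described in the next step.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Refused, filtered by the firewall, timed out, or name not resolved
        return False
    conn.close()
    return True
```

Calling port_is_open("puppet", 8140) from farthing should return True while the master is running, and False if the firewall is still rejecting the connection or the master has been stopped.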

We will need to set up /etc/hosts on pfennig and on every other host that will be using puppet. The right way to do this would be to set up a DNS server, or use dnsmasq, but we are just trying to get a test system up and running and see a bit about how it works. Agents look for a master called puppet by default, so we need puppet as an alternate name for our virtual machine on every host that will use pfennig as its puppet master server.

# /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.161   pfennig.homenet puppet
192.168.0.162   farthing.homenet farthing

After we have completed these preliminaries, we need to enable the puppetlabs repo and then install puppet on pfennig. The repo is enabled by installing an rpm file. Since we are installing on CentOS 6 the command is:

# rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm

Then we need to install the puppet master server:

# yum install puppet-server

Don't start anything yet; we need to do a few things first. Edit /etc/puppet/puppet.conf, look for the [main] section, and set the variable dns_alt_names to the names that agents will use to reach the master on your virtual guest machine, so something like this but corrected for whatever names you have chosen:

dns_alt_names = puppet,pfennig,pfennig.example.net

Then we need to create a certificate on the puppet master server.

# puppet master --verbose --no-daemonize

This will print out a message beginning Notice: Starting Puppet master version, followed by the version number, at which point one can hit ctrl-C to kill the server. This is now a working test puppet master server. The guide on puppetlabs will insist that you need to install a production web server. Don't do this for a test system like this. For me, at least on Debian 7, the "Puppet with Passenger" install died partway through, and I had to go through a painful process to remove the apache version that had been installed. We don't need any of this to test puppet on a couple of virtual machines, since puppet includes its own built-in web server that will do fine to check things out.

Now, we need to start up farthing and see if we can get it running as a puppet agent node. Hit ctrl-] and you will be back at the virsh # prompt. From here simply type "start farthing" to get the guest we cloned earlier running, then wait for 10 or 15 seconds to be sure it has booted.

Since we haven't got our serial terminal set up on farthing yet, we will need to use virt-viewer or vnc to continue with our setup there. Let's try accessing it remotely with ssh just to see how this works. (You will of course need to forward port 22 through your router first. Using ssh with secure key login is also a very good idea.) First we need to truck off to the local coffee shop with our laptop, then we'll type:

ssh youruser@your.external.host.or.ip.com -L 5901:127.0.0.1:5901

And then in a different terminal of your laptop:

vncviewer localhost:5901

The reason for using port 5901 is that farthing is the second machine we started and so should use the second vnc port. If everything goes as planned, you should see a vnc terminal with a login for farthing. Now we need to repeat some of the steps we took to install puppet on the puppet master server, and then start things up and get pfennig to sign a certificate request from farthing. Log in to the vnc window for farthing as root, install the repo for puppetlabs, and then install the puppet agent:

# rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-6.noarch.rpm

# yum install puppet

Now you need to get the console for pfennig as well as farthing so that you can start the puppet master server if it is not already running and then send a certificate request from farthing. A simple secure shell into your host will let you run virsh, or you can forward port 5900 using the command above to start a second vnc tunnel. I kind of like using the console from virsh, since the overhead seems lower.

Start the puppet master on pfennig as root:

# puppet master

Then move to the terminal for farthing and type:

# puppet agent --test

This will send a certificate request to the puppet master server on pfennig. Now move back to the terminal of pfennig and type:

# puppet cert list

You should see the request from farthing for a certificate. Sign it, using the name exactly as puppet cert list printed it:

# puppet cert sign farthing

Now everything should be ready to go. We can start writing configs on pfennig and seeing if they get pushed out to farthing.

Now, we will build a directory tree on pfennig and create several files that should install setserial and the two config files we changed to get a serial terminal working. We will end up with a directory tree that looks like this:

Tree of /etc/puppet Files
 |-manifests-|-site.pp
 |
 |-modules---|-setserial-|-files-----|-securetty
                         |           |-ttyS0.conf
                         |
                         |-manifests-|-init.pp

This is a modified version of instructions I found on the puppetlabs website: http://projects.puppetlabs.com/projects/puppet/wiki/Advanced_Puppet_Pattern. Unfortunately, the site.pp file they give there does not run, and even after you fix the main error, it produces a half page of deprecation warnings. So I tried to fix things here as far as I could. First, we need to build the directory tree. The modules and manifests directories should exist from the install, but there will be nothing under them. We need to create a directory called setserial under the modules directory to hold our files: the securetty file, the ttyS0.conf file, and the init.pp file that will control what is installed.

# mkdir /etc/puppet/modules/setserial
# mkdir /etc/puppet/modules/setserial/files
# mkdir /etc/puppet/modules/setserial/manifests

# cp /etc/securetty /etc/puppet/modules/setserial/files
# cp /etc/init/ttyS0.conf /etc/puppet/modules/setserial/files

# vi /etc/puppet/modules/setserial/manifests/init.pp

We need to create init.pp so that it looks like this:

# /etc/puppet/modules/setserial/manifests/init.pp
class setserial {

  # Install the setserial utility from the configured yum repos
  package { 'setserial':
    ensure        => latest,
    allow_virtual => false,
  }

  # Allow root logins on the serial console
  file { '/etc/securetty':
    owner   => 'root',
    group   => 'root',
    mode    => '0400',
    source  => 'puppet:///modules/setserial/securetty',
    require => Package['setserial'],
  }

  # Upstart job that runs agetty on ttyS0
  file { '/etc/init/ttyS0.conf':
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    source  => 'puppet:///modules/setserial/ttyS0.conf',
    require => Package['setserial'],
  }
}

Next, we need to create the file site.pp. The documentation says that this does not need to be a single file, but can be a directory instead. No examples of this seem to be provided, only the somewhat opaque man page sort of documentation, so I won't begin to try showing how to do this.

# vi /etc/puppet/manifests/site.pp


# File - /etc/puppet/manifests/site.pp

# Node definitions used to be imported from a separate file so they
# did not all have to live in site.pp, but import is deprecated:
#
#import "nodes"
#
# Once any node is defined in this file, every node must match a
# definition (or fall through to default).

# The default node applies to any host with no matching node
# definition below.
node default {
  include setserial
}

# This creates a node and installs some custom utils all from this file.
# This works if you wish to manage everything from this one file.
# A host matches only one node definition, so default is not applied
# to farthing and setserial must be included here explicitly.
node 'farthing' {
  include custom_utils
  include setserial
}

class custom_utils {
  package { ['nmap', 'traceroute', 'vim-enhanced']:
    ensure        => latest,
    allow_virtual => false,
  }
}

# Try to manage pfennig, which is the puppet master server.
# This also works.
node 'pfennig' {
  include custom_utils
}

Next, we need to test our setup out. On pfennig run:

# puppet apply --noop /etc/puppet/manifests/site.pp

This will compile our manifests, but not actually run or install anything. If everything works you can type:

# puppet apply /etc/puppet/manifests/site.pp

In the present case, this should install traceroute, nmap and vim on pfennig. Next, we want to see if we can get setserial and our config files installed on farthing and get a serial console working, so we go to our vnc console for farthing and type:

# puppet agent -t

The puppet agent node should request updates every 30 minutes by default, but setting all this up would be the next stage if you were really going to manage a cluster of machines. We just want the thing to show us what it can do at the moment. With some luck farthing will now have setserial and the required config files installed. Since ttyS0.conf is an Upstart job, we will need to run:

# initctl start ttyS0

or reboot farthing to actually use the serial terminal. Now, something that I discovered that shows a bit more of the slickness of puppet. When I first started working on this post, I was thinking mostly of using puppet to install elasticsearch. To install elasticsearch one must install java first. Puppet has a module for java which can be installed quite simply:

# puppet module install puppetlabs-java

This installs a module from puppetlabs without any further work on your part. Then, to use it, we need to update /etc/puppet/manifests/site.pp:

# Add this class to site.pp since elasticsearch requires jdk and the module installs jre by default.
class { 'java':
        distribution => 'jdk',
}

# Alter the node definition of the host you wish to install java on so that it looks like this
node 'searchhost' {
  #include other_modules
  include java
}

I hope this is of some use to people just trying out puppet, and gives some idea of how it could be used.

Friday, September 5, 2014

Implementing the Bayes Spam Filter from the book "Doing Data Science"

I decided to have a try at implementing the spam filter described in "Doing Data Science" by Cathy O'Neil and Rachel Schutt. The book gives a fairly clear description of Bayes' Theorem, and a bash script which can calculate the probability of individual words being spam, but does not give any real direction for actually coding a multi-word Bayes filter. They do give the formula that such a filter will be based on, but don't really give any clues on how one would code it, apart from taking the log of both sides and then proving that Laplace smoothing is valid. This is not quite as user friendly as one might wish, but it does leave plenty to try out and learn.

First, we will need some data so we might as well fetch the Enron emails that they are using in the book for the bash script.

wget 'http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz'

I am assuming that everyone is running linux here and has the wget command installed. One could do this just as well from inside R. Next we will need to unpack the data.

tar xzvf enron1.tar.gz

This will unpack everything into a directory called enron1, with a directory called spam and another called ham underneath it. All the email messages are stored as text files within each directory, one per message. Now, the formula they give for a multi-word spam filter looks like this:

$$p(x|c)=\prod_j\theta_{jc}^{x_j}(1-\theta_{jc})^{(1-x_j)}$$

You can code the formula directly in R with something like this:

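pxGivenC <- function(theta, x) {
  # theta: probability of each dictionary word given the class
  # x: 0/1 vector marking which dictionary words occur in the message
  not_x <- 1 - x
  not_theta <- 1 - theta
  pxc <- 1
  # Multiply the per-word terms together over the whole dictionary
  for(i in seq_along(x)) {
    pxc <- pxc * theta[i]^x[i] * not_theta[i]^not_x[i]
  }
  return(pxc)
}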

But this will not work as well as one might hope, because the product of so many probabilities becomes so tiny that it falls off the end of the floating point implementation. If one tries this approach on the Enron dataset, the model guesses that everything is ham since the ratio of ham to spam is about 0.71 to 0.29. The book recommends taking the log of both sides and then solves out sections of the resulting formula which do not change for individual email messages, but it doesn't really explain where the formula came from. So let me try to explain what I have managed to figure out so far. You might want to download a reference given in the book, "Spam Filtering with Naive Bayes - Which Naive Bayes?" This is a math-heavy explanation of the various ways one can implement Bayes' Theorem, but it explains where the equation above came from, which "Doing Data Science" did not.
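
The underflow is easy to reproduce on made-up numbers. The following is a minimal sketch in Python rather than R, with an invented 5000-word dictionary in which every word has theta = 0.2 and every tenth word is present in the message. None of these values come from the Enron data. It compares the direct product with the same quantity accumulated as a sum of logs.

```python
import math


def px_given_c(theta, x):
    """Direct product form of p(x|c) for a 0/1 word-presence vector x."""
    p = 1.0
    for t, xi in zip(theta, x):
        p *= t if xi else (1.0 - t)
    return p


def log_px_given_c(theta, x):
    """The same quantity accumulated as a sum of logs."""
    return sum(math.log(t if xi else 1.0 - t) for t, xi in zip(theta, x))


# Invented example: 5000 words, theta = 0.2 for each, every tenth word present
n_words = 5000
theta = [0.2] * n_words
x = [1 if j % 10 == 0 else 0 for j in range(n_words)]

print(px_given_c(theta, x))      # 0.0: the product has underflowed
print(log_px_given_c(theta, x))  # finite, roughly -1809
```

Both classes would underflow to zero in the same way, so the product form leaves nothing to compare, while the log form keeps a usable number for each class.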

In the book they take the log of both sides of the formula, which is apparently necessary to have any useful output, and get:

$$\log(p(x|c))=\sum_jx_j\log(\theta_j/(1-\theta_j))+\sum_j(\log(1-\theta_j))$$
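
The step from the product to this sum is not spelled out. As a sketch of the algebra, writing $\theta_j$ for $\theta_{jc}$ with the class c held fixed, taking the log turns the product into a sum of two terms per word:

$$\log(p(x|c))=\sum_j\left[x_j\log(\theta_j)+(1-x_j)\log(1-\theta_j)\right]$$

Multiplying out the $(1-x_j)$ factor and collecting the terms that contain $x_j$ gives:

$$\log(p(x|c))=\sum_jx_j\left[\log(\theta_j)-\log(1-\theta_j)\right]+\sum_j\log(1-\theta_j)$$

Combining the difference of logs in the first sum into the log of a ratio gives the formula above.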

Most of this formula doesn't vary with the individual messages, so it can be extracted and solved once ahead of time:

$$w_j=\log(\theta_j/(1-\theta_j))$$

$$w_0=\sum_j(\log(1-\theta_j))$$

So substituting we get:

$$\log(p(x|c))=\sum_jx_jw_j+w_0$$
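
To check that the substitution really computes the same thing, here is a minimal sketch in Python (not the R code from this project) with an invented four-word dictionary. The names train_weights and log_likelihood are hypothetical, and the theta values are made up for illustration.

```python
import math


def train_weights(theta):
    """Return (w, w0) for one class from its vector of word probabilities."""
    w = [math.log(t / (1.0 - t)) for t in theta]
    w0 = sum(math.log(1.0 - t) for t in theta)
    return w, w0


def log_likelihood(w, w0, x):
    """log p(x|c) = sum_j x_j * w_j + w0 for a 0/1 word-presence vector x."""
    return sum(wj for wj, xj in zip(w, x) if xj) + w0


# Invented theta for the spam class, and a message containing words 1 and 4
theta_spam = [0.8, 0.1, 0.5, 0.3]
x = [1, 0, 0, 1]

w, w0 = train_weights(theta_spam)
print(log_likelihood(w, w0, x))
# The direct product for comparison: present words use theta, absent use 1 - theta
print(math.log(0.8 * 0.9 * 0.5 * 0.3))
```

The two printed values agree up to floating point rounding. Since w and w0 depend only on theta, they are computed once per class and reused for every message.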

Now, what does all this mean and what have we calculated in relation to Bayes' theorem? First, we haven't calculated what we were actually looking for, which would be p(spam|word) for all the words in an email message taken together. Instead we have calculated p(word|spam), where "word" stands for the whole collection of words in the message: the probability of finding this collection of words, each at least once, in an email message we know to be spam. This is evidently the hard part according to the book, but in my case at least, I found a few other confusing things.

First, let's go over all the steps we need to take before we can even apply this formula.

We need to count all the words that occur in spam and ham email and turn this into a dictionary with the counts of each word. But this is not a count of how many times we found a word in spam messages and how many times we found it in ham messages. Instead, we are counting the number of spam messages in which we found this word at least once, and then counting the number of ham messages in which we found the same word at least once, and saving these counts. This information is used to construct theta in the formula above. Theta is a long vector of probabilities of finding each and every word in the dictionary in a spam email. According to the book, it is constructed by this formula: (count of spam emails containing word)/(count of all emails containing word). On the other hand, the reference I mentioned above gives the formula as (count of spam emails containing word)/(count of spam emails). The second form would seem to be the actual probability p(word|spam), while I haven't quite figured out exactly what the first form is. Maybe it allows one to calculate the denominator of Bayes' Theorem more easily, but the authors do not explain this in the book.
After we have built a great deal of infrastructure to let us read in all the email messages of both classes and count how many messages of each class contain each word, we need to calculate theta by one or the other formula, and then calculate w and w0 for our training set. All these counts and variables are determined by our training data and have no dependence on any testing email we are classifying. As near as I can see, the process of counting the number of emails of our training set in which we find each word is the entire training process.
Next, we need to read in a testing email, parse it into individual words, and construct an x vector of zeros which is as long as our dictionary. We will flip the x vector to a one at every index of a word in our dictionary which we find in the email message we are testing.
Now we can finally load the values into the formula above and derive the equivalent of log(p(word|spam)) for all the words in our test email message. This produced outputs like -1084.43 for me, as well as many cases of -Inf and NaN. In order to avoid the values of NaN and -Inf, one must ensure that none of the words have a probability of zero with regard to either ham or spam messages. To do this, one must alter the formula used to calculate theta by adding a small alpha to the numerator and a small beta to the denominator of the ratio of counts. According to the book, alpha and beta should be some value < 0.2, while according to the online source alpha = 1 and beta = 2 is the correct choice. The first problem, the relatively large negative number, is more perplexing. This is the log of a probability, so to calculate the probability of finding our email message x given that we knew it was spam we would have to take the exponential of -1084.43, and this will be zero according to R. Assuming we have set up everything right on the way to here, we will need to find some way of turning this into p(spam|x) that doesn't just involve running it through exp().
Since we can't just take the exponential, we will need to keep everything as a log until we have solved the denominator of Bayes' Theorem, and then hopefully we will have something that looks more like a probability. Looking online I find a log identity that seems useful: log(a + b) = log(a) + log(1 + exp(log(b) - log(a))). This can be used to construct the denominator of Bayes' Theorem without taking the anti-log, and so after a few algebra incidents we can see things which look like valid probabilities come out of our little dummy Bayes' Classifier.
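
The last two steps can be sketched in isolation. This is a minimal illustration in Python, not the R program from this project. It assumes the usual two-class form of Bayes' Theorem, p(spam|x) = p(x|spam)p(spam) / (p(x|spam)p(spam) + p(x|ham)p(ham)). The function names are hypothetical, applying alpha = 1 and beta = 2 to the (count of spam emails containing word)/(count of spam emails) form is an assumption, and the ham log-likelihood of -1090.0 is an invented number to pair with the -1084.43 quoted above.

```python
import math


def smoothed_theta(n_with_word, n_messages, alpha=1.0, beta=2.0):
    """(count + alpha) / (total + beta), which is never exactly 0 or 1."""
    return (n_with_word + alpha) / (n_messages + beta)


def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), staying in log space."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a  # keep the exp() argument <= 0
    return log_a + math.log1p(math.exp(log_b - log_a))


def p_spam_given_x(log_px_spam, log_px_ham, p_spam):
    """Bayes' Theorem with every term kept as a log until the final step."""
    log_num = log_px_spam + math.log(p_spam)
    log_other = log_px_ham + math.log(1.0 - p_spam)
    log_denominator = log_add(log_num, log_other)
    # The difference of two large negative logs is small, so exp() is safe here
    return math.exp(log_num - log_denominator)


print(smoothed_theta(0, 1500))                   # small but not zero
print(p_spam_given_x(-1084.43, -1090.0, 0.29))   # about 0.99
```

Only the difference between the numerator and denominator logs is ever passed to exp(), which is why the result is a usable probability even though exp(-1084.43) on its own is zero.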

The code for the entire project is on github at https://github.com/joelkr/BayesFilter.git. The code is in R and it works, but the section which parses and builds the dictionary is seriously slow. So I wrote a little dummy program, which implements everything apart from the parsing, to try to figure out why I wasn't getting valid probabilities out of my huge Rube Goldberg parsing contraption. It creates a dummy dictionary out of two letter tokens and then constructs some email vectors with spam emails mostly having odd tokens and ham messages mostly having even tokens.
