Friday, March 3, 2017

Using Docker to Create a Consistent Development Environment for Kaggle

I have been struggling with learning to make effective submissions to Kaggle. One thing holding me back was that I had to completely rebuild my Python environment between my laptop and my desktop at home. Another problem involved RAM and general low-power considerations. I often thought about loading the Kaggle data into a database, but I would struggle with installing Postgres or MySQL, and then discover that their SQL dialects had almost nothing in common. Since I had also been looking for a project to learn more about Docker, why not use Docker to set up a consistent environment? After many trials and missteps, and much searching, I found a Docker image that seemed a good starting point: the wise.io data science docker image. After quite a number of rebuilds and tests, I came up with this Dockerfile:

FROM ubuntu:16.04
#FROM ubuntu:14.04
# Extended and moved to python 3 based on image:
#MAINTAINER Wise.io, Inc. <help@wise.io>
ENV DEBIAN_FRONTEND noninteractive
ENV PATH /anaconda/bin:$PATH
# For image inheritance.
ONBUILD ENV PATH /anaconda/bin:$PATH
# Install packages ... change the timezone line if you're not in Pacific time
RUN apt-get -y update && apt-get install -y wget nano locales curl unzip openssl libhdf5-dev libpq-dev \
python3-pip python3-dev python3-numpy python3-scipy python3-matplotlib \
ipython3 ipython3-notebook python3-pandas python3-nose \
python3-dateutil \
&& apt-get clean && dpkg-reconfigure locales && locale-gen en_US.UTF-8 \
&& echo "America/Los_Angeles" > /etc/timezone && dpkg-reconfigure --frontend noninteractive tzdata \
&& apt-get autoremove \
&& pip3 install scikit-learn==0.18.1 nltk==3.2.1 pytest==3.0.5 \
&& python3 -m nltk.downloader -d /usr/local/share/nltk_data punkt book \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Get the pip packages and clean up
ADD requirements.txt /
RUN pip3 install -r /requirements.txt && rm -rf /root/.cache/pip/*
ENV LANGUAGE en_US.UTF-8
ENV LANG en_US.UTF-8
# $PASSWORD will get `unset` within notebook.sh, turned into an IPython style hash
ENV PASSWORD Dont make this your default
ENV USE_HTTP 1
ADD notebook.sh /notebook.sh
RUN chmod a+x /notebook.sh
EXPOSE 8888
RUN useradd --user-group --create-home --shell /bin/false workspace
ENV HOME=/home/workspace
ADD . $HOME
WORKDIR $HOME
ENV PEM_FILE $HOME/key.pem
COPY ./.keras $HOME/.keras
RUN chown -R workspace:workspace $HOME/*
USER workspace
CMD ["/notebook.sh"]
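To build the image, the files the Dockerfile references have to sit next to it in the build context: requirements.txt, notebook.sh, key.pem, and the .keras config directory. With those in place the build is one command, tagged to match the image: line in the docker-compose.yml below:

$ docker build -t my-data-science-docker:0.1 .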

Next, I needed to wire this environment together and choose a container for my database experiments. I settled on the Alpine version of the standard Postgres image, for the excellent open-source reputation of Postgres, and for its unparalleled ability to cause me frustration whenever I attempt to write SQL for it. To tie the notebook and database containers together, I wrote a docker-compose.yml:

notebook:
  image: my-data-science-docker:0.1
  ports:
    - "80:8888"
  environment:
    - WISEDS_CODE_DIR=${PWD}
    - WISEDS_DATA_DIR=${PWD}/data
    # - IPYTHON_PASSWORD=WhoMe321
    - PASSWORD=WhoMe321
    - KERAS_BACKEND=theano
  volumes:
    - .:/home/workspace/
    # - ./data:/home/workspace/data
  links:
    - db
db:
  image: postgres:9.5-alpine
  environment:
    - POSTGRES_PASSWORD=WhoMe321
  volumes:
    - ./pgdata:/var/lib/postgresql/data

Park the docker-compose.yml file in a directory with enough disk space for Kaggle-sized data, create the ./data and ./pgdata directories, and you should be good to go.
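In other words, something like:

$ mkdir -p data pgdata
$ docker-compose up -d

That brings up both containers; the notebook lands on port 80 of the host, per the ports: mapping above.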

A handy bit to know for creating databases in the Postgres container, and for checking that data has actually been saved (Compose names the container after its directory, so substitute your own directory name for yourdir):

$ docker exec -it yourdir_db_1 sh

/ # psql -U postgres

postgres=# CREATE DATABASE demo;

...and so on for all your favorite SQL.
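
Back in a notebook cell, you can then push a Kaggle CSV into that demo database over the linked db host. A minimal sketch, assuming sqlalchemy and psycopg2 made it into requirements.txt, with train.csv standing in for whatever competition file you downloaded:

# Load a Kaggle CSV into the linked Postgres container from the notebook.
# "db" resolves to the Postgres container via the links: entry in
# docker-compose.yml; "demo" is the database created above.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:WhoMe321@db:5432/demo")

df = pd.read_csv("data/train.csv")  # placeholder file name
df.to_sql("train", engine, if_exists="replace", index=False)

# Read a row count back to confirm the data really landed in Postgres.
print(pd.read_sql("SELECT COUNT(*) FROM train", engine))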