I have been struggling to learn how to make effective Kaggle submissions. One thing holding me back was that I had to completely rebuild my Python environment between my laptop and my desktop at home. Another problem involved RAM and general low-power considerations. I often thought I might try loading the Kaggle data into a database, but I would struggle with installing Postgres or MySQL, and then discover that the SQL dialects of the two had little in common. I had also been looking for a project that would let me learn more about Docker, so why not use Docker to set up a consistent environment? After many trials and missteps, and much searching, I found a Docker image that seemed a good starting point here: wiseio data science docker. After quite a number of rebuilds and tests, I came up with this Dockerfile:
FROM ubuntu:16.04
#FROM ubuntu:14.04
# Extended and moved to python 3 based on image:
#MAINTAINER Wise.io, Inc. <help@wise.io>
ENV DEBIAN_FRONTEND noninteractive
ENV PATH /anaconda/bin:$PATH
# For image inheritance.
ONBUILD ENV PATH /anaconda/bin:$PATH
# Install packages ... change the timezone line if you're not in Pacific time
RUN apt-get -y update && apt-get install -y wget nano locales curl unzip wget openssl libhdf5-dev libpq-dev \
    python3-pip python3-dev python3-numpy python3-scipy python3-matplotlib \
    ipython3 ipython3-notebook python3-pandas python3-nose \
    python3-dateutil \
    && apt-get clean && dpkg-reconfigure locales && locale-gen en_US.UTF-8 \
    && echo "America/Los_Angeles" > /etc/timezone && dpkg-reconfigure --frontend noninteractive tzdata \
    && apt-get autoremove \
    && pip3 install scikit-learn==0.18.1 nltk==3.2.1 pytest==3.0.5 \
    && python3 -m nltk.downloader -d /usr/local/share/nltk_data punkt book \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Get the pip packages and clean up
ADD requirements.txt /
RUN pip3 install -r /requirements.txt && rm -rf /root/.cache/pip/*
ENV LANGUAGE en_US.UTF-8
ENV LANG en_US.UTF-8
# $PASSWORD will get `unset` within notebook.sh, turned into an IPython style hash
ENV PASSWORD Dont make this your default
ENV USE_HTTP 1
ADD notebook.sh /notebook.sh
RUN chmod a+x /notebook.sh
EXPOSE 8888
RUN useradd --user-group --create-home --shell /bin/false workspace
ENV HOME=/home/workspace
ADD . $HOME
WORKDIR $HOME
ENV PEM_FILE $HOME/key.pem
COPY ./.keras $HOME/.keras
RUN chown -R workspace:workspace $HOME/*
USER workspace
CMD ["/notebook.sh"]
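The Dockerfile expects a requirements.txt, a notebook.sh, and a .keras directory to sit next to it in the build context. To make the image visible to the docker-compose.yml below, it has to be built and tagged with whatever name the compose file references; on my setup that is roughly:
$ docker build -t my-data-science-docker:0.1 .
The exact tag doesn't matter, as long as the Dockerfile build and the compose file agree on it.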
Next, I needed a consistent environment and a container for my database experiments. I settled on the Alpine version of the standard Postgres image, for the excellent open-source reputation of Postgres, and its unparalleled ability to cause me frustration when I attempt to write SQL for it. To wire the two containers together I wrote a docker-compose.yml:
notebook:
  image: my-data-science-docker:0.1
  ports:
    - "80:8888"
  environment:
    - WISEDS_CODE_DIR=${PWD}
    - WISEDS_DATA_DIR=${PWD}/data
    # - IPYTHON_PASSWORD=WhoMe321
    - PASSWORD=WhoMe321
    - KERAS_BACKEND=theano
  volumes:
    - .:/home/workspace/
    # - ./data:/home/workspace/data
  links:
    - db
db:
  image: postgres:9.5-alpine
  environment:
    - POSTGRES_PASSWORD=WhoMe321
  volumes:
    - ./pgdata:/var/lib/postgresql/data
Park the docker-compose.yml file in a directory with Kaggle-scale disk space available, create ./data and ./pgdata directories alongside it, and you should be good to go.
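Bringing the whole stack up then looks roughly like this (assuming docker-compose is installed; the ports mapping above publishes the notebook on port 80 of the host):
$ mkdir -p data pgdata
$ docker-compose up -d
Point a browser at the host on port 80; the PASSWORD value from the compose file should be the notebook login, since notebook.sh turns it into an IPython-style password hash.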
A handy bit to know for creating databases in the Postgres image, and checking that data has actually been saved:
$ docker exec -it yourdir_db_1 sh
/ # psql -U postgres
postgres=# CREATE DATABASE demo;
.... and so on for all your favorite SQL
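Because the compose file bind-mounts ./pgdata into the database container, anything you create this way survives the container being removed and rebuilt. A quick way to convince yourself (the container name is a guess here; compose prefixes it with the directory the docker-compose.yml lives in):
$ docker-compose stop db && docker-compose rm -f db
$ docker-compose up -d db
$ docker exec -it yourdir_db_1 psql -U postgres -c "\l"
The demo database should still show up in the listing, and the underlying files are visible on the host under ./pgdata.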