How to get started with data science in containers


http://blog.kaggle.com/2016/02/05/how-to-get-started-with-data-science-in-containers/



The biggest impact on data science right now is not coming from a new algorithm or statistical method. It’s coming from Docker containers. Containers solve a bunch of tough problems simultaneously: they make it easy to use libraries with complicated setups; they make your output reproducible; they make it easier to share your work; and they can take the pain out of the Python data science stack.

We use Docker containers at the heart of Kaggle Scripts. Playing around with Scripts can give you a sense of what you can do with data science containers. But you can also put them to work on your own computer, and in this post I’ll explain how.

Why use containers?

Containers are like ultralight virtual machines. When you restore a normal VM from a snapshot it can take a minute or so to get going, but Docker containers typically start in a fraction of a second. So you can run something inside a container just like you’d run a native binary. Every time you restart the container, its execution environment is identical, which gives you reproducibility. And containers run identically on OS X, Windows and Linux, so collaborating and sharing becomes much easier than before.

Personally, I think the best thing about containers is that they eliminate the pain of using Python for data science. R and Python are both great for statistics, each with its own strengths and weaknesses, but one striking difference between them is in how they handle libraries and packages. R’s install.packages() mechanism works very smoothly, and conflicts between packages are rare. If you come across a new piece of work that uses a library you don’t have on your system, you can install it from CRAN and be underway in a few moments.

What a contrast with Python. In the Python world, a typical workflow would be something like this: notice that you need library X, so call pip install X, which also installs dependencies A, B and C. But B already exists on your system via easy_install, so pip cancels itself but only partially removes the new stuff, then import B refuses to work ever again. Or you discover that C relies on a later build of numpy, which you install, only to discover that libraries Y and Z are linked to an older numpy library that just got stomped on. And so on, and so on.

Python installations gradually accrete problems like this, with conflicts building up between libraries, and further conflicts between separate Python setups on the same system. The virtualenv system helps a little, but in my experience it just delays the crash. Eventually you reach a point where you have to completely reinstall Python from scratch. And that’s not to mention the hours you can spend getting a new library to work.

If you use Python in a container instead, all those problems vanish. You only have to invest time once in setting up the container: once the build is complete, you’re all set. In fact, if you use one of Kaggle’s containers, you don’t need to worry about building anything at all. And you can try out new packages without any hassles, because as soon as you exit a container session, it resets itself to a pristine state.

What’s in them exactly?

To run Kaggle Scripts, we put together three Docker containers: kaggle/rstats has an R installation with all of CRAN and a dozen extra packages, kaggle/julia has a recent build of Julia 0.5 with a set of data science libraries installed, and kaggle/python is an Anaconda Python setup with a large set of libraries. To see the details of what’s inside, you can browse the Dockerfiles that are used to build them, which are all open source. We had to split them up into several parts so we could auto-build them on Docker Hub: here are links to Python part 1, part 2, part 3; rcran0 to 22, and rstats; and Julia part 1, part 2.

One side note: we only support Python 3. I mean come on, it’s 2016.

How to get started

Here’s a recipe for setting up the Python container locally. These exact steps are for OS X, but the Windows or Linux equivalents are easy to figure out if you rtfm.

Step one is to head over to the Docker site and install Docker on your system. They’ve made the install process very easy, so that shouldn’t take more than the twinkling of an eye.

Step two: the default install creates a Linux VM to run your containers, but it’s quite small and struggles to handle a typical data science stack. So make a new one, which in this example I’ll call docker2.

$ docker-machine create -d virtualbox --virtualbox-disk-size "50000" --virtualbox-cpu-count "4" --virtualbox-memory "8092" docker2

Obviously, you can tailor the disk-size, cpu-count and memory numbers for your system. Step three: start it up.

$ docker-machine start docker2
$ eval $(docker-machine env docker2)

Later, if you open a new terminal window and Docker complains about Cannot connect to the Docker daemon. Is the docker daemon running on this host? then rerunning those two lines should sort it out.
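Those two lines can also be wrapped in a small function so that reconnecting is a single command. The following is only a minimal sketch and not part of the original recipe: the name dconnect is made up for illustration, and it assumes the VM is called docker2 as in the steps above. It relies on docker-machine status to skip the start when the VM is already running.

```shell
# dconnect: hypothetical helper wrapping the two reconnect lines above.
# Assumes the VM created earlier is named docker2 unless another name is given.
dconnect() {
  # Use the first argument as the machine name, defaulting to docker2
  set -- "${1:-docker2}"
  # docker-machine status prints a state such as "Running" or "Stopped"
  if [ "$(docker-machine status "$1" 2>/dev/null)" != "Running" ]; then
    docker-machine start "$1" || return 1
  fi
  # Export DOCKER_HOST and related variables into the current shell
  eval "$(docker-machine env "$1")"
}
```

With this in your .bashrc, typing dconnect in a fresh terminal should have the same effect as rerunning the two commands by hand.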

Step four: pull the image you want to use.

$ docker pull kaggle/python

You’re now at a point where you can run stuff in the container. Here’s an extra step that will make it super easy: put these lines in your .bashrc file (or the Windows equivalent)

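# Run Python inside the container; any arguments are passed through.
kpython() {
  docker run -v $PWD:/tmp/working -w=/tmp/working --rm -it kaggle/python python "$@"
}
# Start an IPython session inside the container.
ikpython() {
  docker run -v $PWD:/tmp/working -w=/tmp/working --rm -it kaggle/python ipython
}
# Start a Jupyter notebook server and open it in the browser after 3 seconds.
# Several commenters below report needing --ip="0.0.0.0" instead of --ip="*".
kjupyter() {
  (sleep 3 && open "http://$(docker-machine ip docker2):8888")&
  docker run -v $PWD:/tmp/working -w=/tmp/working -p 8888:8888 --rm -it kaggle/python jupyter notebook --no-browser --ip="*" --notebook-dir=/tmp/working
}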

Now you can use kpython as a replacement for calling python, ikpython instead of ipython, and run kjupyter to start a Jupyter notebook session. All of them will have immediate access to the complete data science stack that Kaggle assembled.
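One caveat: kjupyter calls open, which is an OS X command. On other systems a rough equivalent could be picked from the output of uname -s. The sketch below is an illustration under assumptions, not a tested cross-platform solution: url_opener and open_url are made-up names, xdg-open is the usual opener on Linux desktops, and start is assumed to be available in Git Bash style shells on Windows.

```shell
# url_opener: print the command that opens a URL on the given platform.
# Hypothetical helper; takes the output of `uname -s` as its argument.
url_opener() {
  case "$1" in
    Darwin)       echo "open" ;;      # OS X, as used in kjupyter
    Linux)        echo "xdg-open" ;;  # most Linux desktops
    MINGW*|MSYS*) echo "start" ;;     # Git Bash style shells on Windows (assumption)
    *)            echo "" ;;          # unknown platform: no opener
  esac
}

# open_url: open a URL with the platform's opener, or print it if none is found.
open_url() {
  opener="$(url_opener "$(uname -s)")"
  if [ -n "$opener" ] && command -v "$opener" >/dev/null 2>&1; then
    "$opener" "$1"
  else
    echo "Open this address in your browser: $1"
  fi
}
```

Inside kjupyter, the open call could then be swapped for open_url "http://$(docker-machine ip docker2):8888".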

I hope you enjoy using these containers as much as I have. And let me just add one more plug for Kaggle Scripts—it’s a great way to share ideas and show off what you’ve made.

P.S. Here’s some more detail on how the .bashrc entries work. The three commands are Bash functions. The syntax docker run ... kaggle/python X will execute command X inside the Kaggle Python container. You give the container session access to the directory that you’re currently in by adding -v $PWD:/tmp/working, and for convenience -w=/tmp/working makes the session start in that working directory. The --rm switch tidies up the container session after you exit. By default, Docker sessions hang around in case you want to do a post-mortem on them. Finally, the -it means that the container’s stdin, stdout and stderr will be attached to your terminal. There are many other options that you can use, but I’ve found those to be the most useful.
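To make the -v and -w mapping concrete, here is a small sketch that prints where a file under your current directory would appear inside the container. The helper name container_path is hypothetical, and it assumes the -v $PWD:/tmp/working mount described above; it only manipulates path strings and never calls Docker.

```shell
# container_path: hypothetical helper showing where a host file appears
# inside the container when the current directory is mounted at /tmp/working.
container_path() {
  case "$1" in
    /*) abs="$1" ;;        # already an absolute host path
    *)  abs="$PWD/$1" ;;   # relative to the directory you launched from
  esac
  case "$abs" in
    "$PWD"/*)
      rel=${abs#"$PWD"/}   # strip the host prefix
      echo "/tmp/working/$rel" ;;
    "$PWD")
      echo "/tmp/working" ;;
    *)
      # Anything outside the current directory is not visible in the container
      echo "not mounted: $abs" >&2
      return 1 ;;
  esac
}

container_path data/train.csv   # prints /tmp/working/data/train.csv
```

This is also why the directory you launch from matters: only files at or below it are visible to the container session.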

Jamie Hall is a data scientist and engineer at Kaggle. This article is cross-posted from his personal blog.

  • Pierre-Alain

    Excellent post ! Thanks a lot.
    It really *was* a pain to install a python data science stack.

    Note : I had to change --ip="*" by --ip="0.0.0.0" in .bash_profile to make the kjupyter command work (Mac)

    • Badrul Alom

      I did this, but now I'm getting an error when I try to launch kjupyter: Couldn't get a file descriptor referring to the console
      and going to http://0.0.0.0:8888/ just tells Firefox couldn't connect

    • Fei Zhan

      Awesome. Solved my problem as well.

    • Johnny Chan

      In addition to the ip change (great thanks for this!), make sure `/tmp/working` exists. If not, create it with `mkdir /tmp/working`. Now when you run `kjupyter` you may copy and paste the url from console to a browser: `The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=xxxxxxxxxxxxxxx`. (I notice that the automatic browser pop-up does not include the token bit. You need to manually copy and paste the entire URL string, including the token part, into the browser.)

  • Michał Wajszczuk

    Thanks for insights about Docker!

    I have a question: what is the size of the kaggle/python image? My SSD has limited space.

    • T. Morgan

      For me, the image has proven to be about 15GB. It's huge.

  • Diego Menin

    Hi, I'm confused about the "$PWD:/tmp/working -w=/tmp/working"; Where is that tmp/working folder supposed to be?, I couldn't find it anywhere. I imagine that's where the object on the starting page should live, right?

    • Gabi Huiber

      It seems to me that this is your present working directory in the Docker virtual environment. If this recipe worked for you, when you do 'pwd' you will still see your current pwd path on the host, and no /tmp/working anywhere. But when you go to the kpython prompt, os.getcwd() will return /tmp/working.

  • Alex Telfar

    Hmm. Don't know what I have done wrong, but I can't seem to get the jupyter notebooks working in the docker container. When I run your command (kjupyter), I get

    socket.gaierror: [Errno -2] Name or service not known

    and it tries to take me to some random IP which fails.

    I also tried launching it from within kaggle/python environment and i get

    No web browser found: could not locate runnable browser.

    Any pointers? (using mac and the other commands work fine...)

    • Dario Lopez Padial

      I resolved it in kjupyter with --ip="0.0.0.0"

  • Samir

    Do as per Pierre-Alain suggested for MAC user:

    change --ip="*" by --ip="0.0.0.0" in .bash_profile to make the kjupyter command work (Mac)

  • Jenny Yu

    Hi, I downloaded Docker Toolbox (my PC is Windows 7), and followed your example to pull the kaggle/python. I've tried multiple times, but it always freezes (see picture attached). Is there a way around this problem? Thanks.

    • César Palma Morante

      Did you solve this?

      • Jenny Yu

        No i didn't solve it. Still a problem .

    • Sergio Casca

      It froze me once because the partition where I was storing the docker images ran out of free space. Hope it's the same simple case.

  • Adam Levin

    Warning, if you have less than 8GB of ram on the machine you try to install this on, you are in for a wild ride.

  • D8amonk

    Any windows users looking to add those commands, remember you've got to vim a .bashrc file with the above (last) snippet pasted in, and then also vim a .bash_profile containing the single line `. .bashrc` so it gets run every time you open the docker quickstarter.

  • Andrey Akhmetov

    Hey Guys! Was anybody able to run notebooks on Ubuntu/other linux?

    • Daniele

      It's working for me changing --ip parameter:

      docker run -v `pwd`:/tmp/working -w=/tmp/working -p 8888:8888 --name kaggle --rm -it kaggle/python jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/working

  • M. K.

    Hi, Anyone knows how to access jupyter notebook once the connexion is launched? Since bashrc include --no-browser, I appreciate we need to launch the dashboard manually, but how exactly?
    My prompt windows says 'The Jupyter Notebook is running at: http://0.0.0.0:888/'. But when I type this into my browser (Chrome), it tells me it's not accessible. Any help would be greatly appreciated.
    Please note:
    - kpython and ikpython work fine
    - I have Windows
    - I have changed ip="*" by --ip="0.0.00" as suggested. Tried 127.0.0.0 as I thought 0.0.0.0 is a Mac-only address, but same issue
    - prompt window message ends with "~/.bashrc: line 8: open: command not found" not sure if it's related to the --no-browser thing but thought it could help diagnostic what's wrong

  • John Zhu

    This worked for me on Mac

  • John Zhu

    FOR MACS:

    from:
    --ip="*"

    to:
    --ip="0.0.0.0" in .bash_profile

  • Shan Lin

    The image that get pulled locally doesn't contain any dataset. How do I retrieve a dataset from Kaggle?

  • Amit

    Is there any instruction for setup in docker for mac?

  • Anneloes Louwe

    Nice post! One question: I have TensorFlow installed and working on my (host) computer. However, when I run TensorFlow inside the kaggle container, it uses only CPU. Does anyone know how to fix this?

  • Vincent

    I got the error "docker: Error response from daemon: invalid bind mount spec ..." on my Windows 10. Anyone knows how to solve the problem?

  • tanventure

    Thanks for your notes, very interesting. Just want to let you know the links above: links to Python part 1, part 2, part 3; rcran0 to 22, and rstats; and Julia part 1, part 2, are all broken. Please take a look and I am keen to read them.

    Tanventure

  • Johnny Chan

    Does it mean we need to store all notebooks and kaggle datasets under `/tmp/working`? (and what if the mac gets rebooted and `/tmp` gets flushed away? I'm keen to store both notebooks and datasets somewhere under my local `$HOME` directory. The problem I'm facing is that within the kjupyter notebook environment I'm only allowed to "see" `/tmp/working` (i.e. can't get to my `$HOME` on the mac). Any top tip I would be very grateful!

    • Johnny Chan

      Ahhh... I have just solved the problem! The key is the current directory where you invoke the `kjupyter` command. E.g. if I invoke `kjupyter` at `/Users/johnny/kaggle`, then that directory and all its subdirectories are "mapped" to `/tmp/working/` on the docker machine.

  • Andrew Nyago

    I've been running docker run --rm -it kaggle/rstats for two days now (internet is slightly slow). I've got all the parts, but there's a file f0b24ff7f2aa that is currently at 6GB and doesn't show how much is left.

    Can someone please tell me the maximum size of that file?
