Using Python Subprocess To Drive Machine Learning Packages
JeanFrancoisPuget
A lot of state-of-the-art machine learning algorithms are available as open source software. Many open source packages are designed to be used via a command line interface. I much prefer to use Python, as I can mix many packages together and use a combination of NumPy, Pandas, and Scikit-Learn to orchestrate my machine learning pipelines. I am not alone, and as a result many open source machine learning packages provide a Python API.
I'd like to be able to use these packages and other command line packages from within my favorite Python environment. What can I do?
The answer is to use a very powerful module from the Python standard library, namely subprocess. Note that I am using Python 3.5 with Anaconda on a MacBook Pro. What follows also runs on Windows 7 if you use commands available in a Windows terminal, for instance dir instead of ls. Irv Lustig has checked that the same approach runs fine on Windows 10; see his comment at the end of the blog.
The first thing to do is to import the package, with import subprocess.
We can then try it, for instance by listing all the meta information we have on a given data file named Data/week_3_4.vw:
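Here is a minimal sketch of such a call, assuming a macOS or Linux system where ls is available. The helper name list_file_metadata is hypothetical and only serves the illustration; universal_newlines=True is used so that the captured output comes back as text rather than bytes.

```python
import subprocess


def list_file_metadata(path):
    """Return the `ls -l` listing for `path` as text (hypothetical helper)."""
    completed = subprocess.run(
        ["ls", "-l", path],        # command and arguments as a list of strings
        stdout=subprocess.PIPE,    # capture standard output instead of printing it
        universal_newlines=True,   # decode the captured bytes into str
    )
    return completed.stdout
```

With the data file used in this post, the call would be print(list_file_metadata("Data/week_3_4.vw")).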
Let's analyze the code we executed. subprocess.run runs a command in a sub process, as its name suggests. The command is passed as the first argument, here a list of strings. I could have passed a single string such as "ls -l Data/week_3_4.vw" instead, although on macOS and Linux a single string containing arguments only works together with shell=True. The Python documentation says that passing a sequence of arguments is generally preferred, as the module then takes care of any required escaping and quoting.
The subprocess.run function returns a CompletedProcess object that can be stored for later use. We can also use it immediately to retrieve the output of our command. For this we need to pipe the standard output of the command to the stdout property of the object returned by subprocess.run. This is done with the second argument stdo
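If you start from a command written as one string, the standard library can split it for you. The sketch below uses shlex.split, which is not part of the original post, to turn the string form into the list form expected by subprocess.run.

```python
import shlex

# The command as one would type it in a terminal.
command = "ls -l Data/week_3_4.vw"

# shlex.split breaks it up following POSIX shell quoting rules.
args = shlex.split(command)
print(args)  # ['ls', '-l', 'Data/week_3_4.vw']
```

The resulting list can be passed directly as the first argument of subprocess.run.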
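To make the returned object more concrete, here is a small sketch that inspects a CompletedProcess. It assumes nothing beyond a working Python interpreter: the sub process is simply another Python process printing one line, a stand-in chosen so that the example runs on any platform.

```python
import subprocess
import sys

# Run a tiny Python program as the sub process (stand-in command).
completed = subprocess.run(
    [sys.executable, "-c", "print('hello from the sub process')"],
    stdout=subprocess.PIPE,  # without this, completed.stdout would be None
)

print(completed.args)        # the command that was run
print(completed.returncode)  # 0 when the command succeeded

# The captured output is bytes by default, so decode it before use.
text = completed.stdout.decode().strip()
print(text)
```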
A similar example on Windows 7 (I guess Windows 10 would be the same):
Running dir this way outputs a string containing the contents of the default directory for your Python script. Note that we must use the shell=True argument in this case, because dir is a command built into the Windows shell rather than a standalone executable. Python first launches a shell, then runs the command in that shell.
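A possible version of this directory listing is sketched below. The helper name list_directory is hypothetical, and the ls branch is only there so that the same sketch also runs outside Windows; on Windows the command is dir, which needs shell=True.

```python
import os
import subprocess


def list_directory(directory="."):
    """Return the listing of `directory` as text (hypothetical helper)."""
    # dir is built into the Windows shell; ls is used elsewhere.
    command = "dir" if os.name == "nt" else "ls"
    completed = subprocess.run(
        command,
        shell=True,                # launch a shell, then run the command in it
        cwd=directory,             # directory in which the command runs
        stdout=subprocess.PIPE,
        universal_newlines=True,   # return text rather than bytes
    )
    return completed.stdout
```

Calling print(list_directory()) shows the contents of the current working directory.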
Let's now run Vowpal Wabbit. We assume that the Vowpal Wabbit executable vw is accessible in our system path. One way to check it is to just type vw in a terminal. The following code snippet runs it with the above data file as input:
Let's look at the code we wrote. We store the CompletedProcess object returned by the subprocess.run command in a variable for later use. We then print the stdout property of that object. We redirect the standard error as well as the standard output, with the argument stde
We want to run Vowpal Wabbit in a shell, as this is what it expects. This is done via the shell=True argument.
Last, we pass check=True in order to raise a Python exception, subprocess.CalledProcessError, if the return code of the sub process command is different from 0. Without it, a failed command goes unnoticed unless we inspect the returncode property ourselves.
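The options discussed above can be combined as in the following sketch. The helper name run_and_capture is hypothetical, and the Vowpal Wabbit invocation shown in the comment is an assumption, since the exact command line depends on your setup.

```python
import subprocess


def run_and_capture(command):
    """Run `command` in a shell and return its combined output as text."""
    completed = subprocess.run(
        command,
        shell=True,                # run the command inside a shell
        stdout=subprocess.PIPE,    # capture standard output
        stderr=subprocess.STDOUT,  # merge standard error into standard output
        check=True,                # raise CalledProcessError on non-zero exit
        universal_newlines=True,   # return text rather than bytes
    )
    return completed.stdout


# Assumed invocation, requires vw on the system path:
# print(run_and_capture("vw Data/week_3_4.vw"))
```

When the command fails, the raised CalledProcessError carries the return code in its returncode attribute and the captured text in its output attribute.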
This code prints:
This is a typical Vowpal Wabbit output.
The above code looks nice and handy, but it has a major drawback. It prints the output only once the sub process command completes. It does not let you see the output of the command while it is running. This can be frustrating when the underlying command takes time to complete. And as machine learning practitioners know, training a machine learning model can take a long long time.
Fortunately for us, the subprocess package provides ways to communicate with the sub process. Let's see how we can harness this to print the output of the sub process command as it is generated. Instead of using the run function, we use the Popen constructor with the same arguments, except for check=True. Indeed, check=True only makes sense if you run the sub process command to completion.
We then parse the output line by line and print it. There is a catch however: we need to stop at some point. Looking at the above output, we see that Vowpal Wabbit terminates its output with a line that starts with total. We check for this, and stop reading from the Popen object. We then print all existing output before completion. We use rstrip to remove the trailing newline, as print already adds one. An alternative would be to repl
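Since Popen does not accept check=True, the return code has to be checked by hand once the sub process has finished. The sketch below shows one way to do it; wait_and_check is a hypothetical helper, and the sub process is a stand-in Python one-liner rather than a real training command.

```python
import subprocess
import sys


def wait_and_check(proc):
    """Wait for `proc` to finish and raise if its return code is not 0."""
    returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, proc.args)
    return returncode


# Popen returns immediately, while the sub process is still running.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('training...')"],  # stand-in command
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
# Read the pipe before waiting, so the sub process cannot block on a full pipe.
output = proc.stdout.read()
proc.stdout.close()
wait_and_check(proc)
```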
The output of this code is the same as above, except that each line is printed immediately, without waiting for completion of the sub process.
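The streaming loop described above can be sketched as follows. The generator name stream_lines and its stop_prefix parameter are hypothetical; the default value "total" reflects the stopping rule used in this post for Vowpal Wabbit, and the vw invocation in the comment is an assumption.

```python
import subprocess


def stream_lines(command, stop_prefix="total"):
    """Yield the output lines of `command` as soon as they are produced."""
    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge standard error into standard output
        universal_newlines=True,    # read text lines rather than bytes
    )
    try:
        for line in proc.stdout:
            line = line.rstrip()    # drop the trailing newline
            yield line
            if line.startswith(stop_prefix):
                break               # last line of interest reached
    finally:
        proc.stdout.close()
        proc.wait()                 # let the sub process finish


# Assumed invocation, requires vw on the system path:
# for line in stream_lines("vw Data/week_3_4.vw"):
#     print(line)
```

Reading until the pipe is exhausted would also work for commands that have no recognizable last line; passing a prefix that never occurs has that effect.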
You can use the same approach to run LightGBM or any other command line package from within Python.
I hope this is useful to readers, and I welcome comments on issues or suggestions for improvement.
Update on November 24, 2016. Added that code works fine on Windows 10 as pointed out by Irv Lustig. Also tested on Windows 7.