With data pre-processing in Deep Learning getting attention, I ventured to give the TensorFlow Transform (tf.Transform) library a try.
Why, you might ask?
Well, the examples (at least to me) didn't show how to achieve the end-to-end flow, and that piqued my interest in getting a working example of the flows diagrammed in the announcement note.
The goals
I visualized a flow as follows: the source input would be some raw data. To exercise the pre-processing functionality, I would attempt to normalize the values in some columns, turn categorical string values into integers in other columns, and the like. I expect the tf.Transform library to calculate statistics across the raw data and persist those stats. When scoring new data, I expect those stat values to be reused to first transform the new incoming data before making a prediction on it. If I achieved that, I would consider myself learned in using tf.Transform.
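To make that flow concrete, here is a minimal, library-free sketch of the pattern I expected: an "analyze" pass computes statistics over the raw training data, and a "transform" step reuses those exact statistics on new data. All the names and columns here are my own illustration, not the tf.Transform API.

```python
def analyze(rows):
    """Compute stats over raw training data: mean/std for a numeric
    column, and a string-to-integer vocabulary for a categorical one."""
    values = [r["hours"] for r in rows]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    vocab = {}
    for r in rows:
        vocab.setdefault(r["occupation"], len(vocab))
    return {"mean": mean, "std": var ** 0.5, "vocab": vocab}

def transform(row, stats):
    """Apply the persisted stats to one row, identically at training
    time and at scoring time."""
    return {
        "hours": (row["hours"] - stats["mean"]) / stats["std"],
        "occupation": stats["vocab"].get(row["occupation"], -1),  # -1 = unseen
    }

train = [{"hours": 30, "occupation": "clerk"},
         {"hours": 50, "occupation": "engineer"}]
stats = analyze(train)

# At scoring time, the *same* stats are reused on unseen data.
new_row = {"hours": 40, "occupation": "engineer"}
print(transform(new_row, stats))  # {'hours': 0.0, 'occupation': 1}
```

The whole point is that `stats` is computed once over the training data and then frozen; the scoring path never recomputes it.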
To achieve this flow, the isolation of environments during the calculation of stats (training) versus their usage (predicting) needed to be clear, and using TensorFlow Serving (tf.Serving) seemed to be the perfect fit for that. tf.Serving is a real added bonus of the TensorFlow framework compared to others, and getting my hands dirty with it seemed exciting.
The start (code reuse as always)
I looked up the examples that ship with tf.Transform. Beyond the hello world, the real-world data examples used classic Machine Learning (yet to get to Deep Learning!). One of the things I had been looking to understand was using TensorFlow for classic ML, so this example gave me the added advantage of digging into tf.contrib.learn and its Estimator interface, the starting points for that. Some more nice learning thrown in for me.
But after going through the tf.Transform examples, I felt they were far from the end-to-end flow I had in mind (and the tf.Transform examples could have thought about this!).
The questions that the tf.Transform examples didn't answer (working code)
- How to save/export the transform graph (to disk)
- How to link the transform graph on disk to a saved_model (a trained and exported model), so that one can run a tf.Serving instance pointing at an exported model that includes the transforms.
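To show what these two questions mean in practice, here is a toy sketch in plain Python (every name here is hypothetical, nothing is the tf.Transform or tf.Serving API): the training side persists the transform artifacts next to the model, and the serving side loads them so raw input is transformed before it reaches the trained model.

```python
import json
import os
import tempfile

# --- training side: persist the transform stats alongside the model ---
def export_transform(stats, export_dir):
    """Write the analyze-phase statistics to disk so a separate serving
    process can reconstruct the exact same transform."""
    with open(os.path.join(export_dir, "transform_stats.json"), "w") as f:
        json.dump(stats, f)

# --- serving side: load the stats and wrap the trained model ---
class ServedModel:
    """Toy stand-in for a serving instance: raw input goes through the
    persisted transform before it reaches the model's predict function."""
    def __init__(self, export_dir, predict_fn):
        with open(os.path.join(export_dir, "transform_stats.json")) as f:
            self.stats = json.load(f)
        self.predict_fn = predict_fn

    def predict(self, raw_value):
        scaled = (raw_value - self.stats["mean"]) / self.stats["std"]
        return self.predict_fn(scaled)

export_dir = tempfile.mkdtemp()
export_transform({"mean": 40.0, "std": 10.0}, export_dir)

# A "model" that just thresholds its (already transformed) input.
model = ServedModel(export_dir, predict_fn=lambda x: int(x > 0))
print(model.predict(55))  # (55 - 40) / 10 = 1.5 -> 1
```

In the real flow, the json file stands in for the exported transform graph, and `ServedModel` stands in for a saved_model that tf.Serving loads with the transforms baked in.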
Hacking away
I started with the tf.Serving docs/GitHub to get code for exporting a saved_model. The examples in tf.Serving are about exporting low-level computation graphs; none show how to export tf.contrib.learn models (which is what the real-world tf.Transform samples use). So I worked with the TensorFlow beginners' low-level example and successfully added the model-export functionality to it.
With that done, the test cases in the tf.Transform GitHub repo helped me get working examples of how to save and load/execute a simple transform graph in isolated standalone programs.
Integration problems
Now I tried to combine these pieces and move toward working code for the end-to-end flow. Stuck again!! I couldn't find how to persist the transform graph when it uses Apache Beam (used for statistics collection in tf.Transform), and I ran into a number of data type conversion errors while trying variations of the code (I almost felt like taking up a proper Python training course; I pride myself on having learnt many languages 'on the job'). And I still had the question of how to link the transform graph into the saved model.
Success on the shoulders of others
A bit frustrated and tired, I decided to 'google up' my problem with the right keywords. I found ibrahimiskin facing the same plight as me; he had already opened a GitHub issue a few days earlier. I was happy to see that he was further along the journey than me, but still stuck on some final pieces, and it was not clear how he had solved other things. Well, I could add my mite by commenting that I was on the same journey and that TF could help by providing end-to-end examples. ibrahimiskin was helpful in clearing up how to export an Apache Beam transform function, and it looks like he was further spurred on to find the last piece of the puzzle a few days later.
My humble contribution
Well, with all the help I got from others, the least I can do is provide the end-to-end working samples in my GitHub repo to help any fellow journeymen who come after.
And personally, I couldn't have asked for a better experience in getting to know TensorFlow hands-on, discovering additional APIs that aren't in the official samples and docs in order to get my end-to-end flow working.
P.S. The two tf.Transform example exported models (sentiment analysis and income prediction) provide different output variables (or rather, one provides more than the other). This is probably down to how the tf.contrib.learn Estimator opaquely exports the SavedModel. I might get enticed into digging into it.
Tags: 
datascience
deeplearning
machinelearning
tensorflow
pre-processing