17 March 2020 | Written by: ARMEN Pischdotchian
Categorized: Skills Development
Is there a distinction between a data scientist and a data analyst? Not exactly the same profession nor skill set, I'd argue, with the exception of two common attributes: both beckon a curious mind, and both begin with data. And at first, perhaps, both begin with the not-so-glamorous job of the 21st century: toiling as a data janitor, regardless of which title adorns your name.
If I Say Sky, You Say…?
As part of my job at IBM, I run workshops and lecture at universities on topics related to artificial intelligence and data science, and each year the crossroads of machine learning and data wrangling of one sort or another grows ever more pronounced.
Shortly after I take the podium, I ask my audience: "If I say sky, what is the first thing that comes to mind?" They often reply, "blue." I continue: "If I say grass, what do you think of first?" Green comes to mind readily. Then I say "machine learning." A few ponderous faces. Soon I set them free and proclaim: "prediction machines." The ponderous gaze turns into a conundrum, and before the half-frowned eyebrows have had a chance to rest, I ask: "When I say data, what comes to mind right away?" I give them a few minutes; some mutter tepid responses that have me thinking, while others bellow noun phrases with such confidence that I find myself re-examining my own reckoning.
The Five Vs of Data
Before the dust settles, I tell them: "the five Vs, as in volume, variety, veracity, value and velocity."
Now they are sitting back. Arrested moment. They can tell the lecture is about to begin in earnest.
Of the five Vs, variety interests me most. It tells me that data does not just reside happily in columns and rows, to be chewed on by Hadoop and extracted with SQL calls. It is not the volume of tabular data that is growing exponentially, but rather unstructured data: images, video, tweets, MRIs, CAT scans, emails and so forth. That is the data I want to infer insight from. It is one thing to know what it is that you don't know (go ask a well-trained chatbot), but it is quite another to gain insight into what I don't know that I don't even know. That's what I'm after.
At this point I almost detect a smile on the folks sitting up front, and the ones in the back have set down their coffee cups. A short pause, a silent stroll to the left and then to the right of the stage; my smile is gaining confidence, and so I continue: the starting point of science is gathering data, because we are curious. We gather data on animals and plants, on minerals, on elements, even on stars. Once we have gathered lots of data, we begin curating, sanitizing and cleansing it (veracity); some purport that this thankless job of being a data janitor can consume upwards of 85% of a data analyst's or data scientist's time.
The Bold and the Curious
So now I begin to classify my data. Typically, huge volumes of data are hard to label, so I use unsupervised machine learning techniques: animals and plants are clustered into phylogenetic trees, minerals are grouped into crystal groups, elements go into the periodic table, stars are plotted on the Hertzsprung-Russell diagram (temperature versus luminosity). Patterns begin to emerge. Now I need to extrapolate, explain and visualize my findings; now I'm wearing a data scientist's cap. My development environment is Jupyter Notebooks, and Python is the programming language. I use open, cheap and simple libraries such as Pandas, matplotlib, PixieDust and OpenCV to render charts, and NumPy and SciPy to do the statistics, whether I'm using classical statistical models such as support vector machines, regression and k-nearest neighbors, or Calculus II constructs with partial derivatives and integrals doing their magic in artificial neural networks.
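To make that clustering step concrete, here is a minimal sketch (not from the lecture, and the data and parameters are illustrative) of the unsupervised grouping described above: a bare-bones k-means loop in NumPy over synthetic 2-D points. In practice a library implementation such as scikit-learn's KMeans would be the usual choice.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Bare-bones k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (leave it in place if it temporarily owns no points).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

# Two obvious blobs of synthetic data stand in for real measurements.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
```

From here, a scatter plot of `points` colored by `labels` (matplotlib's `plt.scatter`) is the usual next step for the "visualize my findings" part.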
Phylogenetic trees make me think that perhaps the 'jump' of life forms from the primordial oceans onto land happened not because, well, they decided to jump one day, but because the tidal dance driven by the moon exposed some to air for six hours and returned them to water for the next six. Do that for a thousand years and you get marine creatures that figure out how to extract oxygen from air and not just water.
Crystal groups lend themselves to the study of crystallography.
Elements make us take a close look at silicon. It sits right underneath carbon in the periodic table, and the two share the same valence of four electrons. Silicon is a bit more metallic than carbon because it loses its electrons more easily. All life on Earth is carbon-based; DNA is built on carbon. So perhaps in the clouds of Jupiter, under silicon-friendly conditions, there could be life based on silicon.
We look at our Sun, 4.5 billion years old with roughly 4.5 billion years to go. It's a G-type star. It will swell into a red giant first, engulfing the inner planets perhaps halfway to the asteroid belt (Mars will once again be quite habitable), and then collapse into a white dwarf as its hydrogen fuel depletes. If humanity does not destroy Earth first, how long do we have on this planet? The curious want to know.
There is utter silence in the room. I pause; no PowerPoint slides behind me. I am following the example of David Kenny, my former boss at IBM Watson and Cloud, who says: "the problem with PowerPoint is that it lacks power and does not have a point."
An Earthly Example
It's time to bring them down the home stretch. I offer an example that depicts the difference between a data analyst and a data scientist at a more earthly level.
The Data Analyst
Consider the following use case. A wonderful company with a healthy, thriving culture, set in scenic countryside, has been experiencing alarming levels of employee attrition. It's not just the human resources staff (eh, talent managers) who have noticed; so have fellow employees.
Before long, the head of HR taps a data analyst on the shoulder and hands her a giant spreadsheet: a thousand rows (employees) and hundreds of columns (attributes such as age, sex, education, distance from home, you name it; whatever can be gathered under the current GDPR guidelines).
The analyst takes the spreadsheet and feeds it to a black box (it's a linear regression model), and out come colorful charts: scatter plots, bar charts, a Pareto distribution. She regresses a myriad of explanatory variables against employee attrition, and soon it emerges that employees below the age of 30 who live 20+ miles from work are the first to leave. They seem to be going to firms inside the bustling cities, where they can 'share' a scooter while commuting to work.
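A hedged sketch of what that "black box" might look like: the data below is fabricated so that young, far-commuting employees are more likely to leave, and an ordinary least-squares fit (plain NumPy, standing in for whichever regression tool the analyst actually used) recovers the direction of both effects.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000  # one row per employee, mirroring the spreadsheet

# Synthetic attributes: age in years, commute distance in miles.
age = rng.uniform(22, 60, n)
distance = rng.uniform(1, 40, n)

# Fabricated ground truth: under-30 employees with a 20+ mile
# commute are much more likely to leave.
leave_prob = 0.1 + 0.5 * (age < 30) * (distance > 20)
left = (rng.random(n) < leave_prob).astype(float)  # 1.0 = employee left

# The "black box": ordinary least squares on [1, age, distance].
X = np.column_stack([np.ones(n), age, distance])
coeffs, *_ = np.linalg.lstsq(X, left, rcond=None)
intercept, beta_age, beta_distance = coeffs
# beta_age comes out negative (younger employees leave more),
# beta_distance positive (longer commutes leave more).
```

A linear model on a binary outcome is crude (a logistic regression would be the textbook choice), but it is enough to surface the age-and-distance pattern the story turns on.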
As they say in Massachusetts (where I live), light dawns on Marblehead for the HR folks. Let's have the young, the bold and the curious who live 20-plus miles from the office work from home three days a week. But soon a similar cohort, folks who live just down the road from the building's farm setting, find it unfair that they must commute every day, even though it takes 10 minutes. OK, fine, you can all work from home. Now the building is a ghost town. The execs mostly still come in, but the cubes are empty. Hmmm. Predictive, not exactly prescriptive.
The Data Scientist
Let's say John, the big boss of Company X, who drives to work (and that's fine with John), is growing annoyed each time he reaches the intersection of Main Street and Quagmire Road. Coming down Quagmire Road (let's say the other routes are far more arduous to navigate), he finds that each morning around rush hour he sits through that traffic light twice, sometimes three times. The cars pile up 12 deep, and by the time he reaches the green light it turns red again.
Not happy, he one day finds a curious-looking data scientist and tasks her with finding out just what is going on. She does the drive herself a few times, and indeed, the double whammy of the lights; heck, even on the way back, the same thing. Each red light lasts three minutes, so she spends six minutes at that one intersection of Main and Quagmire.
Well, cameras abound: street-corner cameras and many more. With city and vendor permission, she obtains videos of the traffic on both streets and uploads them to a Box folder.
She uses a CNN (convolutional neural network) combined with LSTM (long short-term memory) models and trains the system to recognize boxy things that have four or more wheels. Things with three or fewer wheels are friendly fire; leave those alone.
The algorithm builds patterns (it's all hue, saturation and value on the frames; no need for high-definition video). Frankly, still images captured as a time series will do.
OK: from 3:45 until 6:10 in the afternoon, there are a lot more boxy things with four-plus wheels on Quagmire Road than on Main Street. What if she tweaked the timing so that, instead of 3 minutes, the red light lasted 2 minutes, or maybe 2:10? Tweaking the model is what keeps it from being a black box. The variables are dictionaries and tuples in a list (think of it as an array); for each given time frame, the model predicts a certain number of vehicles stuck at the light. It turns out that the optimum wait at a given red light under heavy traffic is 2:23. That way, there is only a 3.5% chance that any given driver will endure two consecutive red lights. The algorithm must operate at the cusp of efficiency, because the less one group waits, the longer the folks at the opposing light have to wait; got to make this as fair as possible.
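The timing trade-off can be sketched with a toy Monte Carlo simulation (standard library only). The arrival rate, the 12-cars-per-green clearance and the queue model itself are assumptions for illustration, so this toy will not reproduce the 2:23 or 3.5% figures above; it only shows the mechanism: shorter reds queue fewer cars, so fewer drivers spill over into a second red.

```python
import random

def prob_two_reds(red_s, rate_per_s, cars_per_green=12, trials=20_000, seed=1):
    """Monte Carlo estimate of the chance a newly arriving driver must
    sit through two consecutive reds, in a crude queue model: cars
    arrive at rate_per_s during the red, and only cars_per_green of
    them clear the intersection on each green."""
    rng = random.Random(seed)
    stuck = 0
    for _ in range(trials):
        # Our driver arrives at a random moment within the red phase.
        t = rng.uniform(0, red_s)
        # Cars already queued ahead: one Bernoulli arrival per second.
        ahead = sum(1 for _ in range(int(t)) if rng.random() < rate_per_s)
        if ahead >= cars_per_green:
            stuck += 1  # the first green cannot clear the queue
    return stuck / trials

p_long = prob_two_reds(red_s=180, rate_per_s=0.1)   # 3-minute red
p_short = prob_two_reds(red_s=143, rate_per_s=0.1)  # a 2:23 red
```

Sweeping `red_s` over a grid and picking the value that balances this probability against the wait imposed on the cross street is the "cusp of efficiency" search in miniature.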
Brilliant. But on Sundays in that same time window, the light does not have to stay red for 2:23, nor on weekdays that happen to be holidays. That's right, dear readers: this is not a deterministic system, it's a probabilistic one. It learns and adapts. Soon the bold and curious data scientist takes her discovery to City Hall, shows them the fancy charts, and in time they install "smart lights." She begins to think that maybe reinforcement learning is a better approach, patents her code, leaves Company X and gets a federal ID number for her startup. Now that's prescriptive, not just predictive.
As I scoot to my hotel, I start imagining a cognitive system that picks out the ‘curious’ traits.