A common question that I get from my students is: “What is the difference between a Data Analyst and a Data Scientist?”
I would argue that the task at hand differs. It differs because being a data scientist does not make one master of the universe. When we say data science is teamwork, it means that the team includes a data journalist who is involved with data collection and data wrangling activities. The data engineer likely works with Python and strives to bring forth meaningful visualizations of the data. The data analyst may perform both of those tasks, is well versed in SQL calls, and understands the DBMS that is humming either on premises or in the cloud (or, increasingly, in hybrid systems). Think Hadoop, big data and data mining skills.
The data scientist is the curious one. They are the ones with a pain point to resolve. The data scientist has a hypothesis to refute or validate (both are helpful). The data scientist ventures out of the office, feels the cold and the rain, and takes measurements from the sensors out there.
Unlike the data analyst, the data scientist (DS) is also keenly involved with unstructured data. This means the DS is extracting insights and sentiment from Twitter feeds and from Facebook images, perhaps to detect a sudden onset of depression as a result of social distancing. Is that depression more prevalent in one cohort than another? How can I help? These insights are not in IBM DB2, MS SQL Server or an Oracle database…these data points are in our mobile devices.
The tools that the DS may use go beyond the brute statistics (regression, random forests, Bayesian inference) of SPSS or SAS; they employ deep learning techniques (CNN, RNN, LSTM, capsule networks, GANs) that use feature vectors for input. After all, data points fed to a machine have to be turned into numbers first, and they are often normalized between 0 and 1. The system does not just see a cat; rather, it sees a one-hot encoding…a bunch of ones and zeros. The same is true if your input is a CSV file.
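To make the normalization and one-hot encoding idea concrete, here is a minimal sketch in plain Python. The helper names (`min_max_scale`, `one_hot`) and the sample values are made up for illustration; in practice you would more likely reach for a library such as scikit-learn.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Turn each label into a vector with a single 1 at its category index."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1 if index[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]


# Hypothetical sample data: a numeric column and a categorical column.
ages = [20, 30, 40]
animals = ["cat", "dog", "cat"]

scaled_ages = min_max_scale(ages)
encoded_animals = one_hot(animals)

print(scaled_ages)      # [0.0, 0.5, 1.0]
print(encoded_animals)  # [[1, 0], [0, 1], [1, 0]]
```

The machine never sees "cat"; it sees `[1, 0]`.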
The reigning attribute that I would like to see in an aspiring data scientist is a sense of curiosity: someone who goes around and asks ‘why’ all the time. Another key feature is an understanding of inferential statistics (think regression) and Calculus II (think partial derivatives and integrals). They can clearly see in their minds how integration is the inverse of differentiation.
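If you want to see that relationship between integrals and derivatives rather than take it on faith, here is a small numerical sketch. It assumes a simple example function f(x) = x², integrates it with the trapezoid rule, and then differentiates the result with a central difference. All names and step sizes here are illustrative choices, not a recommended numerical recipe.

```python
def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f from a to b with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


def derivative(g, x, h=1e-4):
    """Approximate the derivative of g at x with a central difference."""
    return (g(x + h) - g(x - h)) / (2 * h)


def f(x):
    # Example function chosen for illustration: f(x) = x^2.
    return x * x


def F(x):
    # Running integral of f from 0 up to x.
    return integrate(f, 0.0, x)


# Differentiating the integral should give back (approximately) f itself.
recovered = derivative(F, 2.0)
print(recovered, f(2.0))
```

The two printed numbers should agree to several decimal places: integrate, then differentiate, and you are back where you started.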
Python, you say? Well, it helps, but it is not at the top of my list. Nowadays, we snag code from existing Jupyter Notebooks and reuse it, perhaps just changing what goes on the x and y axes in the code. One thing that is a bit more prevalent with DS is the use of open source tools for running the math calculations (NumPy, scikit-learn) and data visualizations (my favorite visualization tool is the open source PixieDust…born and raised right here at IBM by an ex-IBM Distinguished Engineer, David Taieb).
The data scientist has a keen understanding of the confusion matrix and can interpret a Receiver Operating Characteristic (ROC) curve, where we plot the True Positive Rate (y-axis) against the False Positive Rate (x-axis).
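Here is a minimal sketch of how the confusion matrix and the ROC curve relate, in plain Python. The labels, scores and thresholds are invented for illustration, and the function names are my own; a library routine such as scikit-learn's `roc_curve` would normally do this for you.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_point(y_true, y_score, threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # x-axis of the ROC curve
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # y-axis of the ROC curve
    return fpr, tpr


# Hypothetical ground truth (1 = positive) and model scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold gives one confusion matrix, hence one point on the curve.
roc_points = [roc_point(y_true, y_score, t) for t in (0.9, 0.5, 0.3, 0.0)]
print(roc_points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Sweeping the threshold from strict to lenient walks the curve from the bottom-left corner to the top-right corner; the closer the curve hugs the top-left, the better the classifier separates the two classes.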
The data scientist is a scientist because they started with a hypothesis and they employed the scientific method.
The data scientist understands the value of Design Thinking. They realize that there is such a thing as boiling the ocean, and they stay keenly aligned on one question: for whom exactly are we solving the said problem?
There is a shared task among all these roles that have the word ‘data’ in them: they all start the week as data janitors. The Harvard Business Review article that deemed Data Scientist the sexiest job of the 21st century forgot to mention that all that allure starts by Thursday, not on Monday. Lots of unglamorous data cleansing needs to be done before machine learning comes into play.
The data scientist understands that the winner of the AI race is not the entity or country with epic amounts of data, nor the university or firm with the next big algorithm; the winner is whoever does maximum AI with minimum data. For example, I have a hunch it is going to be a good day…how much data did I need first thing in the morning to make that prediction? Good luck to the ML system in making that prediction using a hunch…for now!