Difference Between Data Science And Machine Learning

Data Science VS Machine Learning

Introduction

Even companies, judging by their job descriptions, seem confused about what separates a data scientist from a machine learning engineer. I am here to offer some insight into why the roles are separate and where they can overlap.

At first, while studying to become a data scientist, I was embarrassingly unaware of what a machine learning engineer was. I quickly came to realize that the field is similar to data science but different enough to require a unique set of skills. Long story short, data science is the researching, building, and interpreting of a model, while machine learning engineering is putting that model into production. Now that I have been in the field for a few years, gaining experience in both disciplines, I have outlined below what is and what is not part of a data science role and a machine learning engineer role.

Data Scientist

A close look at a data scientist’s Jupyter Notebook.

A statistician? Kind of. Data science, in its simplest terms, can be described as a field of automated statistics in the form of models that aid in classifying and predicting outcomes. Here are the top skills that are required to be a data scientist:

Python or R

SQL

Jupyter Notebook

Python: to expand on the skills above, I believe most companies are looking for Python more than R. Some job descriptions list both; however, most of the people you work with, such as machine learning engineers, data engineers, and software engineers, will not be familiar with R. Therefore, I believe Python will be more beneficial if you want to be a more well-rounded data scientist.

SQL can at first seem more like a data analyst skill. It is, but it should still be a skill you employ for data science. In a business setting (as opposed to academia), most datasets are not handed to you, and you will have to build your own via SQL. There are plenty of SQL dialects, such as PostgreSQL, MySQL, Microsoft SQL Server T-SQL, and Oracle SQL. They are similar forms of the same query language, implemented by different database platforms. Because they are so similar, knowing any one of them is useful and translates easily to a slightly different dialect.
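As a rough illustration of building your own dataset with SQL, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a company warehouse. The table names, columns, and values are hypothetical and only meant to show the shape of the work: joining raw tables into one row per entity.

```python
import sqlite3

# In-memory database standing in for a company data warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 35.0), (12, 2, 15.0);
""")

# One row per customer: the kind of modeling dataset you usually have to build yourself.
query = """
    SELECT c.customer_id,
           c.region,
           COUNT(o.order_id)          AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total_spend
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
    ORDER BY c.customer_id
"""
dataset = conn.execute(query).fetchall()
conn.close()

for row in dataset:
    print(row)
```

The same query would need only small syntax adjustments to run on PostgreSQL, MySQL, or another dialect.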

Jupyter Notebook could almost be the exact opposite of a machine learning engineer’s toolkit. A Jupyter Notebook is a data scientist’s playground for both coding and modeling. It is a research environment, if you will, that allows quick and easy Python coding, keeps notes and comments alongside the code itself, and serves as a platform to build and test models with useful libraries like sklearn, pandas, and numpy.

Overall, a data scientist can be many things, but the main functions are to

— meet with stakeholders to define the business problem

— pull data (SQL)

— EDA, feature engineering, model building, & prediction (Python and Jupyter Notebook)

— depending on workplace, compile code to .py format and/or pickled model
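The last step in the list above, handing over a pickled model, might look roughly like the sketch below. To keep it free of third-party dependencies, a plain dictionary of hypothetical fitted parameters stands in for the trained model; with sklearn, the object being pickled would be the fitted estimator itself.

```python
import pickle

# Hypothetical fitted parameters standing in for a trained model object.
model = {"feature": "total_spend", "slope": 2.0, "intercept": 1.0}

def predict(model, x):
    """Apply the stored linear parameters to a single input value."""
    return model["slope"] * x + model["intercept"]

# Serialize to bytes; pickle.dump(model, file_obj) would write the same bytes to a .pkl file.
payload = pickle.dumps(model)

# Later, in a separate job: load the model back and use it.
restored = pickle.loads(payload)
prediction = predict(restored, 5)
print(prediction)
```

One caveat worth remembering: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.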

To find out more about what a data scientist is, how much they make, and the outlook of the field, see this overview [3] from UC Berkeley.

Machine Learning Engineer

The settings menu of Docker, commonly used by Machine Learning Engineers.

Now, after that last point above, is where a machine learning engineer comes in. Their main function is to put that model into production. A data science model can be quite static, and an engineer can help to automatically train and evaluate that same model. They would then insert the predictions back into your company's data warehouse or SQL tables. After that, a software engineer and a UI/UX designer will display the predictions in a user interface, if necessary. As you can see, the whole process from business problem to a solution in a visible, easy-to-use format is not just the responsibility of a data scientist (although, yes, some data scientists can cover several of these roles).
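Writing predictions back into a SQL table can be sketched as follows, again with sqlite3 standing in for the real warehouse. The table, the column names, and the score function are all hypothetical; in production the scoring step would call the trained model rather than a hand-written rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, total_spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 55.0), (2, 15.0), (3, 0.0)])

def score(total_spend):
    # Placeholder rule standing in for a real model's predict() call.
    return "high_value" if total_spend >= 50 else "standard"

# Add a new column for the model output, then fill it row by row.
conn.execute("ALTER TABLE customers ADD COLUMN predicted_segment TEXT")
rows = conn.execute("SELECT customer_id, total_spend FROM customers").fetchall()
conn.executemany(
    "UPDATE customers SET predicted_segment = ? WHERE customer_id = ?",
    [(score(spend), customer_id) for customer_id, spend in rows],
)
conn.commit()

result = conn.execute(
    "SELECT customer_id, predicted_segment FROM customers ORDER BY customer_id"
).fetchall()
conn.close()
print(result)
```

From here, a downstream application only needs to read the new column to display the predictions.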

The role of a machine learning engineer can also be called MLOps (machine learning operations). A summary of their workflow would be something like this:

A. pkl_file of data science model

B. storage bucket (on GCP, for example the bucket used by Google Cloud Composer)

C. DAG (for scheduling the trainer and evaluator of the model)

D. Airflow (visualizes the process — ML pipeline)

E. Docker (containers and virtualization)
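To make the DAG idea in the workflow above concrete, here is a small sketch that is not Airflow code: it uses Python's standard-library graphlib module to run hypothetical pipeline steps in dependency order. In Airflow, each function would instead be a task defined in a DAG file, and the scheduler would run them on a schedule.

```python
from graphlib import TopologicalSorter

log = []  # records the order in which steps actually ran

# Hypothetical pipeline steps; real ones would load the .pkl file, retrain, and so on.
def load_model():
    log.append("load_model")

def train():
    log.append("train")

def evaluate():
    log.append("evaluate")

def write_predictions():
    log.append("write_predictions")

tasks = {
    "load_model": load_model,
    "train": train,
    "evaluate": evaluate,
    "write_predictions": write_predictions,
}

# Each key runs only after every task in its set has finished.
dependencies = {
    "train": {"load_model"},
    "evaluate": {"train"},
    "write_predictions": {"evaluate"},
}

order = list(TopologicalSorter(dependencies).static_order())
for name in order:
    tasks[name]()

print(order)
```

The point of declaring dependencies rather than hard-coding a call sequence is that the scheduler can work out a valid order and detect cycles for you.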

At first, perhaps data science and machine learning could be seen as interchangeable titles and fields; however, with a closer look, we realize machine learning engineering is more a combination of software engineering and data engineering than it is data science. Below, I will outline where the fields do and do not cross over.

For more information, visualizations, and processes, see this machine learning operations overview by Google [5].

Similarities

Perhaps the biggest similarity between data science and machine learning is that they both touch the model. The main skills that both fields share are:

SQL

Python

GitHub

Concept of training and evaluating data

The similarities are primarily in programming: the languages each person uses to perform their role. Both positions perform some form of engineering, whether that is a data scientist querying a database using SQL or a machine learning engineer using SQL to insert the suggestions or predictions from the model back into a newly labeled column/field.

Both fields require knowledge of Python (or R) and usually version control, code sharing, and pull requests through GitHub.
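The shared concept of training and evaluating can be sketched in a few lines of plain Python. The data below is synthetic, and the single-feature least-squares fit is just a simple stand-in for whatever model is actually in use; the point is that the model is fitted on one portion of the rows and scored on rows it has not seen.

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
data = [(x, 3 * x + 2 + random.uniform(-0.5, 0.5)) for x in range(20)]
random.shuffle(data)
train_rows, test_rows = data[:15], data[15:]  # 75% for training, 25% held out

def fit_line(rows):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(model, rows):
    """Average absolute gap between actual and predicted values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)

model = fit_line(train_rows)                 # training step
mae = mean_absolute_error(model, test_rows)  # evaluation on unseen rows
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} test MAE={mae:.2f}")
```

A data scientist typically runs this loop interactively while researching a model, whereas a machine learning engineer automates the same two steps so they run repeatedly in production.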

A machine learning engineer will sometimes want to learn how algorithms like XGBoost or Random Forest work, and will need to look at the model’s hyperparameters for tuning in order to conduct research on memory and size constraints. While data scientists can build highly accurate models in academia or on the job, there can be more restrictions in the workplace due to time, money, and memory constraints.

Differences

Some of the differences are already outlined in the above sections of data science and machine learning, but there are some key features of both careers and academic research that are important to point out:

Data Science

- focuses on statistics and algorithms

- unsupervised and supervised algorithms

- regression and classification

- interprets results

- presents and communicates results

Machine Learning

- focuses on software engineering and programming

- automation

- scaling

- scheduling

- incorporating model results into a table/warehouse/UI
