Making Sense of Data-Related Job Titles - Konstantin Golyaev's Website

Introduction

Many data-related job titles exist today, and it’s not always clear what skills the right candidate for each should be expected to possess. In this post I will share my understanding of the distinctions in skills and expectations for each job. Specifically, I will consider the following job titles:

Data Scientist
Data Analyst
Business Intelligence (BI) Developer
Data Engineer
Machine Learning (ML) Engineer
Algorithm Engineer
Data Architect

End-to-End Data Science Workflow

To understand distinctions between different data jobs, it is helpful to consider the entire data science workflow end-to-end:

1. “Organize Sources”: organize and curate internal and external data sources.
2. “Frame Question”: understand business needs and “translate” them into mathematics and statistics.
3. “Construct Dataset”: combine curated data sources to identify the data subset that is relevant for coming up with an answer to the business problem.
4. “Train Model”: apply DS and ML techniques to tease out the answer from the relevant data.
5. Use answers to make better business decisions in one of two ways:
   1. “Present Results”, if answers are meant for consumption by human decision-makers:
      - “Translate” DS answers back into recommendations that non-technical business stakeholders can understand.
      - Present findings to key business decision-makers and drive adoption of recommendations.
   2. “Deploy Model”, if answers are meant to inform other algorithms:
      - Wrap the ML model training and prediction operations into a full set of CI/CD operations to ensure continuous observability and monitoring of the entire workflow. Today this process is commonly referred to as MLOps.

Armed with this roadmap, I will attempt to map different job roles to the stages of the above process.

Data Scientist

Data Scientist is probably the most vaguely defined job role. I find it helpful to think of all other jobs on this list as “data science with a focus on \(X\)”. This also means it is possible to find data scientists working on every stage of the above workflow, from Organizing Sources all the way to Deploying Models. Nevertheless, I would venture to say that most data scientists will primarily focus on stages 2 (Frame Question) through 5.1 (Present Results). Most of their time will be spent in Framing Questions and Constructing Datasets.

In terms of technologies and programming languages, I expect a data scientist to know SQL and either R or Python (and often both). The exact SQL flavor, such as Postgres, Oracle, Microsoft SQL Server, or Spark SQL, is not relevant: the differences are mostly cosmetic.

Data Analyst

I think of data analysts as junior data scientists. In particular, I would not expect a data analyst to have deep knowledge of statistics and ML techniques. Data analysts usually get assigned tasks that cause them to spend lots of time Constructing Datasets, followed by fairly straightforward analyses.

Technology-wise, data analysts should know SQL, and everything beyond that would be “nice to have”. They will likely be more productive armed with some knowledge of R or Python, but this knowledge is usually not expected.
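
To make the “Constructing Datasets, followed by fairly straightforward analyses” pattern concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `customers` and `orders` tables, their columns, and all values are made up for illustration; the same query would look nearly identical in other SQL flavors.

```python
import sqlite3

# Hypothetical in-memory tables; names and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'US');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 2, 35.0), (12, 2, 5.0), (13, 3, 40.0);
""")

# "Construct Dataset": join the two sources.
# "Straightforward analysis": order count, total, and average amount per region.
query = """
    SELECT c.region,
           COUNT(*)      AS n_orders,
           SUM(o.amount) AS total_amount,
           AVG(o.amount) AS avg_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
for region, n_orders, total_amount, avg_amount in rows:
    print(region, n_orders, total_amount, round(avg_amount, 2))
```

Most of the effort in a real task of this kind goes into figuring out which tables to join and on which keys, not into the aggregation itself.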

Data Engineer

Data engineers are the under-appreciated heroes of the big data age. They tend to spend most of their effort on Organizing Sources and Constructing Datasets, and primarily work on making sure the data pipelines that move data from source A to destination B are stable, robust, and fault-tolerant. Data engineers are responsible for connecting several different data storage and manipulation systems together, which often amounts to coming up with creative ways of putting square pegs into round holes. One can compare data engineers to plumbers, in the sense that their work tends to attract attention only when the contents stop flowing through the pipes.
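
As one small illustration of what “fault-tolerant” can mean for a pipeline step, the sketch below retries a failing step with exponential backoff. This is only one of many techniques, and the names `with_retries` and `flaky_copy` are hypothetical, not taken from any particular framework.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(step: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.1) -> T:
    """Run `step`, retrying with exponential backoff whenever it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical step that moves data from source A to destination B
# and fails on its first two calls to simulate a transient outage.
calls = {"n": 0}


def flaky_copy() -> int:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source A temporarily unavailable")
    return 42  # pretend this is the number of rows copied


rows_copied = with_retries(flaky_copy, max_attempts=5, base_delay=0.01)
print(rows_copied, "rows copied after", calls["n"], "attempts")
```

A production pipeline would typically retry only on errors known to be transient and would log each failure, but the overall shape is similar.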

Skillful data engineers end up having to know a lot of languages and frameworks. SQL is unavoidable, since it is the language of virtually all relational databases. In addition, knowing some Scala, Python, and Java is often necessary. And in many cases, moving data between two big systems requires dealing with some kind of internal system-specific language (such as DAX in Microsoft Power BI).

BI Developer

I think of a BI developer as someone who is both a data analyst and a data engineer. A BI developer (or BI engineer) usually spends a lot of time on Organizing Sources and, to a lesser extent, Constructing Datasets. At Amazon in particular, BI engineers were a mix of data analysts, data scientists, and data engineers. I would expect a person in this role to be more technical than a data analyst, less technical than a data engineer, and less knowledgeable in statistics and machine learning than a data scientist.

Speaking of technical skills, I would absolutely expect a BI developer to be very comfortable with SQL, and also comfortable setting up and configuring integration points between various components of the data stack.

ML Engineer

The primary stages on which ML engineers spend their time are Training Models and Deploying Models. This is not to say that they won’t be doing any of the earlier stages such as Framing Questions or Constructing Datasets, but in practice the model deployment stage takes a lot of time and iterations. I would generally not expect candidates for other jobs to be intimately familiar with MLOps. By the same token, I would absolutely expect ML engineering candidates to have some MLOps chops and, more importantly, to be eager and willing to develop deeper expertise on the subject. Another way to think about this is that most data scientists are expected to write ML code, while ML engineers are expected to develop ML software: there are generally higher expectations regarding code quality and performance when it comes to the fruits of an ML engineer’s labor.

ML engineers are expected to be skilled in Python. They are often also experienced with shell scripting and with tools that enable Infrastructure-as-Code development patterns, such as Terraform or Kubernetes. In practice, this often translates into writing YAML files to orchestrate CI/CD pipelines.

Algorithm Engineer

Personally, I have not seen this job title used very often, so take everything that follows with a grain of salt. My guess would be that an algorithm engineer is similar in skills and responsibilities to an ML engineer, with one big difference: they focus primarily on Training Models, and would often be expected to implement novel ML algorithms from scratch, rather than use existing off-the-shelf implementations such as XGBoost or LightGBM. I suspect that most algorithm engineers spend little time focusing on Framing Questions or Constructing Datasets. Their talents come in handy when the overall ML solution architecture has taken shape and its performance becomes a bottleneck.

For algorithm engineers, knowing Python is usually not enough. They often need to be comfortable with a more performant language such as C++, C#, or Java.

Data Architect

A data architect is someone who has a lot of experience working on end-to-end data systems. They have seen such systems evolve organically and become increasingly complex and opaque over time, and they have learned the anti-patterns and failure modes at the overall system level (30,000 feet or more). The job of a data architect is to help others who are designing new data systems avoid some of these anti-patterns and mistakes. In terms of stages, they spend a lot of time thinking about Organizing Sources and Deploying Models. In addition, there is often a system design aspect to facilitating the work of Training Models, which helps other data professionals on the project be more productive.

For data architects, there tends to be less focus on specific programming languages, and more focus on understanding trade-offs between several alternative system layouts. Data architects do not get to write a lot of code on a day-to-day basis; instead, they spend a lot of time on design reviews and discussions.

Summary

The following heatmap summarizes the preceding narrative. Job titles are in rows, and data science project stages are in columns. Darker cells indicate deeper involvement of a particular professional in a given stage of the process.
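
The same mapping can also be written down as a small lookup table. The sketch below is a rough approximation: it records only the stages that the sections above explicitly name as a primary focus for each role, so it is binary and does not capture the gradations of the heatmap. The names `Stage`, `PRIMARY_STAGES`, and `roles_for` are made up for illustration.

```python
from enum import Enum


class Stage(Enum):
    ORGANIZE_SOURCES = "Organize Sources"
    FRAME_QUESTION = "Frame Question"
    CONSTRUCT_DATASET = "Construct Dataset"
    TRAIN_MODEL = "Train Model"
    PRESENT_RESULTS = "Present Results"
    DEPLOY_MODEL = "Deploy Model"


# Primary stages per role, as named in the text (a simplification, not the heatmap).
PRIMARY_STAGES = {
    "Data Scientist": {Stage.FRAME_QUESTION, Stage.CONSTRUCT_DATASET,
                       Stage.TRAIN_MODEL, Stage.PRESENT_RESULTS},
    "Data Analyst": {Stage.CONSTRUCT_DATASET},
    "Data Engineer": {Stage.ORGANIZE_SOURCES, Stage.CONSTRUCT_DATASET},
    "BI Developer": {Stage.ORGANIZE_SOURCES, Stage.CONSTRUCT_DATASET},
    "ML Engineer": {Stage.TRAIN_MODEL, Stage.DEPLOY_MODEL},
    "Algorithm Engineer": {Stage.TRAIN_MODEL},
    "Data Architect": {Stage.ORGANIZE_SOURCES, Stage.DEPLOY_MODEL},
}


def roles_for(stage):
    """Return the roles whose primary focus includes `stage`, sorted by name."""
    return sorted(role for role, stages in PRIMARY_STAGES.items() if stage in stages)


for stage in Stage:
    print(f"{stage.value}: {', '.join(roles_for(stage))}")
```

Reading the table by stage rather than by role makes it easy to see, for instance, which titles to look for when a team’s bottleneck is in one particular part of the workflow.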