Many different data-related job titles exist today, and it’s not always clear what skills the right candidate for each should be expected to possess. In this post I will share my understanding of the distinctions in skills and expectations for each job. Specifically, I will consider the following job titles:
- Data Scientist
- Data Analyst
- Business Intelligence (BI) Developer
- Data Engineer
- Machine Learning (ML) Engineer
- Algorithms Engineer
- Data Architect
End-to-End Data Science Workflow
To understand distinctions between different data jobs, it is helpful to consider the entire data science workflow end-to-end:
1. “Organize Sources”: organize and curate internal and external data sources.
2. “Frame Question”: understand business needs and “translate” them into mathematics and statistics.
3. “Construct Dataset”: combine curated data sources to identify the data subset that is relevant for coming up with an answer to the business problem.
4. “Train Model”: apply DS and ML techniques to tease out the answer from the relevant data.
5. Use answers to make better business decisions in one of two ways:
   5.1. “Present Results”, if answers are meant for consumption by human decision-makers:
      - “Translate” DS answers back into recommendations that non-technical business stakeholders can understand.
      - Present findings to key business decision-makers and drive adoption of recommendations.
   5.2. “Deploy Model”, if answers are meant to inform other algorithms:
      - Wrap the ML model training and prediction operations into a full set of CI/CD operations to ensure continuous observability and monitoring of the entire workflow. Today this process is commonly referred to as MLOps.
Armed with this roadmap, I will attempt to map different job roles to the stages of the above process.
Data Scientist is probably the most vaguely defined job role. I find it helpful to think of all other jobs on this list as “data science with a focus on \(X\)”. This also means it is possible to find data scientists working on every stage of the above workflow, from Organizing Sources all the way to Deploying Models. Nevertheless, I would venture to say that most data scientists will primarily focus on stages 2 (Frame Question) through 5.1 (Present Results). Most of their time will be spent in Framing Questions and Constructing Datasets.
In terms of technologies and programming languages, I expect a data scientist to know SQL and either R or Python (and often both). The exact SQL flavor, such as Postgres, Oracle, Microsoft SQL Server, or Spark SQL, is not relevant: the differences are mostly cosmetic.
I think of data analysts as junior data scientists. In particular, I would not expect a data analyst to have deep knowledge of statistics and ML techniques. Data analysts usually get assigned tasks that cause them to spend lots of time Constructing Datasets, followed by fairly straightforward analyses.
Technology-wise, data analysts should know SQL, and everything beyond that would be “nice to have”. They will likely be more productive armed with some knowledge of R or Python, but this knowledge is usually not expected.
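To make the "Constructing Datasets followed by a straightforward analysis" pattern concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables, their columns, and the sample rows are all hypothetical and exist only for illustration; a real project would query whatever curated sources the organization already has.

```python
import sqlite3


def make_demo_connection():
    """Create an in-memory database with two tiny, made-up source tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
        INSERT INTO orders VALUES
            (10, 1, 20.0), (11, 2, 35.0), (12, 3, 5.0), (13, 2, 15.0);
        """
    )
    return conn


def build_dataset(conn):
    """Join the two sources and aggregate order count and revenue per region."""
    query = """
        SELECT c.region,
               COUNT(o.order_id) AS n_orders,
               SUM(o.amount)     AS revenue
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for region, n_orders, revenue in build_dataset(make_demo_connection()):
        print(region, n_orders, revenue)
```

The join and the `GROUP BY` are the dataset construction step; the per-region totals are the kind of simple analysis that often follows it.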
Data engineers are the under-appreciated heroes of the big data age. They tend to spend most of their effort in Organizing Sources and Constructing Datasets, and primarily work on making sure the data pipelines that move data from source A to destination B are stable, robust, and fault-tolerant. Data engineers are responsible for connecting several different data storage and manipulation systems together, which often amounts to coming up with creative ways of putting square pegs into round holes. One can compare data engineers with plumbers, in the sense that their work tends to attract attention only when contents stop flowing through the pipes.
Skillful data engineers end up having to know a lot of languages and frameworks. SQL is unavoidable, since it is the common language of relational databases. In addition, knowing some Scala, Python, and Java is often necessary. And moving data between two big systems will often require dealing with some kind of internal system-specific language (such as DAX in Microsoft Power BI).
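One small ingredient of a fault-tolerant pipeline is retrying a step that fails for transient reasons. The sketch below shows one common way to do that, retrying with exponential backoff. The names `run_with_retries` and `FlakySource` are hypothetical, and a real pipeline would usually rely on the retry facilities of its orchestration framework rather than hand-rolled code like this.

```python
import time


def run_with_retries(step, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call step() until it succeeds or max_attempts is exhausted.

    The wait doubles after every failure (base_delay, 2 * base_delay, ...).
    The last failure is re-raised so the caller can see that the step
    ultimately failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            # Illustration only: real code should catch just the errors
            # known to be transient (timeouts, dropped connections, ...).
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


class FlakySource:
    """A made-up data source that fails a fixed number of times, then works."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("source temporarily unavailable")
        return ["row-1", "row-2"]


if __name__ == "__main__":
    source = FlakySource(failures=2)
    print(run_with_retries(source.read), "after", source.calls, "calls")
```

Passing `sleep` as a parameter is a design choice that lets the backoff be tested without actually waiting.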
I think of a BI developer as someone who is both a data analyst and a data engineer. A BI developer (or BI engineer) usually spends a lot of time on Organizing Sources and, to a lesser extent, Constructing Datasets. At Amazon in particular, BI engineers were a mix between data analysts, data scientists, and data engineers. I would expect a person in this role to be more technical than a data analyst, less technical than a data engineer, and less knowledgeable in statistics and machine learning than a data scientist.
Speaking of technical skills, I would absolutely expect a BI developer to be very comfortable with SQL, and also comfortable setting up and configuring integration points between various components of the data stack.
The primary stages on which ML engineers spend their time are Training Models and Deploying Models. This is not to say that they won’t be doing any of the earlier stages such as Framing Questions or Constructing Datasets, but in practice the model deployment stage takes a lot of time and many iterations. I would generally not expect candidates for other jobs to be intimately familiar with MLOps. By the same token, I would absolutely expect ML engineering candidates to have some MLOps chops and, more importantly, to be eager and willing to develop deeper expertise on the subject. Another way to think about this is that most data scientists are expected to write ML code, while ML engineers are expected to develop ML software: there are generally higher expectations regarding code quality and performance when it comes to the fruits of an ML engineer’s labor.
ML engineers are expected to be skilled in Python. They are often also experienced with shell scripting and with tools that enable Infrastructure-as-Code development patterns, such as Terraform or Kubernetes. In practice, much of this translates into writing configuration files, often YAML, to orchestrate CI/CD pipelines.
Personally, I have not seen the Algorithms Engineer job title used very often, so take everything that follows with a grain of salt. My guess would be that an algorithm engineer is similar in skills and responsibilities to an ML engineer, with one big difference: they focus primarily on Training Models, and would often be expected to implement novel ML algorithms from scratch, rather than use existing off-the-shelf implementations such as XGBoost or LightGBM. I suspect that most algorithm engineers spend little time on Framing Questions or Constructing Datasets. Their talents come in handy when the overall ML solution architecture has taken shape and its performance becomes a bottleneck.
For algorithm engineers, knowing Python is usually not enough. Often, they need to be comfortable with a more performant language such as C++, C#, or Java.
A data architect is someone who has a lot of experience working on end-to-end data systems. They have seen such systems evolve organically and become increasingly complex and opaque over time, and they have learned the anti-patterns and failure modes at the overall system level (30,000 feet or more). The job of a data architect is to help others who are designing new data systems avoid these anti-patterns and mistakes. In terms of stages, they spend a lot of time thinking about Organizing Sources and Deploying Models. In addition, they often contribute system design work that facilitates Training Models and helps other data professionals on the project be more productive.
For data architects, there tends to be less focus on specific programming languages, and more focus on understanding trade-offs between several alternative system layouts. Data architects do not get to write a lot of code on a day-to-day basis; instead, they spend a lot of time on design reviews and discussions.
The following heatmap summarizes the preceding narrative. Job titles are in rows, and data science project stages are in columns. Darker cells indicate deeper involvement of a particular professional in a given stage of the process.
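For readers who prefer a data structure to a picture, the role-to-stage mapping described in the text can be written down as a small lookup table. This is only a coarse sketch: it marks the stages the narrative names as a primary focus for each role with a binary flag, and it does not attempt to reproduce the graded shading of the heatmap. The names `STAGES`, `PRIMARY_STAGES`, `involvement_matrix`, and `render` are my own and purely illustrative.

```python
STAGES = [
    "Organize Sources",
    "Frame Question",
    "Construct Dataset",
    "Train Model",
    "Present Results",
    "Deploy Model",
]

# Primary stages per role, as named in the narrative above.
# Binary approximation: a stage is either called out as a focus or it is not.
PRIMARY_STAGES = {
    "Data Scientist": {
        "Frame Question", "Construct Dataset", "Train Model", "Present Results",
    },
    "Data Analyst": {"Construct Dataset"},
    "BI Developer": {"Organize Sources", "Construct Dataset"},
    "Data Engineer": {"Organize Sources", "Construct Dataset"},
    "ML Engineer": {"Train Model", "Deploy Model"},
    "Algorithms Engineer": {"Train Model"},
    "Data Architect": {"Organize Sources", "Deploy Model"},
}


def involvement_matrix():
    """Return {role: [0/1 per stage]} with columns in STAGES order."""
    return {
        role: [1 if stage in stages else 0 for stage in STAGES]
        for role, stages in PRIMARY_STAGES.items()
    }


def render(matrix):
    """Render the matrix as text: '#' marks a primary stage, '.' anything else."""
    lines = []
    for role, row in matrix.items():
        cells = " ".join("#" if value else "." for value in row)
        lines.append(f"{role:<20}{cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(involvement_matrix()))
```

Each row has one cell per stage, in the same order as the workflow, so the text output can be read the same way as the heatmap: roles in rows, stages in columns.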