What Does an Ideal Data Scientist’s Profile Look Like?

Findings from Analyzing 1000 Indeed Job Postings

(This is part 2 of my Data Science Careers project. You can find the first part here.)

If you are a Data Science job seeker, you have probably wondered which skills to put on your resume to get calls. If you are looking to get into the field, you may have scratched your head many times over which technologies to learn to be an attractive candidate.

Read on; I have the answer for you.

First, we look at the skill requirements for different job titles. (charts follow)
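For readers who want to try this kind of tally on their own data, here is a minimal sketch of counting skill mentions per job title. It is not the exact code behind the charts: the skill list, the `count_skills` helper and the three toy postings are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical skill list, chosen for illustration only.
SKILLS = ["Python", "SQL", "Scala", "R", "C++"]

def skill_pattern(skill):
    # Match the skill as a whole token, so that "R" does not match
    # inside "Spark" and "Scala" does not match inside "scalable".
    return re.compile(
        r"(?<![\w+#])" + re.escape(skill) + r"(?![\w+#])",
        re.IGNORECASE,
    )

PATTERNS = {skill: skill_pattern(skill) for skill in SKILLS}

def count_skills(postings):
    """Count, per job title, how many postings mention each skill at least once."""
    counts = defaultdict(Counter)
    for title, text in postings:
        for skill, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[title][skill] += 1
    return counts

# Toy postings (invented), standing in for scraped job descriptions.
postings = [
    ("Data Scientist", "Strong Python and SQL skills; R is a plus."),
    ("Data Scientist", "Experience with Python, Spark and Tableau."),
    ("Data Engineer", "Build pipelines with SQL, Scala and Spark."),
]

counts = count_skills(postings)
for title, counter in counts.items():
    print(title, counter.most_common())
```

Counting each skill at most once per posting keeps one long, repetitive job description from skewing the totals. Short names such as R are the tricky part, which is why the pattern insists on whole-token matches.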

No More Debate between Python and R Since Python Is the Dominating Leader Now

There was once a debate about whether Python or R is the language of choice in Data Science. Market demand now clearly tells us that Python is the leader. It’s also worth noting that R got even fewer mentions than SAS. Therefore, if you are considering getting into Data Science, consider focusing your learning efforts on Python. SQL, as the language of databases (and maybe data too!), comes in as the second most important language for Data Scientists. Because of the broad nature of the Data Scientist profession, other languages also play important roles.

In summary, the top languages for Data Scientists are: Python, SQL, Scala, Lua, Java, SAS, R, C++ and Matlab.

Languages Required for Machine Learning Engineers are More Diverse

Python, as the de facto language of Machine Learning, unsurprisingly comes in as the top language for Machine Learning Engineers. Because of the need to implement algorithms from scratch and deploy ML models in big data environments, relevant languages such as C++ and Scala are also important. Overall, the demand for languages seems to be more spread out than for the other two roles.

In summary, top languages for Machine Learning Engineers are: Python, Scala, Java, C++, Lua, SQL, Javascript, Matlab, CSS and C#.

SQL Is an Absolute Must If You Want to Be a Data Engineer

Data Engineers deal with databases all the time and SQL is the database language, so it is no wonder that SQL is the top language. Python is important, but it still loses to Scala and Java, since these languages help Data Engineers handle big data.

In summary, top languages for Data Engineers are: SQL, Scala, Java, Python and Lua.

Scala is Emerging as the Second Most Important Language in Data Science (not R)

When we examine across different roles, interestingly, Scala comes up as either second or third. So we can say the top three languages in Data Science are Python, SQL and Scala. If you are thinking of learning a new language, consider Scala!

Spark is the Top Big Data Skill Except for Data Engineers

For Data Engineers only, Hadoop is mentioned a bit more than Spark, but overall, Spark is definitely the big data framework one should learn first. Cassandra is more important for engineers than scientists, while Storm seems to be only relevant for Data Engineers.

In summary, the top Big Data technologies for data science are: Spark, Hadoop, Kafka, Hive.

TensorFlow is the King When It Comes to Deep Learning

Deep Learning frameworks are hardly mentioned in Data Engineer job postings, so it appears DL frameworks are not required for this role. The most mentions of DL frameworks come from Machine Learning Engineer roles, indicating that ML Engineers do deal with Machine Learning modeling a lot, and not just model deployment. Furthermore, TensorFlow is definitely dominating the deep learning field. Although Keras, as a high-level Deep Learning framework, is really popular for Data Scientists, it’s almost irrelevant for Machine Learning Engineer roles, probably indicating that ML practitioners mostly use lower-level frameworks such as TensorFlow.

In summary, the most important Deep Learning frameworks in Data Science are: TensorFlow, Torch, Caffe, and MXNet.

AWS Dominates across the Board

Computer Vision is Where Most of the Demand Comes from in Machine Learning

For general Data Scientists, Natural Language Processing is the biggest ML application area, followed by Computer Vision, Speech Recognition, Fraud Detection and Recommender Systems. Interestingly, for Machine Learning Engineers, the biggest demand comes from Computer Vision alone, with Natural Language Processing a distant second. Data Engineers, on the other hand, are again the focused specialists: none of these ML application areas are relevant for them.

Insight — If you want to become a Data Scientist, you can choose various types of projects to build to show your expertise based on the area you want to get into, but for Machine Learning Engineers, Computer Vision is the way to go!

When It Comes to Visualization, Tableau is a Must

Visualization tools are mostly in demand for Data Scientists and get very few mentions for either Data Engineers or Machine Learning Engineers. However, Tableau is the top choice for all the roles. For Data Scientists, Shiny, Matplotlib, ggplot and Seaborn seem to be equally important.

Git Is Important for Everyone, While Docker is Only for Engineers

Next, we use word clouds to explore the most frequent keywords for each role and combine with the corresponding skills to build the ideal profiles for all the Data Science roles!
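A word cloud is essentially a picture of word frequencies, so the numbers behind it can be sketched in a few lines. This is a rough illustration under my own assumptions, not the code used for the clouds in this post: the stop-word list is deliberately tiny, and the `keyword_frequencies` helper and the two sample postings are invented.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "and", "the", "to", "of", "in", "with", "for", "we", "are"}

def keyword_frequencies(texts, top_n=5):
    """Return the top_n most frequent non-stop-words across the given texts."""
    counter = Counter()
    for text in texts:
        # Lowercase the text and keep alphabetic tokens only.
        words = re.findall(r"[a-z]+", text.lower())
        counter.update(word for word in words if word not in STOP_WORDS)
    return counter.most_common(top_n)

# Toy postings (invented), standing in for one role's job descriptions.
data_engineer_postings = [
    "Design and build data pipelines to support the product.",
    "Develop pipelines and maintain the database for product teams.",
]

top_keywords = keyword_frequencies(data_engineer_postings, top_n=3)
print(top_keywords)
```

Running the same function once per role gives frequency tables that can be compared side by side, which is the comparison the sections below make by eye.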

Data Scientist is More about Machine Learning than Business or Analytics

The Data Scientist has been regarded as an all-around profession that requires statistics, analytics, machine learning and business knowledge. That still seems to be the case, or at least employers still expect a variety of skills from a Data Scientist. However, the role now definitely seems to be more about Machine Learning than anything else.

Other top requirements include: business, management, communication, research, development, analytics, product, technical, statistics, algorithm, models, customer/client and computer science.

Machine Learning Engineers are about Research, System Design and Building

Compared with general Data Scientists, Machine Learning Engineers definitely seem to have a more focused portfolio, which includes research, design and engineering. Clearly, solution, product, software and system are the dominating themes. Accompanying those are research, algorithm, AI, deep learning and computer vision. Interestingly, terms such as business, management, customer and communication also seem to be important. This can be investigated in a future iteration of this project. On the other hand, pipeline and platform also stand out, confirming the common understanding of the Machine Learning Engineer’s responsibility for building data pipelines to deploy ML systems.

Data Engineer Is the Real Specialist

Data Engineers have an even more focused portfolio than Machine Learning Engineers. Clearly, the focus is to support product, system and solution through designing and developing pipelines. Top requirements include technical skills, database, built, testing, environment, and quality. Machine learning is also important, possibly because the pipelines are mainly built to support the data needs of ML model deployment.

That’s it! I hope this project helps you understand what employers are looking for, and most importantly helps you make informed decisions about how to customize your resume and what technologies to learn! If you like the post, I would appreciate your claps, thank you!