5 Most Common Mistakes Data Scientists Make When Handling Data

Albert Christopher
5 min read · May 9, 2022


Well-managed analytics initiatives can result in gold for your organization. However, if you make one of these typical mistakes, your data science activities can spiral out of control very quickly.

Data science is one of the most in-demand careers right now, and for good reason. Every day, around 2.5 quintillion bytes of data are created! Both Glassdoor’s list of best careers in 2021 and LinkedIn’s Emerging Positions Report 2021 included data science jobs.

With a median pay of USD 107,801 and a promising future, data science is attracting a lot of job seekers.

Addressing the elephant in the room: becoming a data scientist is no easy feat. Data science experts with the right skill sets are hard to find, courtesy of the demand for proficient, up-to-date data science skills.

Data science blunders that should not be ignored

Statistics, math, machine learning, and data visualization with R, Java, SQL, or Python are all required and vital skills for data scientists. Several online video tutorials and courses do not cover all of the needs of the sector. As a result, there are a few common blunders that rookie data scientists make.

I have achieved several major milestones in my career as a data scientist, but I have also made several mistakes along the way. Let us consider some of the blunders most often made in data science so we can learn from them and help individuals who are interested in the area succeed.

Let us start by looking at a case study to see how mistakes, big or small, can lead to massive disasters for businesses.

Case study of Microsoft Tay Bot

In March 2016, Microsoft launched a chatbot dubbed “Tay” on Twitter. Tay was supposed to talk like a youngster, but it lasted only a day because it began tweeting bigoted and hateful things. As an artificial intelligence system, Tay learned how to speak from the people she was talking to.

After shutting her off for her racist comments, Microsoft said that they were caused in part by online “trolls” who were attempting to force the technology into racist chats.

Since 2016, the firm has tweaked its AI models and produced a new “lawyer bot” that can provide legal assistance to users via the internet. According to a spokeswoman, Tay’s issue stemmed from the “content-neutral algorithm,” and key questions like “how can this harm someone?” should be explored before implementing these types of AI initiatives.

It is worth taking heed of the mistakes that data scientists make, or may make in the future, and that could be avoided.

1) Missing data annotations and using corrupted data

Collecting and cleaning data takes up 60 percent of a data scientist’s effort. This is the least delightful task, but it is a necessary step. All subsequent processes must be carried out on clean data that serves as the foundation for a machine learning task.

Data annotation is the process of appropriately labeling data in preparation for machine learning. To build ML models, data scientists require a huge volume of accurately annotated data, notably image and video data.

Working with corrupted data lacking data annotations is analogous to attempting to bake cookies without the proper ingredients. Will your cookies be crisp and delicious? No!

As depicted in the aforementioned diagram, corrupted data leads to inaccurate model construction. For accurate model creation, data must be free of mistakes and outliers.
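As a rough illustration of the point about annotations and corrupted data, here is a minimal sketch of a pre-training check. The record layout and the names `split_clean_and_rejected`, `width`, `height`, and `label` are assumptions made up for this example, not part of any particular library.

```python
import math


def split_clean_and_rejected(records, label_key="label", feature_keys=("width", "height")):
    """Separate usable records from ones with a missing label or a corrupted feature."""
    clean, rejected = [], []
    for record in records:
        label = record.get(label_key)
        has_label = label is not None and str(label).strip() != ""
        # A feature counts as valid only if it is a finite number (no None, NaN, or inf).
        features_ok = all(
            isinstance(record.get(key), (int, float))
            and not isinstance(record.get(key), bool)
            and math.isfinite(record.get(key))
            for key in feature_keys
        )
        if has_label and features_ok:
            clean.append(record)
        else:
            rejected.append(record)
    return clean, rejected


# Hypothetical image metadata with one good row and two bad ones.
samples = [
    {"width": 640, "height": 480, "label": "cat"},
    {"width": 640, "height": float("nan"), "label": "dog"},  # corrupted feature
    {"width": 800, "height": 600, "label": None},            # missing annotation
]
clean, rejected = split_clean_and_rejected(samples)
print(len(clean), "clean;", len(rejected), "rejected")
```

Keeping the rejected records instead of silently dropping them makes it easier to see how much of the data set needs re-annotation.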

2) Analyzing without any plans or questions

Before you begin the analysis, you must first decide on the direction you want to go and the technique you will employ. Any data science project should begin with a clearly defined goal. Data scientists sometimes jump right into modeling and analysis without first considering the problems they are seeking to solve.

“Why?” is the question that data scientists try to answer and not “what.” When answering “why” queries, data scientists must be clear about their aims.

For instance:

Before you begin working on any project, you must first determine whether the problem you’re trying to solve is an unsupervised ML problem (no labeled target) or a supervised one (a labeled target to predict). You won’t be able to assess whether the answer works unless you know what the problem is.
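One simple way to make this first question explicit is to check whether the data actually carries a target. The sketch below is only an illustration: `frame_problem`, the field names, and the wording in `EVALUATION` are hypothetical, and partially labeled data (which this sketch treats as unsupervised) would need a closer look in practice.

```python
def frame_problem(records, target_key):
    """Return 'supervised' when every record carries a target value, else 'unsupervised'."""
    if not records:
        raise ValueError("cannot frame a problem without data")
    has_target = all(record.get(target_key) is not None for record in records)
    return "supervised" if has_target else "unsupervised"


# Hypothetical mapping from framing to how success would be judged.
EVALUATION = {
    "supervised": "compare predictions with held-out labels",
    "unsupervised": "inspect cluster quality or reconstruction error",
}

churn = [{"tenure": 3, "churned": True}, {"tenure": 40, "churned": False}]
visits = [{"pages": 4}, {"pages": 11}]

for name, data, target in [("churn", churn, "churned"), ("visits", visits, "segment")]:
    framing = frame_problem(data, target)
    print(name, "->", framing, "->", EVALUATION[framing])
```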

When data scientists don’t know what they’re looking for, they frequently provide unsatisfactory results. To achieve your goal, you must ask yourself certain questions.

3) Using identical functions for a variety of issues

The same functions cannot be applied to every problem. Even so, some rookie data scientists may be tempted to use the same courses, functions, tools, and so on for each challenge.

Every problem is different, and each solution should reflect that. Text data, time series data, and other types of data must all be processed differently.

There are numerous forms of data, each of which requires its own treatment. Just as there are general machine learning libraries, there are dedicated NLP libraries such as the Natural Language Toolkit (NLTK). To handle images and videos, we use convolutional neural networks, and for time series data we use time series analysis techniques.

Similarly, the scikit-learn library offers numerous functions for different kinds of problems. Data scientists cannot use natural language processing (NLP) libraries for computer vision challenges involving image recognition, and vice versa.
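To make the idea concrete, here is a small sketch in plain Python that routes each kind of data to its own preprocessing step. The function names and the `PREPROCESSORS` table are invented for this example. A real project would more likely rely on NLTK, scikit-learn, or a deep learning framework, and the tokenizer and moving average here are deliberately simplistic stand-ins.

```python
import re


def preprocess_text(document):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", document.lower())


def preprocess_time_series(values, window=3):
    """Smooth a numeric sequence with a simple moving average."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]


# Each kind of data gets its own treatment; there is no one-size-fits-all function.
PREPROCESSORS = {
    "text": preprocess_text,
    "time_series": preprocess_time_series,
}


def preprocess(kind, data):
    """Dispatch to the preprocessing step registered for this kind of data."""
    if kind not in PREPROCESSORS:
        raise ValueError(f"no preprocessing defined for data kind {kind!r}")
    return PREPROCESSORS[kind](data)


print(preprocess("text", "Data is Gold!"))
print(preprocess("time_series", [1, 2, 3, 4, 5]))
```

Asking for a kind that has no registered step (for example `"image"`) raises an error instead of quietly reusing the wrong function, which is the mistake this section warns about.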

4) Not considering a model as a component of a life-cycle

This is something that many data scientists overlook, because more than half of projects never make it to production and remain in the Proof Of Concept (POC) stage.

The lifecycle of a machine learning model begins with the business need and proceeds through the basic sequence:

  • Training an ML algorithm
  • Evaluating and testing the algorithm with the proper metrics
  • Deploying the model once it meets minimum performance standards (such as latency)
  • Monitoring the model, retraining it, and collecting feedback
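The evaluation and deployment stages above can be sketched as a simple gate that a model must pass before release. Everything here is an assumption for illustration: the helper names, the stand-in `predict` function, and the thresholds of 0.9 accuracy and 50 ms latency are made up, and real standards would come from the business need.

```python
import time


def accuracy(predict, inputs, labels):
    """Fraction of inputs for which the prediction matches the label."""
    correct = sum(predict(x) == y for x, y in zip(inputs, labels))
    return correct / len(labels)


def measure_latency_ms(predict, inputs):
    """Average wall-clock time per prediction, in milliseconds."""
    start = time.perf_counter()
    for item in inputs:
        predict(item)
    return (time.perf_counter() - start) * 1000 / len(inputs)


def ready_to_deploy(acc, latency_ms, min_accuracy=0.9, max_latency_ms=50.0):
    """Deployment gate: both the quality and the latency standard must be met."""
    return acc >= min_accuracy and latency_ms <= max_latency_ms


def predict(x):
    """Stand-in for a trained model: a fixed threshold classifier."""
    return x >= 0.5


inputs = [0.1, 0.4, 0.6, 0.9]
labels = [False, False, True, True]

acc = accuracy(predict, inputs, labels)
latency = measure_latency_ms(predict, inputs)
print("accuracy:", acc, "deploy:", ready_to_deploy(acc, latency))
```

The same two measurements can be repeated on live traffic after release, which is one way to approach the monitoring stage.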

Each stage comes with its own set of technological requirements. As a data scientist, you will mostly be asked about model training and data exploration, but keeping the larger picture in mind will help you make the right judgments early on.

If you know your client’s infrastructure has restricted resources, for example, you can design your model with this limitation in mind from the start: a simpler design could allow you to make faster inferences.

5) Paying little to no attention to communication skills

This is perhaps the most common blunder made by data scientists. Solving a data science problem is one skill; communicating the solution to a non-technical audience is another.

Presenting your findings to stakeholders is an important element of being a data scientist in a company. Being able to pivot from a technical discourse to showcasing commercial value in plain, human words is incredibly beneficial.

You’ll almost certainly showcase your work to a commercial sponsor at some point. These individuals are not technical and will never be part of your team. They only pay attention to what is important to them. So here’s my advice: be straightforward, simple, and to the point.

To conclude

Every new problem is an opportunity to learn and grow as a data scientist. When you are starting out in your profession, do not be scared by these blunders. They will undoubtedly teach you how to deal with various machine learning challenges in practice.

Mistakes happen, and they act as a means of advancement. What is important is to learn from them and never make the same error twice!


Albert Christopher

AI Researcher, Writer, Tech Geek. Contributing to Data Science & Deep Learning Projects. #coding #algorithms #machinelearning