Data Science Tutorial: What Is Data Science

Here we will learn what Data Science is and understand the roles, responsibilities, education, and skill set needed for Data Scientists:

The data scientist role has been called the most exciting job of the 21st century by Harvard Business Review.

There isn’t a standardized definition of data science, as the ideal experience and skill set are rarely found in a single individual. We define it as the process of analyzing large amounts of data to gain insights and make decisions that solve real-world problems.

What does a data scientist do in everyday work? Let us find out.

This data science tutorial explains the role of a data scientist, along with the responsibilities, education, skill set, and experience it requires.

What Is Data Science

To simplify further, a data scientist answers the following questions:

  • What does the data tell me? (commonly known as Descriptive analytics)
  • Why is the data trending in a certain way, and where is it heading? (commonly known as Predictive analytics)
  • What should we do to optimize the outcome? (commonly known as Prescriptive analytics)
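
To make the three questions concrete, here is a minimal sketch in Python using only the standard library. The sales figures and the uplift factors are invented for illustration, and the forecast is a naive extension of the average monthly change, not a recommended forecasting method.

```python
from statistics import mean

# Hypothetical monthly sales figures, invented for illustration.
monthly_sales = [100, 110, 120, 130]

# Descriptive: summarize what has already happened.
average_sales = mean(monthly_sales)

# Predictive: extend the average month-over-month change one step forward.
changes = [later - earlier for earlier, later in zip(monthly_sales, monthly_sales[1:])]
forecast_next_month = monthly_sales[-1] + mean(changes)

# Prescriptive: choose the action with the best expected outcome.
# The uplift factors are assumed values, not measured ones.
expected_uplift = {"no_campaign": 1.00, "email_campaign": 1.05, "discount": 1.02}
best_action = max(
    expected_uplift,
    key=lambda action: forecast_next_month * expected_uplift[action],
)

print(average_sales, forecast_next_month, best_action)
```

Each step builds on the previous one: the description feeds the prediction, and the prediction feeds the recommended action.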

The following diagram shows some of the common roles that a data scientist may perform.

[Figure: data scientist roles]

Future Of Data Science

We feel data science is here to stay for at least another two decades. Big data is a concept that many companies have already implemented to leverage the benefits of data-driven decisions. Data science and big data go hand in hand, as it is the responsibility of data scientists to consume this big data by creating software and algorithms.

Data scientist roles will continue to have a major impact, as big data has already proven to be a phenomenally successful investment.

The following images show the demand for data scientists, based on research by Indeed (a leading job search engine).

[Figure: employer demand for data scientists]

[Figure: job seeker supply of data scientists]

Data Scientist Roles

Data scientists often come from diverse backgrounds, i.e., different skill sets, work experience, and educational backgrounds.

The following are the four fundamental areas a data scientist must be strong in:

  • Domain knowledge
  • Mathematics and statistics (including probability)
  • Computer Science
  • Soft skills

The following image summarises the skill set associated with each of the data science domains.

[Figure: skill set associated with each of the data science domains]

A data scientist role usually falls into one or more of the categories below:

  • Data engineering
  • Data visualization
  • Computer science
  • Maths and statistics
  • Machine learning
  • Cloud computing
  • Business intelligence

Data scientists are often strong in more than one of these areas, but usually not equally strong in all of them.

Is Programming A Must For A Data Scientist Role

To understand the importance of programming in a data science project, one must first understand the requirements, goals, and outcomes associated with such a project.

The following are a few of the deliverables/outcomes of a data science initiative:

  • Data Classification (for example, Confidential, Private, Spam, …).
  • Data Prediction (for example, predicting the likely output from a given set of inputs).
  • Recommendations (for example, Facebook and Google recommendations based on your history).
  • Data Recognition (for example, is it an image, PDF, text, …?).
  • Automated Decision Making (for example, pre-approved loan limits or credit card approval based on history).
  • Risk Analysis (for example, a risk rating).

While it is possible to do much of the above work using business intelligence tools with graphical interfaces (such as Tableau or Microsoft Power BI), these tools cannot match the flexibility of programming languages such as R or Python.
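
As a small illustration of that flexibility, the data classification deliverable listed above can be sketched in a few lines of Python. This is a rule-based toy, not a trained model: the keyword lists, the `classify` function, and the choice of "Private" as the fallback label are all hypothetical, and a real project would learn the rules from labeled data.

```python
# Hypothetical keyword lists; a real project would learn these from labeled data.
LABEL_KEYWORDS = {
    "Spam": {"winner", "free", "prize"},
    "Confidential": {"salary", "password", "contract"},
}


def classify(text: str) -> str:
    """Return the label whose keywords overlap most with the text.

    Falls back to 'Private' (an assumed default) when no keyword matches.
    """
    words = set(text.lower().split())
    scores = {label: len(words & keywords) for label, keywords in LABEL_KEYWORDS.items()}
    best_label = max(scores, key=scores.get)
    return best_label if scores[best_label] > 0 else "Private"


documents = [
    "You are a winner claim your free prize",
    "Attached is the signed contract and salary details",
    "Lunch at noon tomorrow",
]
labels = [classify(doc) for doc in documents]
print(labels)
```

Because the logic is plain code, it can be versioned, tested, and dropped into an automated pipeline, which is harder to do with a purely graphical tool.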

Technologies And Techniques

All these initiatives will, at some stage, use programming to automate work and create pipelines for delivering the goals. A wide variety of technologies and techniques are used to achieve these goals.

Here is a shortlist:

Programming Languages:

  • Python: Libraries such as NumPy, pandas, and Matplotlib are heavily used in data science.
  • R: A common first choice for solving statistics and data mining problems.

Frameworks:

  • TensorFlow: A machine learning framework developed by Google.
  • PyTorch: A machine learning framework developed by Facebook (now Meta).

Techniques:

  • Machine Learning & AI
  • Linear Regression
  • Logistic Regression
  • Clustering
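
To make one of the techniques above concrete, here is a minimal sketch of simple linear regression fitted with the closed-form least-squares formulas. The helper name `fit_line` and the data points are assumptions made for illustration; in practice a library implementation would normally be used instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # prediction for an unseen input x = 6
print(slope, intercept, predicted)
```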

Data Visualisation:

  • Tableau
  • Power BI by Microsoft
  • Google Charts
  • Plotly

Platforms:

  • Anaconda
  • MATLAB
  • IBM Watson

Data Science Strategy

Each team is different, and each strategy is different. However, some roles/personas are common across all data science projects.

[Figure: data science strategy]

Here, we have five high-level personas. Analysts and Data Scientists are focused on data exploration and data analysis. Developers layer onto that by collaborating and publishing as they work with the analysts and the data scientists. Finally, there is the deploy-and-operate phase, which falls more to the Data Engineers and the DevOps/systems engineering team.

These personas come together to create a fully functional environment.

A modern strategy or a workflow can be broken down into the following stages:

  1. Data Exploration
  2. Data Model
  3. Prediction
  4. Collaboration
  5. Deploy

Let us review them below.

#1) Data Exploration

The exploratory stage is when you start looking at the data that is available: finding out where and how it is generated, examining its structure, understanding the relationships that exist or could be derived from it, and finally discussing the opportunities it might offer for providing insights.
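
A first pass at exploration can be sketched with the Python standard library alone. The tiny dataset, its column names, and the `pearson` helper below are hypothetical; the sketch only shows the kind of questions asked at this stage: which columns exist, how much is missing, and whether two columns move together.

```python
import csv
import io
from math import sqrt

# Hypothetical dataset embedded as text so the sketch is self-contained.
RAW = """age,income,city
25,30000,Pune
32,45000,Delhi
47,,Pune
51,80000,Mumbai
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Structure: which columns exist and how many values are missing in each.
columns = list(rows[0].keys())
missing = {col: sum(1 for row in rows if row[col] == "") for col in columns}

# Relationships: keep only rows where both age and income are present.
pairs = [
    (float(row["age"]), float(row["income"]))
    for row in rows
    if row["age"] and row["income"]
]


def pearson(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    spread_x = sqrt(sum((x - mean_x) ** 2 for x, _ in points))
    spread_y = sqrt(sum((y - mean_y) ** 2 for _, y in points))
    return cov / (spread_x * spread_y)


correlation = pearson(pairs)
print(columns, missing, round(correlation, 3))
```

The missing-value counts and the correlation are exactly the kind of findings that would be discussed with stakeholders before any modeling starts.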

#2) Data Model

In this stage, you build a model that captures the characteristics found during exploration, makes the connections within the data, and can then be used to analyze that data effectively.

Through visualization, data modeling also helps communication between data analysts and the users of the systems, and between data analysts and those who will design and build systems for those users.

#3) Prediction

Once you have a data model in place, you can use it to generate predictions. A predictive model helps you not just to understand what has happened in the past and explore historical information, but also to project forward and imagine scenarios of what might be.

#4) Collaboration

In this stage, you start sharing your data and data model, and start collaborating within your team or organization. If you are working publicly outside the organization, you may use GitHub (the most popular platform) to collaborate with others who are interested.

#5) Deploy

This is the final stage, where you deploy the model.

Data Science Framework

Learning data science and implementing a successful project is a long road, and not everyone makes it, so it is crucial to have a framework in place before implementation.

The data science best practice framework provides reusable templates, user guides for the teams to follow, and the resources that enable teams to initialize the project effectively and efficiently, explore the data, develop models, and finally deploy and monitor models in production in a well-defined and well-documented manner.

A typical data science framework must define the standards for each of the modules defined in the following diagram.

[Figure: data science framework]

This framework defines the goals for each phase and maps them back to business benefits.

  1. Project Initialisation
  2. Data Exploration
  3. Model Development
  4. Model Validation
  5. Model Deployment and Monitoring

[Figure: process workflow]

#1) Project Initialisation

Goals of this phase:

  • Agree on the project goal and outcomes with all the relevant stakeholders.
  • Grant all team members the access they require.
  • Agree on the initial data sources you require.

#2) Data Exploration

Goals of this phase:

  • Management:
    • Document and inform stakeholders of issues/limitations of the data.
    • Get the features for model development fully agreed upon and approved.
  • Data:
    • Deep dive and understand the data available and validate whether it is sufficient to proceed with modeling.
    • Understand the data domain.
    • Understand the data limitations.
    • Develop features for input to modeling.
  • Model validation:
    • Gain confidence that your train/test split has no data leakage.
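
One common way to guard against leakage in a train/test split can be sketched as follows: split first, then compute any preprocessing statistics from the training portion only. The feature values, the 80/20 ratio, and the seed are assumptions for illustration, and this covers only one kind of leakage (overlapping rows and statistics computed on test data).

```python
import random
from statistics import mean, pstdev

# Hypothetical single feature with 20 observations.
values = [float(v) for v in range(1, 21)]

# Shuffle a copy with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled = values[:]
rng.shuffle(shuffled)

# Split BEFORE computing any statistics (assumed 80/20 ratio).
split_index = int(len(shuffled) * 0.8)
train, test = shuffled[:split_index], shuffled[split_index:]

# Fit the scaling parameters on the training data only.
train_mean = mean(train)
train_std = pstdev(train)


def scale(value):
    """Standardize a value using statistics learned from the training set."""
    return (value - train_mean) / train_std


train_scaled = [scale(v) for v in train]
test_scaled = [scale(v) for v in test]  # test data never influences the scaler

# Basic leakage check: no observation appears in both sets.
no_overlap = set(train).isdisjoint(test)
print(len(train), len(test), no_overlap)
```

If the mean and standard deviation were computed on the full dataset instead, information about the test rows would leak into training and make the evaluation look better than it should.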

#3) Model Development

Goals of this phase:

  • Management:
    • Document the final model and inform stakeholders about it.
    • Gather the metrics and SLAs required for the model to pass.
  • Model:
    • Develop a model that passes the benchmarks and also meets the coding standards and security requirements.
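
Checking a model against agreed benchmarks can be sketched like this. The labels, predictions, the `evaluate` helper, and the threshold values in `BENCHMARKS` are all hypothetical; the actual metrics and pass marks would come from the stakeholders.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Assumed pass marks; real values are agreed with stakeholders.
BENCHMARKS = {"accuracy": 0.75, "precision": 0.70, "recall": 0.70}

# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

metrics = evaluate(y_true, y_pred)
passed = all(metrics[name] >= threshold for name, threshold in BENCHMARKS.items())
print(metrics, passed)
```

Keeping the thresholds in one place makes it easy to document what "passing" means and to rerun the check whenever the model changes.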

#4) Model Validation

Goals of this phase:

  • Management:
    • The final model is validated and has passed all user acceptance testing.
  • Model Validation:
    • Validate that the model is fit for purpose.
  • Testing:
    • The model developed is fully tested, and all the stakeholders have approved the results.

#5) Model Deployment & Monitoring

Goals of this phase:

  • Model Deployment:
    • Finalize the model so that it can be successfully deployed to the execution environment.
  • Monitoring The Model:
    • Set up logging for the model.
    • Monitor the model in production.
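
Logging and monitoring can be sketched with Python's standard `logging` module. The `predict` function is a stand-in for a deployed model, and `EXPECTED_RANGE` is an assumed range of inputs seen during training; the log is captured in memory here only so the example is self-contained, whereas a real deployment would write to a file or a log service.

```python
import io
import logging

# Capture log output in memory for demonstration purposes.
log_stream = io.StringIO()
logger = logging.getLogger("model_monitor")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# Assumed range of input values seen during training.
EXPECTED_RANGE = (0.0, 10.0)


def predict(x: float) -> float:
    """Stand-in for a deployed model (hypothetical rule: y = 2x + 1)."""
    return 2 * x + 1


def monitored_predict(x: float) -> float:
    """Log every prediction and warn when the input looks unusual."""
    y = predict(x)
    logger.info("input=%s prediction=%s", x, y)
    low, high = EXPECTED_RANGE
    if not low <= x <= high:
        logger.warning("input %s is outside the training range %s", x, EXPECTED_RANGE)
    return y


results = [monitored_predict(x) for x in (1.0, 4.0, 25.0)]
log_lines = log_stream.getvalue().splitlines()
print(results)
print("\n".join(log_lines))
```

The warning on out-of-range inputs is a very simple form of monitoring: it flags cases where the model is being asked about data unlike what it was trained on.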

Business Value Framework (BVF)

The business value outcome of a data science initiative is crucial and goes hand in hand with each phase of the project. As you progress through each phase of model development, you need to adhere to the data science process standard and the business value framework.

The BVF ensures that the value of any new data science use case or data lab requirement can be identified and measured in terms of investment and improvement.

[Figure: Business Value Framework (BVF)]

Frequently Asked Questions

Q #1) What is data science in plain English?

Answer: It is about how we take data, use it to gain knowledge, and then use that knowledge to:

  • Make decisions.
  • Predict the future.
  • Understand past history and present trends.
  • Create new products (for example, new drugs for a pharmaceutical company).

Q #2) What skills are required for becoming a data scientist?

Answer: Strong knowledge of maths and statistics, programming and databases, and, most importantly, the domain are the fundamental skills required in this space. If you know R or Python and a little bit of SQL, you’re already in a pretty good position to get into data science.

Q #3) What are the most popular programming languages in data science?

Answer: R and Python are the most popular, and it is worth learning both. Cloud vendors and the younger generation are going with Python, but R remains a strong choice for certain statistical tasks.

Q #4) Can anyone become a data scientist?

Answer: Many big companies require an advanced degree, such as a Master’s in data science or a Ph.D., to lead in this role. However, many people learn the skills on the job or by contributing to open source, and still land these roles without a degree.

Q #5) How can you acquire a data scientist’s skills?

Answer: It depends on your background and on the time frame in which you want to become a data scientist.

[Figure: how to acquire data science skills]

Conclusion

The data scientist role has become extremely important and highly in demand, and it can make or break a business. Most businesses collect huge amounts of data, and most of the time that data is either ignored or utilized very poorly.

A data science strategy helps to fully leverage the data collected and to make sound data-driven decisions.

Finally, Harvard Business Review was right to call the data scientist role the most exciting job of the 21st century.