Updated: Aug 24, 2019
Data science is a multidisciplinary field that requires knowledge of mathematics, technology, and the problem domain.
Based on the business requirements, the types of analysis needed are:
Exploratory analysis is the process of analyzing a dataset to summarize it or get an overview of it. It is often done with visual methods, using libraries like matplotlib and D3.js or applications like Tableau.
Predictive analysis is the major branch of data science, where models are created from existing data to make predictions on future or unknown data.
Prescriptive analysis extends predictive analysis: it not only predicts what will happen but also suggests decision options to change the outcome.
Interpretative phenomenological analysis (IPA) is an approach used in psychological research.
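As a minimal sketch of what exploratory analysis looks like in practice, the snippet below computes summary statistics for a small toy sample (the sales figures are made up for illustration) using only Python's standard library:

```python
import statistics

# Hypothetical daily sales figures; the last value is a deliberate outlier.
sales = [120, 135, 150, 110, 160, 145, 980]

summary = {
    "count": len(sales),
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
    "min": min(sales),
    "max": max(sales),
}

for key, value in summary.items():
    # Format floats to one decimal place; print integers as-is.
    print(f"{key}: {value:.1f}" if isinstance(value, float) else f"{key}: {value}")
```

Even this tiny summary already tells a story: the outlier pulls the mean well above the median, which is exactly the kind of observation that exploratory analysis (and its visual counterparts) is meant to surface.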
Visualization - For exploratory analysis, Tableau is a popular tool for creating interactive data visualizations. D3.js is an open-source library used to create visualizations inside web pages.
Programming Languages - Python and R are the languages most used by data scientists. Python is useful for creating end-to-end products, as it can also be used to build websites. R is preferred for research purposes.
For dealing with large amounts of data, open-source big data tools like Spark, Hive, and Hadoop are useful.
Data Science Life-cycle
Defining the objective
The first step is to define the objective by discussing with customers or stakeholders to identify the business problems and define the target metric for the project.
Collecting the data
The next step is to acquire the relevant data from direct sources, such as analytics, or from third-party sources if necessary. High-quality data is an important requirement of a data science project.
Understanding the data
Before training a model, it is important to explore the data first. Most production data contains missing values and errors; these should be handled using domain knowledge and available algorithms. The data may also be normalized and transformed for better model training.
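As an illustration of handling missing values and transforming data, here is a minimal sketch using mean imputation and min-max normalization on a toy column (the values and the choice of techniques are assumptions for illustration, not a prescription):

```python
import statistics

# Toy column with missing entries (None), a common situation in production data.
ages = [34, None, 29, 41, None, 38]

# Mean imputation: replace each missing value with the mean of observed values.
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)
imputed = [a if a is not None else mean_age for a in ages]

# Min-max normalization to [0, 1], a common transformation before training.
lo, hi = min(imputed), max(imputed)
normalized = [(a - lo) / (hi - lo) for a in imputed]
```

In practice, the right imputation strategy depends on the domain: for example, a missing sensor reading and a missing survey answer usually call for different treatments.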
Creating a model
Out of all the columns available in the dataset, choosing the relevant ones is an important task; together with constructing new columns from existing ones, this is called feature engineering. It takes exploration of the data and domain expertise to decide on the features to use for training the model.
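One simple way to screen candidate features is to rank them by their correlation with the target. The sketch below uses toy data and hypothetical feature names, computing the Pearson correlation from scratch:

```python
import math

# Toy dataset: two candidate features and a target (hypothetical values).
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]   # strongly related to the target
feature_b = [5.0, 1.0, 4.0, 2.0, 3.0]   # weakly related to the target
target    = [2.1, 4.0, 6.2, 8.1, 9.9]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Rank candidate features by absolute correlation with the target.
scores = {"feature_a": pearson(feature_a, target),
          "feature_b": pearson(feature_b, target)}
```

Correlation only captures linear relationships, which is one reason domain expertise remains essential when deciding which features to keep.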
Based on the problem statement of the project, there are different types of models to choose from. The models can be compared with each other using metrics such as accuracy.
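As a minimal example of comparing models by accuracy, the sketch below scores two hypothetical sets of predictions against the same labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical binary labels and predictions from two candidate models.
labels  = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 1, 0, 0]
model_b = [1, 1, 1, 0, 0, 0, 0, 1]

acc_a = accuracy(labels, model_a)  # model_a matches 7 of 8 labels
acc_b = accuracy(labels, model_b)  # model_b matches 4 of 8 labels
```

Accuracy is only one option; for imbalanced datasets, metrics such as precision, recall, or F1 score are often more informative.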
The model creation includes the following steps:
Split the data randomly into train, validation, and test sets. The most common approach is to use 50%-70% of the data for training, 20% for validation, and 10% for testing, but this can vary based on the dataset.
Build the model using the training dataset, use the validation data to fine-tune hyperparameters, and retrain the model on the training data.
Evaluate the model - After the model is finalized using the training and validation datasets, evaluate its accuracy on the test data.
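The splitting step above can be sketched in plain Python; the 70/20/10 proportions follow the ranges mentioned, and the fixed random seed is just for reproducibility:

```python
import random

# Toy dataset of 100 records; in practice each record would be a feature row.
records = list(range(100))

# Shuffle before splitting so each subset is a random sample of the data.
random.seed(42)
random.shuffle(records)

# Split 70% / 20% / 10% into train, validation, and test sets.
n = len(records)
train = records[: int(0.7 * n)]
validation = records[int(0.7 * n): int(0.9 * n)]
test = records[int(0.9 * n):]
```

For classification problems with rare classes, a stratified split (preserving class proportions in each subset) is usually preferable to a purely random one.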
Deploying the model
Decide whether the accuracy of the model is sufficient for production use. If not, try training different models, and collect more data if necessary. Once the model is finalized, deploy it to the web so that users can get predictions on their data. APIs can be used to serve predictions to other applications as well.
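As a rough sketch of serving a model over the web, the snippet below wraps a hypothetical predict function in a minimal JSON API using Python's standard library (the linear model weights and the request shape are illustrative assumptions; real deployments typically use a proper web framework):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained model": fixed linear coefficients for illustration only.
WEIGHTS = [0.5, -1.2]
BIAS = 0.3

def predict(features):
    """Score one feature vector with the illustrative linear model."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"features": [1.0, 2.0]}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve predictions on port 8000:
# HTTPServer(("", 8000), PredictionHandler).serve_forever()
```

A client application would then POST its feature vectors to this endpoint and receive predictions as JSON, which is what allows other applications to consume the model through an API.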
About Data Science Authority
Data Science Authority is a company engaged in Training, Product Development and Consulting in the field of Data Science and Artificial Intelligence. It is built and run by highly qualified professionals with more than 10 years of working experience in Data Science. DSA’s vision is to inculcate data thinking into individuals irrespective of domain, sector, or profession, and to drive innovation using Artificial Intelligence.
Data Science Authority | Data Science Training in Hyderabad