Updated: Aug 24, 2019
Most companies are focused on building the right data ecosystem to power their data science and artificial intelligence teams. Let us first understand a company's data science ecosystem through a pyramid structure and identify the various roles involved.
Not every company is ready for AI
AI is at the top of a pyramid of needs. But, more often than not, companies are not ready for AI. The most common scenario is that they have not yet built the infrastructure to implement (and reap the benefits of) even the most basic data science algorithms and operations. AI is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure) before you can harness the power of data. In this regard, every company has a pyramid of data needs, and your role as a data scientist/engineer/analyst will fall somewhere along this spectrum. Understanding this ecosystem is key to properly developing and articulating your current skills and responsibilities, and to planning a career in the data science industry.
Different stages in the pyramid
The first and most important step in the process is an established Data Collection phase. Companies focus on collecting data generated through all possible sources (user-generated content through apps, logs, sensors…) so that it can be put to use by data science teams. Different tools are available in the industry to enable data collection, depending on the type of data or the source that generates it.
Most used big data tools for data ingestion: Sqoop, Flume, Kafka, NiFi.
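To make the ingestion idea concrete, here is a minimal, purely illustrative sketch of what tools like Kafka or Flume provide at scale: sources push events into a buffer, and a consumer drains them in batches. The `EventBuffer` class and its sources are hypothetical, not part of any real tool's API.

```python
from collections import deque

class EventBuffer:
    """Toy stand-in for an ingestion queue (what Kafka/Flume provide at scale):
    sources append events, a consumer drains them in fixed-size batches."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self._queue = deque()

    def ingest(self, source, payload):
        # Tag each event with its source (app, log, sensor, ...)
        self._queue.append({"source": source, "payload": payload})

    def drain(self):
        # Yield batches until the queue is empty
        while self._queue:
            batch = [self._queue.popleft()
                     for _ in range(min(self.batch_size, len(self._queue)))]
            yield batch

buf = EventBuffer(batch_size=2)
buf.ingest("app", {"user": 1, "action": "click"})
buf.ingest("sensor", {"temp": 21.5})
buf.ingest("log", "GET /home 200")
batches = list(buf.drain())  # two batches: [2 events, 1 event]
```

Real ingestion systems add durability, partitioning, and replay on top of this basic produce/consume pattern.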
“More Data generally beats better algorithms”
The next step is Data Storage. Once the data is collected from different sources, it needs to be processed and stored in the right format to enable seamless access. Most companies are building enterprise-wide data platforms to store all the data in one place, with sandboxes built on top for implementing data science use cases. Data flows through various stages/pipelines and generally ends up in Hadoop-enabled data platforms.
Most used tools for data storage: HDFS, Hive, HBase, MongoDB, Cassandra, Teradata, MySQL, DB2, and Oracle
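As a small, hedged illustration of the storage-then-query idea, the sketch below uses an in-memory SQLite database as a stand-in for a warehouse table; the `events` schema and sample rows are invented for the example and do not represent any particular platform.

```python
import sqlite3

# In-memory database as a toy stand-in for a warehouse table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, ts TEXT)")

# Hypothetical rows landed by an ingestion pipeline
rows = [
    (1, "signup",   "2019-08-01"),
    (1, "purchase", "2019-08-03"),
    (2, "signup",   "2019-08-02"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
conn.commit()

# A sandbox-style query on top of the stored data
signup_count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE event = 'signup'").fetchone()[0]
```

The point is the pattern: one shared store, many downstream queries; at enterprise scale the same pattern runs on Hive or HBase rather than SQLite.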
The data should be cleaned and transformed before it is used for analysis. Extracted data often contains dummy values, contradictory records, and missing data points. Data cleaning not only produces a clean dataset but also brings data consistency across the platform. Data transformation can be simple or complex depending on the required changes: standardizing data, character set conversion, encoding handling, splitting or merging fields, converting units of measurement into a standard format, aggregation, consolidation, and removing duplicate data are some of the tasks involved.
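A few of the cleaning and transformation tasks above can be sketched in plain Python. The records, the `"N/A"` dummy marker, and the cm/inch unit mix are all hypothetical; the sketch shows dropping dummy values, standardizing units, and removing duplicates.

```python
def clean_records(records):
    """Drop dummy/missing values, standardize units to cm, and dedupe.
    'N/A' marks a dummy value; heights arrive in cm or inches (hypothetical)."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("height") in (None, "N/A"):
            continue  # absent or dummy data point
        value, unit = rec["height"]
        # Convert units of measurement into a standard format (centimeters)
        cm = round(value * 2.54, 1) if unit == "in" else round(value, 1)
        key = (rec["id"], cm)
        if key in seen:
            continue  # delete duplicate data
        seen.add(key)
        cleaned.append({"id": rec["id"], "height_cm": cm})
    return cleaned

raw = [
    {"id": 1, "height": (170, "cm")},
    {"id": 2, "height": (66, "in")},
    {"id": 1, "height": (170, "cm")},  # duplicate
    {"id": 3, "height": "N/A"},        # dummy value
]
result = clean_records(raw)  # two clean rows survive
```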
Once the data is available in the right format, companies next build what is traditionally thought of as BI or analytics: defining the metrics to track, along with their seasonality and sensitivity to various factors, and perhaps doing some rough user segmentation to see if anything jumps out. However, since the goal is AI, companies also build features to incorporate into machine learning models later. At this stage, it is important to know what to predict or learn, so that data science teams can start preparing training datasets by generating labels, either automatically (which customers churned?) or with humans in the loop.
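Automatic label generation of the "which customers churned?" kind can be as simple as an inactivity rule. The sketch below is illustrative; the 30-day threshold and the user data are assumptions, not a standard definition of churn.

```python
from datetime import date

def label_churn(last_seen_by_user, today, inactive_days=30):
    """Automatic label generation: mark a customer as churned when they
    have been inactive for more than `inactive_days` (illustrative threshold)."""
    return {
        user: (today - last_seen).days > inactive_days
        for user, last_seen in last_seen_by_user.items()
    }

labels = label_churn(
    {"alice": date(2019, 6, 1), "bob": date(2019, 8, 20)},
    today=date(2019, 8, 24),
)
# alice has been inactive 84 days -> churned; bob only 4 days -> not churned
```

Labels produced this way become the targets for the training datasets mentioned above; ambiguous cases are where humans in the loop come in.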
Next, companies can deploy a very simple ML algorithm (like logistic regression), and then think of new signals and features that might affect the results. Bringing in new signals (feature creation, not feature engineering) is what can improve performance by leaps and bounds.
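To ground the "very simple ML algorithm" step, here is a toy one-feature logistic regression trained with gradient descent. The signal (support tickets predicting churn) and the data are invented for illustration; in practice a library implementation would be used.

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Minimal one-feature logistic regression via stochastic gradient descent,
    a toy version of the simple baseline model described above."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid prediction
            w -= lr * (p - y) * x                     # gradient step on weight
            b -= lr * (p - y)                         # gradient step on bias
    return w, b

def predict(w, b, x):
    return 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5

# Hypothetical signal: number of support tickets vs. churn label (0/1)
xs = [0, 1, 2, 5, 6, 7]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
```

A baseline like this is cheap to debug end-to-end; swapping in a new signal just means changing what `xs` measures, which is where the leaps-and-bounds improvements tend to come from.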
At this point, most companies are ready for artificial intelligence. They have built dashboards, labels, and good features, and they are measuring the right things. Once the baseline algorithm has been debugged end-to-end, is running in production, and has been changed a dozen times, the company is ready. The next step is to try all the latest and greatest AI models out there, from rolling their own to using companies that specialize in machine learning.
Different Roles in the Data Pyramid
As companies’ needs differ, so do their staffing strategies. Some prefer specialists (a different person for each layer of the pyramid), whereas others prefer generalists (asking 1–2 people to own large portions of the project). The specialist model gets stronger results at each stage but requires significant communication overhead to coordinate the work across many people. The small overlap in responsibilities breeds clear, smooth ownership but also makes it difficult to speak the same language throughout the entire pyramid. Meanwhile, the generalist model lets people build data science products (i.e., dashboards, models, etc.) more quickly but can often land on local maxima. A high-quality generalist team is also hard to hire for, since it is rare for someone to have such a diverse background.
The specialists for the first three stages of the pyramid are Data Engineers and Data Architects (some may call them Big Data Experts). Their job is to get the data into a unified platform using various tools and to clean and transform it into the right format for data science consumption. The later stages are handled by Data Scientists, whose job is to access the data from the platform and build models that derive insights and aid the company's business processes. Some tasks may overlap slightly between teams, but most of the time each team performs independent tasks depending on its role, to achieve the best possible result for the company.