
Kirill Eremenko: Confident Data Skills; Master the fundamentals of working with data and supercharge your career

In today’s Digital Age, data is used to shape the tales of who we are, how we present ourselves, what we enjoy and when we want things. Data is everywhere around us. Actually, we are data too. Our DNA provides the most elementary form of data. We are vessels of this data, walking flash drives of biochemical information, passing it on to our children and “coding” them with a mix of data from ourselves and our partners.

We are not only data, but we also create data. When we interact with any touchpoint that collects data, we give off data about ourselves; this is called “data exhaust”. It has always been this way; we’ve just become better at recording and collecting it.

Data is the past, and the past is data. Data is every unit of information; it is the by-product of any and every action. Even if it is not collected or stored in a vault, it is still data. The majority of data is actually not collected or stored, but that is changing. We are now dramatically better at collecting, organizing, analyzing and visualizing data. And even though data can never be the future, we can use it to draw insights and make predictions about the future.

Big data is the name given to datasets with columns and rows so considerable in number that they cannot be captured and processed by conventional hardware and software within a reasonable length of time. For a dataset to be labeled big data, it has to stand out in at least one of three dimensions: volume, velocity or variety.

Storing data today is mainly done in digital format. With the growth of big data, in order to process large sets of data properly – to make connections between elements, and to use those connections to make accurate and meaningful predictions – scientists need to build information carriers that can both manage the data and handle its storage.

Using data to understand human behavior is not killing creativity, as some claim. Even if machines can drive our purchases, we still hold the most valuable information: human desire.

If we look at how data fulfils our needs, we can map it onto Maslow’s pyramid. Data is used for air traffic control and in food production. Data is used in medicine to improve outcomes. In the digital age, data is used to create social connections through social networks. Data is also used to check on people’s performance, create feedback and give people information on where they excel and where they may need further training.

If you want to start in the data science field, you should follow some steps. When you first start, take time to acknowledge where your interests lie, whether that is visualization or machine learning, before you pursue a specific area. Then work on your creative capabilities. Data scientists must get into the mindset of asking the right questions of their data. Data science people do not come only from technical areas. A “raw” data scientist may be able to play with the material, but a data scientist with the right background will be able to ask the right questions of the project in order to produce truly interesting results. If you really want to be a great data scientist, you have to be able to ask the right questions. This is even more important when working with unstructured data, which is why companies prefer to work with subject matter experts in that area. Data science has significantly improved the techniques companies use to access and analyze media: today we can see real-time results, custom reporting through charts and graphs, data filtered to uncover trends by demographic, and text analysis. When you start working in data science, you should practice as much as possible. But whatever you do, keep ethics in mind.

Business intelligence is not the same as data science. It doesn’t carry out detailed investigative analyses on the data; it simply describes what has happened, in a process that we call “descriptive analytics”. If we look at the analytic value escalator, we can identify four stages:

  • Descriptive analytics: What happened?
  • Diagnostic analytics: Why did it happen?
  • Predictive analytics: What will happen?
  • Prescriptive analytics: What should we do?

BI covers the first stage, and data science can help us tick the boxes in the final three. The problem with BI is that data often comes second to an updated report. BI’s traditional dependence on Excel can also teach you bad habits, because Excel can have the effect of oversimplifying things. No data science tool will allow you to mix data with logic: in any database management system, data and logic must be considered separately.

Computers still haven’t reached their processing limits, but we have. The machines are only waiting for three things: access to data, access to faster hardware and access to more advanced algorithms. On the other hand, in data lies potential: it will always tell us something, whether the information is new or not.

The Data Science Process has five stages:

  • Identify the question.
  • Prepare the data.
  • Analyze the data.
  • Visualize the insights.
  • Present the insights.

But before everything else we need to think about how to gather data, since with data we have the advantage of deriving our insights from actual evidence. In the digital age, we can create digital information. It can be used, recaptured, restructured and excerpted, and at the end you can still return to an earlier version once you have finished and start again. In order to use datasets properly, we need to make sure they are “cleaned up”. Data can only speak through a machine or a piece of software; it does not have a “language” of its own.

Identifying the question:

Before we can prepare and analyze our data, we must know what kind of data we need. The question needs to be understood, deconstructed and analyzed. Blindly taking all business issues as project questions can be disastrous for a data scientist. It falls to the data scientist to identify all the parameters of the business problem, because they have been trained to do so. Understanding not only what the problem is but also why it must be resolved now, who its key stakeholders are, and what it will mean for the institution when it is resolved will help you start refining your investigation.

Data-driven projects will often affect more than one area of a company. When working on a project you can use a bottom-up or a top-down approach. The bottom-up approach has its basis in facts, and it makes it possible to reach a conclusion much faster than a top-down approach. However, any project investigator using a bottom-up approach will tell you that it is near impossible to drive change with this method. Companies don’t just run on data; they run on people and relationships. The numbers are just one piece of the puzzle: we also have to understand a company’s culture, its mission and its people.

When gathering information, we need to understand what data we actually have and whether more data gathering is needed. The bottom line for any data science project is to add value to the company. So we need to understand the problem before we do data mining and collect the data. While all data has the potential to be useful, we cannot deploy all the information we have for every problem. That is why it is important to understand the problem, so that we can eliminate data that is irrelevant to our question. The most important questions for data gathering are “Where are our sources?” and “Do we want to use quantitative or qualitative research methods to interrogate them?” Quantitative methods should be used when we want to gather statistical information; qualitative methods use open-ended questions that can result in an infinite number of answers. Before going any further with the project, you should get stakeholder buy-in, preferably in writing.

Data preparation:

Preparing data is all about establishing a common language between human and machine. With so many different people working on a single dataset and using different methods for adding data, the resulting datasets in many organizations are unsurprisingly riddled with errors and gaps. So it is the job of the data scientist to prepare the data so that it can be understood by machines.

Data preparation, or data wrangling, usually takes a lot of time. In the real world, data is usually dirty, messy and corrupt. For proper analysis, data should be:

  • In the right format
  • Free from errors
  • Have all gaps and anomalies accounted for

The method for preparing data is called ETL (a minimal sketch follows the list below):

  • Extract the data from its source
  • Transform the data into a comprehensible language for access in a relational database
  • Load the data into the end source
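
As a rough sketch of these three ETL steps, assuming a hypothetical sales.csv file with an amount column, the flow might look like this in Python, with pandas for the transformation and a local SQLite file standing in for the end source:

```python
# A minimal ETL sketch: extract from a CSV, transform, load into SQLite.
# "sales.csv", "amount" and "warehouse.db" are hypothetical placeholders.
import sqlite3
import pandas as pd

# Extract: read the raw data without touching the original file
raw = pd.read_csv("sales.csv")

# Transform: standardize column names and drop rows with corrupt (non-numeric) amounts
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
clean = raw.dropna(subset=["amount"])

# Load: write the cleaned table into the end source (here a local SQLite database)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)
```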

The end source is usually a data “warehouse”. A data warehouse stores otherwise disparate data in a single system. Oftentimes, it will comprise a relational database. In a relational database, the relationships between the units of information across datasets matter. The datasets in a relational database are linked by columns that share the same name: if multiple datasets contain columns with the same header, data from those columns can be compared across them. Compared to Excel, a relational database:

  • Maintains integrity
  • Combines datasets
  • Is scalable

We extract data in order to ensure we don’t alter the original source, and because data usually comes from different locations. The simplest type of raw data is the CSV file (comma-separated values). Notepad++ and EditPad Lite are two tools the author uses to work with raw data.

The transforming step includes alterations such as joining, splitting and aggregating data, but the most important function of transformation is data cleaning. When we talk about dirty data, we can have incorrect, corrupt or missing data. In order to fix corrupt data, we can: re-extract it from its original file, talk to the person in charge of the data, or exclude the rows that contain corrupt data from the analysis. For missing data, we can: predict it with 100% accuracy, leave the record as it is, remove the record entirely, replace the missing data with the mean/median value, fill it in by exploring correlations and similarities, or introduce a dummy variable for missing data. When dealing with data, we should also be careful about outliers, data points that fall outside the normal distribution. When you load your data into the warehouse, do a final check that the loading was done properly. Quality assurance is one of the most important aspects of data preparation.
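
A minimal sketch of a few of these cleaning options in pandas, on a made-up table with gaps and one obvious outlier (the column names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and a likely data-entry error (age 290)
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 45, 38, 290],
    "income": [42000, 51000, 47000, np.nan, 60000, 58000],
})

# Introduce a dummy variable marking where income was missing, then fill with the median
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

# Replace missing ages with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Flag outliers that sit far outside the normal range (here: more than 3 standard deviations)
df["age_outlier"] = (df["age"] - df["age"].mean()).abs() > 3 * df["age"].std()
print(df)
```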

Data Analysis:

This is the most technical part of data science. The most basic algorithms can be divided into three groups:

  • Classification
  • Clustering
  • Reinforcement learning

Classification:

We use classification when we already know the groups into which we want the analysis to place our data. We can create a classification algorithm only from our case history. In order of difficulty, the main classification algorithms are:

  • Decision Tree
  • Random Forest Classification
  • K-Nearest Neighbors (K-NN)
  • Naive Bayes
  • Logistic Regression

A decision tree functions similarly to a flowchart. It runs tests on individual attributes in your dataset in order to determine the possible outcomes, and continues to add results as further tests are run, only stopping when all outcomes have been exhausted. In the world of business, decision trees can be used to classify, say, groups of customers.
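
As an illustration (not taken from the book), a tiny decision tree classifier in scikit-learn on a made-up customer table; the features, labels and thresholds are hypothetical:

```python
# A minimal decision-tree classifier on a toy customer table
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [age, yearly_spend]; labels: 1 = premium customer, 0 = regular
X = [[22, 300], [35, 1200], [41, 1500], [19, 150], [50, 2000], [28, 400]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "yearly_spend"]))  # the flowchart of tests

# Classify a new customer
print(tree.predict([[30, 900]]))
```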

Random forest classification builds upon the principles of decision trees through ensemble learning. Instead of there being only one tree, a random forest will use many different trees to make the same prediction, taking the average of the individual trees’ results. To make the decision trees unique, they are created from varying subsets of the dataset. The steps to random forest classification are:

  • Choose the number of trees you want to build
  • Fit the classifier to the training set

A random forest is used over a decision tree when you work with a large dataset.
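
A minimal scikit-learn sketch of the two steps above, choosing the number of trees via n_estimators and fitting the classifier to a training set; the data is generated rather than real:

```python
# Random forest classification: choose the number of trees, then fit to the training set
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees
forest.fit(X_train, y_train)                                       # fit the classifier
print("test accuracy:", forest.score(X_test, y_test))
```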

K-nearest neighbors uses patterns in the data to place new data points in the relevant categories. K-NN analyzes “likeness”: it works by calculating the distance between your new data point and the existing data points. K-NN assumes that even unknown features of patients will be similar, provided that some known features are alike. The steps to K-NN are:

  • Choose the number of neighbors k for your algorithm.
  • Measure the (Euclidean) distance between the new data point and all existing points.
  • Count the number of data points in each category among the k nearest neighbors.
  • Assign the new data point to the category with the most neighbors.

While K-NN is a good method for making accurate predictions, it is important to note that its result will not be correct every single time. The main disadvantage of K-NN is that it takes a very long time to compute.
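
A short scikit-learn sketch of these steps on toy values, with k set to 3 and Euclidean distance as the metric; the feature values and class labels are made up:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two known features per record (hypothetical), with a known class label
X = [[1.0, 2.1], [1.2, 1.9], [3.5, 4.0], [3.8, 4.2], [0.9, 1.8], [3.6, 3.9]]
y = ["A", "A", "B", "B", "A", "B"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")  # choose k = 3
knn.fit(X, y)

print(knn.predict([[3.4, 3.7]]))        # category with the most of the 3 nearest neighbors
print(knn.predict_proba([[3.4, 3.7]]))  # share of neighbors per category
```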

Naive Bayes is named after Bayes’ theorem, which enables mathematicians to express probabilities of events in such a way that any newly uncovered evidence can be easily included in the algorithm to dynamically update the probability value. In Bayesian statistics we have probability, conditional probability, prior probability and posterior probability (the last one is what we are interested in calculating). Why is the Bayes formula so important? Because ignoring the bigger picture can lead to hasty and oftentimes incorrect conclusions. We must update the Bayes formula as new evidence comes in; only in this way can we ensure that we have the most up-to-date picture of a problem. Sometimes we may need to actively seek new evidence to help us reach more accurate conclusions. Naive Bayes relies on a strong, naive independence assumption: that the features of the dataset are independent of each other. Despite this naive assumption, the Naive Bayes algorithm has proved to work very well in many complex applications such as e-mail spam detection. Naive Bayes uses a data point’s variables to place it into the best-suited class. The steps are:

  • Work out the prior probability
  • Calculate the marginal likelihood
  • Calculate the likelihood (in Bayes’ theorem, the likelihood is a conditional probability)
  • Calculate the posterior probability
  • Derive the posterior probability of the opposite scenario
  • Compare the two probabilities

Naive Bayes is good for non-linear problems, where classes cannot be separated with a straight line on the scatter plot, and for datasets containing outliers. The drawback to using it is that the naive assumptions it makes can create bias. It is also important to note that Naive Bayes belongs to the probabilistic family of classification algorithms, while K-NN belongs to the deterministic family: K-NN assigns a new observation to a single class, while Naive Bayes assigns a probability distribution across all classes.
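
A minimal Gaussian Naive Bayes sketch in scikit-learn on toy data, showing the probability distribution across classes mentioned above:

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-feature observations with known classes
X = [[1.0, 2.0], [1.1, 1.8], [3.0, 3.5], [3.2, 3.8], [0.9, 2.1], [3.1, 3.6]]
y = [0, 0, 1, 1, 0, 1]

nb = GaussianNB().fit(X, y)
print(nb.predict_proba([[2.0, 2.8]]))  # posterior probability for each class
print(nb.predict([[2.0, 2.8]]))        # class with the highest posterior
```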

Logistic regression is actually not a regression algorithm: it is a type of classification method. Logistic regression has its roots in linear regression, but we need to create a logistic regression function from the straight regression line. We can do that even with categorical variables (yes-or-no type variables). Once we have created the logistic regression line, we can use it to make predictions for new data: we find the probability for each value, but we also set up restrictions. Logistic regression is good for analyzing the likelihood of a customer’s interest in your product, evaluating the response of customers based on their demographic data, and defining which variable is statistically most important.
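
As a sketch, a logistic regression on a hypothetical yes/no outcome (hours a customer spent on a site versus whether they bought), showing the probability output and the thresholded class:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: hours spent on the site; label: bought (1) or not (0)
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2.2]])[0][1])  # probability of "yes" for a new customer
print(model.predict([[2.2]]))              # class after applying the 0.5 threshold
```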

Clustering:

If you don’t know what the groups resulting from an analysis might be, you should use a clustering technique. Clustering algorithms allow us to use data to discover new possibilities and patterns, and to highlight new areas that we may not even have considered, rather than merely responding to our initial question.

Two clustering algorithms are:

  • K-means
  • Hierarchical clustering

K-means discovers statistically significant categories or groups in our dataset. It is perfect in situations where we have two or more independent variables in a dataset and we want to cluster our data points into groups of similar attributes. With K-means we can also tackle higher-dimensional clustering.

Our first step is to select the number of clusters (K) with which we will work. Then we select K random points as centroids; this is just to start somewhere. We assign each data point to the closest centroid, then determine and place the new centroid of each cluster: we calculate where the new center of mass of each cluster lies and move the centroids accordingly. Then we reassign each data point to the new closest centroid. We repeat the re-computation of centroids until data points are no longer reassigned.

When we look for the optimal number of clusters, we use the elbow method and help ourselves with the WCSS (Within-Cluster Sum of Squares). To get our optimal cluster number, we must evaluate how different numbers of clusters perform using the WCSS. As the number of clusters increases, the WCSS value decreases. When we plot those values on a graph, the biggest kink in the curve is a point called the “elbow”, and that is the optimal number of clusters.
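
A minimal sketch of K-means with the elbow method in scikit-learn, on generated data; the WCSS is exposed as the model’s inertia_ attribute:

```python
# Run K-means for several values of K and record the WCSS to find the "elbow"
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares for this K

for k, value in zip(range(1, 9), wcss):
    print(k, round(value, 1))  # the biggest kink in these values marks the elbow
```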

There are two types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering uses a bottom-up approach, starting with single data points and grouping each with its nearest data point until all points are in a single cluster. Divisive clustering begins at the top and works its way down by splitting the single cluster apart. The process for both types of hierarchical clustering is recorded in something called a dendrogram. The distance between clusters can be set as: the distance between their centers of mass, the distance between their two closest points, the distance between their two furthest points, or the average of the previous two options.

Steps to agglomerative hierarchical clustering are:

  • Make each data point a separate cluster
  • Combine the two closest clusters (keep repeating this step until only one cluster remains)
  • Set a threshold

This type of clustering maintains a record of each step in the process. The biggest advantage of using the hierarchical clustering algorithm is its dendrogram, a practical visual tool which allows you to easily see all potential cluster configurations.
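
A short sketch of agglomerative clustering and its dendrogram using SciPy on generated data; Ward linkage is just one way of defining the distance between clusters, and the threshold value here is arbitrary:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

linked = linkage(X, method="ward")   # merge the two closest clusters, step by step
dendrogram(linked)                   # the full record of merges, drawn as a dendrogram
plt.axhline(y=10, linestyle="--")    # a threshold line: cutting here defines the clusters
plt.show()
```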

Reinforcement learning:

Reinforcement learning is ultimately a form of machine learning, and it leans on the concepts of behaviorism to train AI and operate robots. It utilizes the concept of associating success with rewards and unsuccessful progression with punishment. One method, called UCB (upper confidence bound), organizes the approach to finding the optimal result by running tests dynamically, combining exploration (random selection) with exploitation (selection based on prior knowledge). This is unlike A/B testing, mainly used in business, where you can make a decision only once you have tested all the options.

Two of the most used algorithms of reinforcement learning are UCB and Thompson sampling. UCB is deterministic and Thompson sampling is probabilistic. Reinforcement learning starts with no data at all; it is a dynamic strategy that increases in accuracy as additional information is collected.

Steps to UCB are:

  • Assume a starting point
  • Establish the initial confidence bound
  • Conduct trial rounds
  • Isolate and exploit optimal solution

We have two categories of return: the true expected return (theoretical: what the return should be if the experiment were repeated many times, per the law of large numbers) and the observed return (what really happens). UCB is good for finding the most effective advertising campaigns and for managing multiple project finances.
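
A minimal UCB sketch in plain Python, framed as choosing between three advertising campaigns whose true click rates are invented purely for the simulation:

```python
import math
import random

true_rates = [0.05, 0.12, 0.08]   # hypothetical campaign performance (unknown to the algorithm)
counts = [0, 0, 0]                # how many times each campaign was shown
rewards = [0.0, 0.0, 0.0]         # observed returns per campaign

for n in range(1, 10001):
    # Upper confidence bound: observed average + exploration bonus
    ucb = [
        (rewards[i] / counts[i]) + math.sqrt(2 * math.log(n) / counts[i])
        if counts[i] > 0 else float("inf")
        for i in range(3)
    ]
    choice = ucb.index(max(ucb))                               # exploit the highest bound...
    reward = 1 if random.random() < true_rates[choice] else 0  # ...and observe the outcome
    counts[choice] += 1
    rewards[choice] += reward

print(counts)  # most rounds end up on the campaign with the best observed return
```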

Steps to Thompson sampling:

  • Conduct trial rounds – a probability distribution curve is established to estimate where the true expected return might be.
  • Take random values from the distributions.
  • Run the optimal option.
  • Continue to run rounds in order to refine the constructed distribution curves.

Thompson sampling is good for finding the most effective sales funnels and for processing large amounts of customer data to ascertain the best-performing advert. It can also handle time delays.
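
A matching Thompson sampling sketch over the same kind of simulated campaigns; each option keeps a Beta distribution that is sampled each round, and the true rates are again made up:

```python
import random

true_rates = [0.05, 0.12, 0.08]   # hypothetical, unknown to the algorithm
wins = [0, 0, 0]
losses = [0, 0, 0]

for _ in range(10000):
    # Take a random value from each option's distribution
    samples = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(3)]
    choice = samples.index(max(samples))      # run the optimal option for this round
    if random.random() < true_rates[choice]:  # observe the (simulated) outcome
        wins[choice] += 1
    else:
        losses[choice] += 1

print(wins, losses)  # the distributions get refined round by round
```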

Data visualization:

Data visualization is not just about making pretty pictures. Visual analytics is often considered the intermediary stage between data analytics and data visualization. Oftentimes, visualization software will have features for filtering and isolating key trends in our data. By throwing the data into Tableau, Power BI or any other visualization software, we can come across great insights that might otherwise have been harder to see. Data visualization is the process of creating visual aids to help people see and understand information; it couches our data in a context. If visualized well, BI dashboards will engage and persuade your audience to make the changes that you suggest. Since we are living in a very visual world, people today expect to process information in a visual way.

Often visualization requires certain reductions in order for the message to be most effective. What visuals should never do is add information (if they do, you are manipulating your data, which goes against everything in the field). Visualization gives context to numbers. When we compress our data into visual bites, we will inevitably lose a little information. Therefore, it is important to ensure that we are not actually tampering with the truth when we finally come to develop our visuals.

One visualization tool is the Sankey diagram, which shows the movement of data through the size and direction of its arrows. This approach is perfect for visualizing any kind of data flow – whether it is users going through a sales funnel or immigration patterns.
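
A minimal Sankey sketch with Plotly, showing users moving through a made-up sales funnel; the stage names and counts are hypothetical:

```python
import plotly.graph_objects as go

fig = go.Figure(go.Sankey(
    node=dict(label=["Visited site", "Added to cart", "Purchased", "Dropped off"]),
    link=dict(
        source=[0, 0, 1, 1],          # where each flow starts
        target=[1, 3, 2, 3],          # where it ends
        value=[400, 600, 150, 250],   # the width of each arrow = number of users
    ),
))
fig.show()
```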

Data presentation:

How to create a good presentation:

  • Prepare your structure
  • Gather your materials
  • Know your audience
  • State the problem upfront…
  • …then show them benefits
  • Use clear, jargon-free language
  • Tell them a data story
  • Design your slides
  • Practice, practice, practice