Building Statistical Thinking to Predict the Future: Machine Learning
How important is it for machine learning engineers to turn data into knowledge?
We have all been taught that knowledge is the most precious thing of all. In our case, knowledge is far more valuable than data, and far harder to come by.
I will discuss the major statistical concepts that will help you approach any dataset or problem statistically.
Probability is the study of random events. Many people intuitively understand degrees of probability, but we will talk about how to make quantitative claims about those degrees.
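As a small illustration of what a quantitative claim looks like, here is a minimal sketch that compares an exact probability with a simulated one. The dice example, the function names, and the number of trials are all assumptions chosen for illustration; they are not part of the original discussion.

```python
import random


def at_least_one_six(num_rolls, rng):
    """Roll a fair die num_rolls times; report whether any roll was a six."""
    return any(rng.randint(1, 6) == 6 for _ in range(num_rolls))


def simulate(num_trials, num_rolls=4, seed=0):
    """Estimate the probability by repeating the experiment many times."""
    rng = random.Random(seed)
    hits = sum(at_least_one_six(num_rolls, rng) for _ in range(num_trials))
    return hits / num_trials


# Exact answer: 1 minus the chance that all four rolls miss a six.
exact = 1 - (5 / 6) ** 4
estimate = simulate(num_trials=100_000)
print(f"exact = {exact:.4f}, simulated = {estimate:.4f}")
```

Intuition says a six in four rolls is "fairly likely"; the calculation turns that into a number that can be checked against repeated experiments.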
Statistics is the discipline of using data samples to support claims about populations. Most statistical analysis is based on probability.
Computation is a tool well suited to quantitative analysis, and computers are routinely used to process statistics.
My interest in statistics began with a simple question: do aliens exist or not? If you Google this question, you will find plenty of discussions. Some people claim they do; others say it’s a myth.
In many of these discussions, people provide data to support their claims. Reports and claims like these are called anecdotal evidence because they are based on data that is unpublished and usually personal. But we might want evidence that is more persuasive and an answer that is more reliable.
Anecdotal evidence usually fails for several reasons:
- Small Numbers of Observations: With only a handful of reports, an apparent pattern can easily be due to chance.
- Selection Bias: People who join a discussion of this question might be interested because they already believe aliens exist. In that case, the process of selecting data would bias the results.
- Confirmation Bias: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
- Inaccuracy: Anecdotes are often personal stories, and they are often misremembered, misrepresented, or repeated inaccurately.
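The first two problems in the list above can be sketched with a simulation. Everything here is made up for illustration: the population, the assumed 20% share of believers, and the assumed chances that believers and skeptics join a discussion are hypothetical values, not real data.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: True marks a believer (assumed share of 20%).
TRUE_RATE = 0.20
population = [rng.random() < TRUE_RATE for _ in range(100_000)]
actual_rate = sum(population) / len(population)


def sample_rates(sample_size, repeats=1_000):
    """Share of believers in each of many random samples of a given size."""
    return [
        sum(rng.sample(population, sample_size)) / sample_size
        for _ in range(repeats)
    ]


# Small numbers of observations: estimates from 10 people swing widely,
# while estimates from 1,000 people cluster near the actual rate.
small_spread = statistics.pstdev(sample_rates(10))
large_spread = statistics.pstdev(sample_rates(1_000))

# Selection bias: assume believers are more likely to join the discussion.
JOIN_IF_BELIEVER = 0.50  # assumed, for illustration only
JOIN_IF_SKEPTIC = 0.10   # assumed, for illustration only
participants = [
    believes
    for believes in population
    if rng.random() < (JOIN_IF_BELIEVER if believes else JOIN_IF_SKEPTIC)
]
biased_rate = sum(participants) / len(participants)

print(f"actual rate in population:   {actual_rate:.3f}")
print(f"spread of estimates, n=10:   {small_spread:.3f}")
print(f"spread of estimates, n=1000: {large_spread:.3f}")
print(f"rate among participants:     {biased_rate:.3f}")
```

Under these assumptions, the share of believers among the people who chose to speak up is much higher than the share in the population, even though nobody misreported anything.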
To get past these problems and reach a more reliable answer to the question, we will take a statistical approach.
Statistical Approach
To address the limitations of anecdotes, we will use the tools of statistics, which include:
- Data Collection: We will use data from a large national survey that was designed explicitly to support statistically valid inferences about the population of interest.
- Descriptive Statistics: We will generate statistics that summarize the data concisely and evaluate different ways to visualize it.
- Exploratory Data Analysis (EDA): We will look for patterns, differences, and other features that address the questions we are interested in. At the same time, we will check for inconsistencies and identify limitations.
- Hypothesis Testing: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect is real or whether it might have happened by chance.
- Estimation: We will use data from a sample to estimate characteristics of the general population.
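To make the last few tools concrete, here is a rough sketch on synthetic data. The two groups and their values are invented, and the specific techniques (a permutation test for the hypothesis test and a percentile bootstrap for the estimate) are one reasonable choice among many, assumed here for illustration rather than taken from the survey analysis itself.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic measurements for two hypothetical groups (not real survey data).
group_a = [rng.gauss(50.0, 10.0) for _ in range(200)]
group_b = [rng.gauss(53.0, 10.0) for _ in range(200)]

# Descriptive statistics: summarize each group concisely.
for name, values in (("A", group_a), ("B", group_b)):
    print(f"group {name}: mean={statistics.fmean(values):.2f}, "
          f"stdev={statistics.stdev(values):.2f}")

observed_diff = statistics.fmean(group_b) - statistics.fmean(group_a)


def permutation_p_value(a, b, observed, iterations=2_000):
    """Share of label shuffles whose mean difference is at least as extreme."""
    pooled = a + b
    count = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        diff = (statistics.fmean(pooled[len(a):])
                - statistics.fmean(pooled[:len(a)]))
        if abs(diff) >= abs(observed):
            count += 1
    return count / iterations


def bootstrap_interval(values, iterations=2_000, level=0.90):
    """Percentile bootstrap interval for the mean of one sample."""
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(iterations)
    )
    lower = means[int((1 - level) / 2 * iterations)]
    upper = means[int((1 + level) / 2 * iterations) - 1]
    return lower, upper


# Hypothesis testing: could a difference this large arise by chance alone?
p_value = permutation_p_value(group_a, group_b, observed_diff)
print(f"observed difference = {observed_diff:.2f}, p-value = {p_value:.4f}")

# Estimation: a plausible range for the mean of group A's population.
low, high = bootstrap_interval(group_a)
print(f"90% interval for mean of group A: ({low:.2f}, {high:.2f})")
```

A small p-value suggests the difference between groups is unlikely to be a fluke of sampling, and the interval shows how much uncertainty remains in an estimate drawn from a single sample.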