(open-source photo above by mattbuck, 2009, cited in References below)
What is the probability of a certain "state of the world" given a particular indicator observation? How can prior knowledge of the world be applied to present-moment probabilities? Bayes Theorem has been applied to various types of small data sets for years, and it applies to educational data as well. This presentation introduces what Bayes Theorem says and then demonstrates how it can be applied in RapidMiner Studio to understand probabilities and likelihoods. It also addresses what “naïve” means in a Naïve Bayes application…and what non-naïve Bayes computations may look like.
General Probability Representation
Reverend Thomas Bayes (1701?-1761) set out the conditional probability theorem in his "An Essay towards solving a Problem in the Doctrine of Chances," published posthumously in 1763. ["Doctrine of chances" apparently means "the theory of probability" ("An Essay towards...," Sept. 2, 2018).] His ideas were mathematized by Pierre-Simon Laplace.
[The probability of "A" given "B" equals (the probability of "B" given "A" multiplied by the probability of "A") divided by the probability of "B."]
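Written symbolically, the statement above takes the standard form below. The second expression, which expands P(B), assumes that "A" either holds or does not hold:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A)
```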
The "prior" is the P(A) or the initial "degree of belief" in "A" and its likely probability in the world (based on some probability distribution, whether uniform or not).
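The "posterior" is the degree of belief in "A," given the observation of "B." (The general assumption is that "A" and "B" have some association or interrelationship.) This is what the equation is solving for. It is proportional to the "prior" x the "likelihood," where the likelihood is P(B|A), the probability of observing "B" if "A" holds. That product, P(B|A) x P(A), is the joint probability of "A" and "B"; dividing it by P(B) gives the posterior.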
"Priors" modify the understanding of the world...and the ensuing "posterior probability". The presence of "B" indicates "the support B provides for A." ("Bayes' theorem," Apr. 29, 2019) Priors are the beliefs about a probability for seeing a certain state of the world ("A"), based on given knowledge and information at the time (including no information).
Prior distributions are probability distributions that describe the resting-state probabilities, before possible changes or perturbations (that is, before new observations are taken into account). A "conjugate prior" is a special case: a prior chosen so that the posterior belongs to the same family of distributions. A non-naive view is that variables in a space mutually affect each other; a naive view is that variables exist independent of each other. Oftentimes, variables are somewhat interdependent and sometimes even coupled/interlinked.
And in a visual "set" sense...
The sizes of the shapes are not representative of potential sizes of the respective sets. This visual does give the sense of joint probabilities, the existences of "A" and "B" both alone and in joint occurrence with the occurrence of the other.
A relatable walk-through story problem... (one use case)
What is the probability of a person graduating with a doctorate degree (A) given the observation of (B) where (B) is...
- signing up to a doctoral program (better than 0 but not by much)
- (paying all necessary fees)
- (making adjustments to personal lives)
- (not engaging in disqualifying moral and other behaviors)
- completing the required coursework
- taking the general exam and passing
- proposing their dissertation research and having it accepted by the committee
- conducting their doctoral research successfully
- writing up their doctoral research successfully
- defending their doctoral research successfully
- completing the necessary paperwork to graduate (close to 1 or 1)
Each additional observation updates the "prior" belief into a new "posterior" probability and may enhance the accuracy of our probabilistic assumptions.
Per the above, the population pipeline for the doctorate degree dwindles (fewer people make it to the next step, but each step is required), but the more a person advances, the greater the likelihood that they will achieve their doctorate (and the closer the individual is to the goal). The probability of success rises with each completed step, even in a complex, serendipitous, and chance-driven world...and even in a world populated with other human agents (with their own interests, some of which compete with the individual's...and some of which align with them).
Each step brings the individual closer, but nothing is done until it is done.
Overall, only about one in two students who start a doctoral program finishes the degree (a fairly stable statistic across doctoral degree programs in the U.S.), which suggests that doctoral programs are harder to complete than many others.
Light critique of the example: The above is a non-classic example...because the posterior (in the equation) is about the actual state of the world at that moment...not probability of an imminent or potentially imminent event. A more slice-in-time approach could be something like this: A person is a professor at a research institute (B), and the likelihood of her having a doctorate is (A).
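The slice-in-time version can be worked through numerically. The sketch below is only an illustration: the function name `posterior` and all three input probabilities are hypothetical placeholders, not real statistics about professors or doctorates.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) via Bayes Theorem for a binary state A."""
    # Evidence P(B), by the law of total probability.
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical inputs (assumptions for illustration only):
p_doctorate = 0.02             # P(A): prior share of adults with a doctorate
p_prof_given_doc = 0.10        # P(B|A): doctorate holders who are professors
p_prof_given_no_doc = 0.0005   # P(B|not A): non-holders who are professors

p = posterior(p_doctorate, p_prof_given_doc, p_prof_given_no_doc)
print(f"P(doctorate | professor) = {p:.3f}")
```

Even with a small prior, a strongly indicative observation "B" moves the posterior a long way, which is the point of using an indicator that is closely tied to "A."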
The story above also shows a little about a probability that depends on multiple events: event probability * event probability * event probability = a lower overall probability, because every step in the chain must succeed (each probability being conditional on the steps before it).
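As a rough sketch of that multiplication, the doctoral story can be treated as a chain of conditional pass rates. The step names are abbreviated from the list above, and every rate is a made-up value chosen only to show the arithmetic.

```python
from math import prod

# Hypothetical pass rates: each is P(step passed | all earlier steps passed).
steps = {
    "coursework": 0.90,
    "general exam": 0.85,
    "proposal accepted": 0.90,
    "research completed": 0.85,
    "defense": 0.95,
}

# Chain rule: P(all steps) is the product of the conditional probabilities.
p_all = prod(steps.values())
print(f"P(finish) at enrollment = {p_all:.3f}")

# Probability of finishing once the first k steps are already behind the student.
rates = list(steps.values())
for k, name in enumerate(steps):
    p_finish = prod(rates[k + 1:])  # product over the steps still remaining
    print(f"after {name!r}: P(finish) = {p_finish:.3f}")
```

The product shrinks as more required steps are chained together, and it grows back toward 1 as steps are completed and drop out of the product.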
Also, this gives a sense of another idea. In Bayes Theorem, the observable state "B" may indicate the hidden state "A". This means that latent or hidden issues may be estimated (with varying levels of confidence) given Bayes Theorem.
In many cases, Bayesian analysis is applied when data points are themselves sparse. The indicator variable (B) should be sufficiently evocative to link to (A).
What was revolutionary about Bayes Theorem were the following...
- the ability to inform an unknown parameter by using an indirect indicator (figuring the probability of "A" given the observation of "B")
- the harnessing of "priors" evidence to moor or ground a projected approximate probability in observable "known" probabilities and facts
- the definition and usage of prior beliefs to inform a probability and to incrementally change those prior beliefs to updated ones (a posteriori ones, reasoning from observations) with new information
- the ability to set a probability baseline for a phenomenon based in part on historical observations (to understand the past, present, and the near-future, at least)
- the mathematization of the expression of inferences and beliefs to enable additional precision and instantiation in processes and software
- the ability to bring in fresh and novel combinations of probabilities to understand a new challenge or context and to begin to lay down probabilities (and responses)
Sometimes, the focus of the Bayesian formula is expressed as "theta" (θ), as in calculating p(θ|B) = ... (or "the probability of theta given 'B' equals...").
The best "priors" are the most informative ones about a particular phenomenon or construct.
Beliefs should be updated with new information. With each iteration, there should be increasing accuracy of conditional probability and "prediction."
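Iterative updating can be sketched as a loop in which each posterior becomes the prior for the next observation. The scenario (a learner's mastery of a skill), the function name `update`, and all likelihood values are assumptions made up for this illustration.

```python
def update(prior, p_obs_given_a, p_obs_given_not_a):
    """One Bayesian update for a binary state A; returns P(A | observation)."""
    evidence = p_obs_given_a * prior + p_obs_given_not_a * (1.0 - prior)
    return p_obs_given_a * prior / evidence


# Hypothetical state A = "the learner has mastered the skill".
belief = 0.5  # starting prior with no information either way

# Each pair is (P(observation | A), P(observation | not A)), hypothetical values
# for: correct answer, correct answer, incorrect answer, correct answer.
observations = [(0.9, 0.3), (0.9, 0.3), (0.1, 0.7), (0.9, 0.3)]

history = [belief]
for p_if_a, p_if_not_a in observations:
    # The posterior from the previous step serves as the prior for this one.
    belief = update(belief, p_if_a, p_if_not_a)
    history.append(belief)

print([round(b, 3) for b in history])
```

Note that the belief does not have to rise at every step: an observation that is more likely under "not A" pulls it back down before later evidence raises it again.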
Some Bayesian models explore additional dependencies, the probability of A given B, and other intermediating variables or factors upon which that relationship depends (or is affected).
Exploratory vs. Confirmatory Bayesian Analyses
An exploratory Bayesian analysis involves running statistical tests over empirical data to discover associational (or even causative) relationships between related variables in a particular space.
A confirmatory Bayesian analysis involves running statistical tests to test a hypothesis or theory, to either support or undermine the main assertions of the hypothesis or theory. For example, a relationship may be theorized between two variables, and the analysis may test whether that relationship exists or not (based on empirical data).
Assertability of Claims or Understandings based on Bayes Theorem
A reality is not always accurately predictable with probability. It is important not to overclaim.
The conditionals asserted in an inference may not be the appropriate ones.
The world is dynamic and fast-changing, so "posteriors" may date out quickly, even if they were accurate before.
Assertions from Bayesian analysis are always provisional and lightly held. The more that is at stake, the more exploration there should be using a variety of methods.
"Priors" and such beliefs may have no basis in ground truth and fact. Domain knowledge is important here.
Real-world observations may help validate prior conditional probabilities and predictions.
Data Applications of Bayes Theorem
Bayesian Data Analysis
"Bayesian data analysis has two foundational ideas. The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities, over which we allocate credibility, are parameter values in meaningful mathematical models" (Kruschke, 2011, 2015, p. 15)
The idea is to define the full set of possibilities, remove impossibilities, and what is left is analyzed based on probabilities, the remaining combinations of which sum to 100%. (Arthur Conan Doyle: "Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.") There may be outside-set possibilities, too, so the point is not to be blind-sided.
Bayes Theorem has been codified into various machine learning applications to solve so-called "classification" problems. It can be used to analyze a dataset of records (row data) and "predict" what category each record fits in based on the presence/absence or intensity of particular variables.
A "naive" Bayesian analysis is a calculation which assumes that the respective variables ("predictors" that may / may not contribute to a particular result) being analyzed are independent of each other (i.e. that there is no interdependence, no covariance, or that one variable does not somehow have an influence on the others). This assumption enables the calculation of the probabilities of attributes given the class (resulting in frequency distributions of attributes). This assumption is naive because co-occurring variables often do covary and do not randomly appear together.
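In symbols, the independence assumption lets the class probability factor into one term per attribute. The notation here is an assumption introduced for illustration: C stands for the class and x_1, ..., x_n for the attribute values of one record.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

A non-naive model would instead need the joint term P(x_1, ..., x_n | C), which requires far more data to estimate.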
A Gaussian Naive Bayes model assumes that the values of each numeric attribute, within each class, follow a normal (bell-shaped) distribution (even though that may not be true).
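A minimal sketch of the Gaussian assumption, using a single numeric attribute for brevity: each class gets its own mean and standard deviation, and a new value is scored by the normal density under each class. The data (weekly study hours by outcome) and all names are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


# Hypothetical training data: weekly study hours, grouped by outcome.
hours = {"pass": [10, 12, 9, 11, 13], "fail": [4, 6, 5, 3, 7]}

n_total = sum(len(values) for values in hours.values())
priors = {label: len(values) / n_total for label, values in hours.items()}
params = {label: (mean(values), pstdev(values)) for label, values in hours.items()}


def classify(x):
    """Return normalized P(class | x) under the per-class Gaussian assumption."""
    scores = {
        label: priors[label] * gaussian_pdf(x, *params[label]) for label in hours
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


result = classify(9)
print({label: round(prob, 3) for label, prob in result.items()})
```

If the real attribute is skewed or multi-modal, these densities will be off, which is one reason a kernel-based variant is sometimes tried instead.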
The "naivete" refers to assumptions made about the data that may not be real-world based on the respective algorithms. Researchers are supposed to qualify their assertions based on the limits of the analytics.
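For small-scale data, a Laplace correction (add-one smoothing, which adds 1 to every count) is often applied to balance against the undue influence of "0"-values and null values in some datasets (for example, an attribute value never observed with a class, which would otherwise force the whole product to zero).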
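The add-one correction can be sketched in a few lines. The function name `smoothed_probs`, the `alpha` parameter, and the attendance data are illustrative assumptions, not part of any particular tool.

```python
from collections import Counter


def smoothed_probs(values, categories, alpha=1):
    """Estimate P(value | class) with add-alpha (Laplace) smoothing."""
    counts = Counter(values)
    # Every category receives alpha extra pseudo-observations.
    total = len(values) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}


# Hypothetical: attendance levels observed among learners who passed.
categories = ["low", "medium", "high"]
seen_in_pass_class = ["high", "high", "medium", "high"]

raw = smoothed_probs(seen_in_pass_class, categories, alpha=0)
smoothed = smoothed_probs(seen_in_pass_class, categories, alpha=1)

print("raw:     ", raw)
print("smoothed:", smoothed)
```

Without smoothing, "low" attendance has probability 0 for the "pass" class, so any record with that value could never be classified as "pass" no matter what its other attributes say.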
What Types of Data Work Best?
In theory and practice, the "best" data would be the following:
- data that includes a comprehensive set of variables about a particular phenomenon (the range of possibilities, to saturation)
- accurate data
- timely data
- data that can be qualitative (nominal) or quantitative (numerical), or a mix
- a larger set over a smaller one
- data that provides insights about the phenomenon or construct under study
Applied to Education Data
In terms of education data, Bayes Theorem may be used to infer classifications like the following:
- performance outcomes for learners [Given the signal (B), what is the likely performance outcome (A) for the target learner?]
- decision outcomes for learners [Given the signal (B), what is the likely decision (A) of the target learner?]
- knowledge of learners [Given the signal (B), what is the likely state of knowledge, skills, and abilities / attitudes or "KSAs" of the target learner (A)?]
Several Data Analytics Sequences on RapidMiner Studio
The demoed sequences use the open-source datasets built into the tool.
General Sequences: The general sequences go like this:
Outside the software, create a research design. Conduct the research, and collect data. Clean the data. Assuming you want to use various models to predict outcomes or classifications...
- In RapidMiner Studio, import the data.
- Define attributes.
- Set roles. Define the selected data column for predicted results.
- Apply Naive Bayes (or Naive Bayes Kernel).
- Apply cross-validation.
- Split the data between a training set (70% usually) and a test or validation or holdout set (30% usually).
- Set the sub-sequence for cross-validation. (Naive Bayes on the training data, Apply Model and Performance on the test data)
- Check the parameters.
- Record the parameters.
- Run the sequence.
- Check for findings.
- Check for accuracy of Naive Bayes (or Naive Bayes Kernel) model performance.
Report on the findings. Add Discussion of the findings. Present.
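For readers who want to see the logic behind the steps above, here is a rough plain-Python analogue. It is not RapidMiner code (RapidMiner Studio is operated through its visual process designer), and it covers only a single 70/30 holdout split rather than full cross-validation. The toy data, attribute names, and function names are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def train(rows, labels, alpha=1):
    """Fit a categorical Naive Bayes model with add-alpha smoothing."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (column index, class) -> value counts
    vocab = defaultdict(set)             # column index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab, alpha


def predict(model, row):
    """Return the class with the highest (unnormalized) posterior score."""
    class_counts, value_counts, vocab, alpha = model
    n = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / n  # prior P(class)
        for i, value in enumerate(row):
            # Smoothed P(value | class); attributes are treated as independent.
            score *= (value_counts[(i, label)][value] + alpha) / (
                count + alpha * len(vocab[i])
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# "Import the data": hypothetical synthetic records, not real learner data.
random.seed(0)
data = []
for _ in range(200):
    attendance = random.choice(["low", "high"])
    homework = random.choice(["late", "on_time"])
    p_pass = 0.2 + 0.35 * (attendance == "high") + 0.35 * (homework == "on_time")
    outcome = "pass" if random.random() < p_pass else "fail"
    data.append(((attendance, homework), outcome))  # outcome plays the label role

# "Split the data": 70% training, 30% holdout.
random.shuffle(data)
cut = int(0.7 * len(data))
train_set, test_set = data[:cut], data[cut:]

# "Apply Naive Bayes" on the training data, then score the holdout data.
model = train([row for row, _ in train_set], [label for _, label in train_set])
hits = sum(predict(model, row) == label for row, label in test_set)
accuracy = hits / len(test_set)
print(f"holdout accuracy: {accuracy:.2f}")
```

The holdout accuracy plays the same role as the performance check at the end of the sequence: it estimates how well the model would classify records it has not seen.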
There are multiple right ways to set this up, and the Wisdom of Crowds feature can provide context-sensitive suggestions during the process.
The built-in Help is (supportively) directive at each step: wrong moves are flagged, and suggestions are made automatically.
An Interactive Slideshow
The following interactive slideshow allows a walk-through of the process and shows some of the resulting data visualization screens and a model validation assessment. It is based on the open-source Titanic dataset. Each slide can be viewed full-screen; use the forward and backward arrows to navigate.