algorithm, machine-learning, deep-learning, probability, probability-distribution

How does the joint probability distribution help to generate things?


I am trying to understand the difference between discriminative models and generative models. One of the helpful answers on Stack Overflow is this one: What is the difference between a generative and a discriminative algorithm?

In the top answer (see the link above), there is a simple example with only four data points of the form (x,y). The author of that answer says the following: "The distribution p(y|x) is the natural distribution for classifying a given example x into a class y, which is why algorithms that model this directly are called discriminative algorithms. Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes' rule and then used for classification. However, the distribution p(x,y) can also be used for other purposes. For example, you could use p(x,y) to generate likely (x,y) pairs."

I don't quite understand how one could use p(x,y) to generate likely (x,y) pairs. I would be interested to see an example of an (x,y) pair generated using the joint probability distribution p(x,y). Also, why can the conditional probability distribution p(y|x) not be used to generate new pairs?


Solution

  • Here is a simple but real example to illustrate the concepts.

    Brown eyes are dominant over blue eyes. My grandmother had blue eyes. Her husband came from a family with only brown eyes as far back as you go. My father, likewise. I had two children with a blue-eyed woman. Let's let x and y be the eye colors of those children, with the eldest child being y and the younger being x. And to discuss the underlying genetics, let's use B for the gene for brown eyes, and b for blue.

    First let's figure out p(x, y). My mother got one gene from each of her parents (a b from her blue-eyed mother, and a B from her father, whose all-brown family line makes him BB), and so must be bB. She had brown eyes. My father's genes were BB; he also had brown eyes. Depending on which gene my mother passed to me, I have even odds of being bB or BB. Whichever one I actually have, I have brown eyes.

    I then had children with a blue-eyed woman with genes bb.

    IF my genes are BB, then my children are both bB and will have brown eyes.

    IF my genes are bB, then each child has even odds of bb and bB, independently of each other. And therefore my children could come out (blue, blue), (blue, brown), (brown, blue) and (brown, brown) with equal likelihood.

    When you add it up, here are the odds we get:

    (blue, blue): 1/8
    (brown, blue): 1/8
    (blue, brown): 1/8
    (brown, brown): 5/8
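
    If you want to see that bookkeeping in code, here is a minimal Python sketch of the same case analysis, taking the pairs in the table to be listed as (x, y); the genotype strings and variable names are just illustrative:

        from fractions import Fraction
        from itertools import product

        half = Fraction(1, 2)

        # My genotype is BB or bB with probability 1/2 each, depending on what my mother gave me.
        # The children's mother is bb, so each child gets a b from her plus one of my two genes,
        # chosen with equal probability and independently for each child.
        joint = {}  # (x, y) = (younger child's eye color, eldest child's eye color) -> probability
        for my_genes, p_genes in [("BB", half), ("bB", half)]:
            for gene_to_younger, gene_to_eldest in product(my_genes, repeat=2):
                x = "brown" if gene_to_younger == "B" else "blue"  # bb is blue, bB is brown
                y = "brown" if gene_to_eldest == "B" else "blue"
                joint[(x, y)] = joint.get((x, y), 0) + p_genes * half * half

        for pair, p in sorted(joint.items()):
            print(pair, p)
        # ('blue', 'blue') 1/8
        # ('blue', 'brown') 1/8
        # ('brown', 'blue') 1/8
        # ('brown', 'brown') 5/8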
    

    That is p(x, y). Let's show how to generate a pair.

    First work out the cumulative probabilities. In that order they are 1/8, 1/4, 3/8, 1. Now just roll a random number; I just got 0.7284333516674881. Comparing it to the cumulative probabilities, it falls in the (brown, brown) slot, so that's the pair I generate. (Funnily enough, those are also the real eye colors of my children! Not a giant coincidence, but still...)

    Which random number corresponds to which output would change if I listed the probabilities in a different order, but the outputs will come up with the right frequencies no matter how I do it.
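
    To make the generation step concrete, here is a minimal Python sketch of that cumulative-probability lookup (the helper name sample_pair is just illustrative):

        import random

        # Joint distribution p(x, y) from the table above, in the same order.
        joint = [
            (("blue", "blue"), 1 / 8),
            (("brown", "blue"), 1 / 8),
            (("blue", "brown"), 1 / 8),
            (("brown", "brown"), 5 / 8),
        ]

        def sample_pair(rng=random):
            """Draw one (x, y) pair by comparing a uniform draw to the cumulative probabilities."""
            u = rng.random()      # uniform on [0, 1), e.g. 0.7284333516674881
            cumulative = 0.0
            for pair, p in joint:
                cumulative += p   # running totals: 1/8, 1/4, 3/8, 1
                if u < cumulative:
                    return pair
            return joint[-1][0]   # guard against floating-point round-off

        # Repeated calls reproduce the joint frequencies: about 5/8 of the draws are (brown, brown).
        print(sample_pair())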

    Now let's work out p(x | y). If you use Bayes' formula you can verify:

    (blue|blue): 1/2
    (brown|blue): 1/2
    (blue|brown): 1/6
    (brown|brown): 5/6
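
    Those numbers follow from the joint table by conditioning, p(x | y) = p(x, y) / p(y), where the marginal p(y) comes from summing the joint over x. A minimal sketch of that division, reusing the joint from above:

        from fractions import Fraction

        # Joint p(x, y) from the first table; x is the younger child's color, y the eldest's.
        joint = {
            ("blue", "blue"): Fraction(1, 8),
            ("brown", "blue"): Fraction(1, 8),
            ("blue", "brown"): Fraction(1, 8),
            ("brown", "brown"): Fraction(5, 8),
        }

        # Marginal p(y): sum the joint over x.  Gives p(blue) = 1/4, p(brown) = 3/4.
        p_y = {}
        for (x, y), p in joint.items():
            p_y[y] = p_y.get(y, 0) + p

        # Conditional p(x | y) = p(x, y) / p(y).
        for (x, y), p in joint.items():
            print(f"p({x} | {y}) = {p / p_y[y]}")
        # p(blue | blue) = 1/2
        # p(brown | blue) = 1/2
        # p(blue | brown) = 1/6
        # p(brown | brown) = 5/6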
    

    From this, we can figure out what color the second child's eyes are likely to be, given the color of the first child's eyes. This is exactly what we need for a discriminative algorithm.

    But from these numbers alone we have absolutely no way to tell whether, a priori, the odds of the first child having brown eyes are 1/2 (the naive guess) or 3/4 (the real answer). If they were 1/2, then our first table would have been:

    (blue, blue): 1/4
    (brown, blue): 1/4
    (blue, brown): 1/12
    (brown, brown): 5/12
    

    Obviously this would generate a very different distribution from the real one, yet it yields exactly the same p(x|y). So the generative algorithm needs more information than the discriminative one does. Specifically, we need to know p(y).
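
    To see that dependence on p(y) explicitly, here is a minimal sketch that rebuilds the joint as p(x, y) = p(x | y) * p(y): the true marginal p(y = brown) = 3/4 recovers the first table, while the naive 50/50 guess produces the table just above (the helper name joint_from is just illustrative).

        from fractions import Fraction

        # Conditional p(x | y) from the second table.
        p_x_given_y = {
            ("blue", "blue"): Fraction(1, 2),
            ("brown", "blue"): Fraction(1, 2),
            ("blue", "brown"): Fraction(1, 6),
            ("brown", "brown"): Fraction(5, 6),
        }

        def joint_from(p_y):
            """Rebuild p(x, y) = p(x | y) * p(y) for a given marginal over the first child's color."""
            return {(x, y): p * p_y[y] for (x, y), p in p_x_given_y.items()}

        real = joint_from({"blue": Fraction(1, 4), "brown": Fraction(3, 4)})   # true marginal
        naive = joint_from({"blue": Fraction(1, 2), "brown": Fraction(1, 2)})  # naive 50/50 guess

        for pair, p in sorted(real.items()):
            print("real ", pair, p)   # recovers the first table: 5/8 for (brown, brown), 1/8 for the rest
        for pair, p in sorted(naive.items()):
            print("naive", pair, p)   # gives the alternative table above (1/4, 1/4, 1/12, 5/12)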

    Does that clarify things for you?