Brian ([personal profile] memnus) wrote, 2007-10-09 07:16 pm

For the probabilists in the audience

I'd like your thoughts on a problem. (Here, all "randomly chosen" numbers are drawn from the same continuous real distribution.) Suppose you have a set S of n randomly chosen numbers, and a set D, initially empty. Begin by removing the largest number from S, placing it in D, and adding another random number to S. Now take the largest number from S that is smaller than the last number you chose, place it in D, and again add a random number to S. Repeat until you've taken the smallest number from S.

- What is the distribution of numbers in D?
- What is the expected value for the size of D?
- Bonus question (I know the answer to this one!): Why is Brian thinking about this problem?

I'm about to start in on a Monte Carlo estimator for the problem, but I'd like some theoretical input.
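
A minimal sketch of the estimator I have in mind, assuming uniform draws on [0, 1) - which loses no generality here, since only the relative order of the numbers matters:

    import random

    def run_once(n):
        """One trial of the process above: returns the size of D."""
        s = [random.random() for _ in range(n)]
        d = []
        pick = max(s)                         # begin with the largest number in S
        while True:
            took_smallest = (pick == min(s))
            s.remove(pick)
            d.append(pick)
            if took_smallest:                 # we've taken the smallest number: done
                return len(d)
            s.append(random.random())         # replace the removed number
            # next: the largest number in S smaller than the last one chosen
            pick = max(x for x in s if x < d[-1])

    n, trials = 5, 100_000
    print(sum(run_once(n) for _ in range(trials)) / trials)   # estimate of E[|D|]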


[identity profile] archonsengine.livejournal.com 2007-10-10 02:27 am (UTC)
Why is Brian thinking about this problem?

Oooh, I know! It's 'cuz he's a nerd, right? =o]


More seriously, my gut tells me that the expected value for the size of D is probably infinite. I was never all that good at this sort of math, and I'm also ridiculously out of practice - but I'm still going with "infinite".

[identity profile] memnus.livejournal.com 2007-10-10 02:39 am (UTC)
Oh come now, that's like saying gravity is why the plate got broken - it hardly explains who knocked it off the table. I have a specific instance of this problem.

I don't believe it's infinite. The number you pick is guaranteed smaller with each iteration, and once you've taken the smallest one you're done.

[identity profile] archonsengine.livejournal.com 2007-10-10 02:44 am (UTC)
The number you pick is guaranteed smaller with each iteration, and once you get the smallest one you're done.

Yes, but you keep adding numbers to the set, and it's a continuous real distribution, so infinite isn't *that* unreasonable.

I never said I was right, or even that I'd bet on it - that was just my first reaction. I'm sure the hordes of actual mathematicians will pounce on me with their mockery masked by logic any minute now. =o]

almost...

[identity profile] tortoise.livejournal.com 2007-10-10 04:46 am (UTC)
Unless I'm confused about probability (which is very possible), I have most of a combinatorial solution to this problem. The short answer is: don't bother, because it's bloody awful (except maybe for the purpose of proving Dan wrong with some slightly-less-awful estimation, but I'm not even quite to the point where that's possible yet).

The long answer follows.

First, the thing I might be confused about: you can simulate choosing $n$ numbers uniformly from an interval $[0,\alpha]$ by first choosing the largest of them (say $y$) with a pdf proportional to $x^{n-1}$, and then picking the rest of them uniformly from the interval $[0,y]$. If this is false, my entire argument is hosed, and it's the kind of thing I get confused about easily, but I'm pretty sure it's true.
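
This is easy to spot-check numerically; a quick sketch for the $[0,1]$ case (the cdf of the largest of $n$ uniforms is $y^n$, so it can be sampled directly as $u^{1/n}$):

    import random

    # Two ways to sample the largest of n uniforms on [0,1]: take the max
    # directly, or invert the cdf y^n. Both empirical means should land
    # near n/(n+1).
    n, trials = 5, 100_000
    via_max = sum(max(random.random() for _ in range(n)) for _ in range(trials)) / trials
    via_cdf = sum(random.random() ** (1 / n) for _ in range(trials)) / trials
    print(via_max, via_cdf)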

Now, if the number you're adding to $S$ is ever larger than something you removed, you can totally ignore it, thereby reducing the problem to something like its instance with $n-1$ numbers. (I say "something like" because you're more likely to generate something that gets thrown away, since you're picking from a larger interval. This is a pain to fix, so I'll just worry about how long it takes to get rid of the first guy.)

This leads to the following way of rephrasing the problem:

1. Generate the largest number $y_0$ of $n$ things in $[0,1]$. Put it in $D$.
2. With probability $1-y_0$, be done (i.e., reduced to $n-1$ things). Otherwise, generate the largest number $y_1$ of $n$ things in $[0,y_0]$.
3. With probability $1-y_1$, be done. Otherwise, generate the largest number $y_2$ of $n$ things in $[0,y_1]$.
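
In code, the rephrased chain looks like this - a quick sketch, using the fact that the largest of $n$ draws from $[0,y]$ is $y$ times the largest of $n$ draws from $[0,1]$:

    import random

    def steps_to_lose_one(n):
        """Run the rephrased chain; count the y_k generated before 'done'."""
        y = max(random.random() for _ in range(n))        # y_0
        steps = 1
        while random.random() < y:                        # continue w.p. y_k
            y *= max(random.random() for _ in range(n))   # y_{k+1} in [0, y_k]
            steps += 1
        return steps

    n, trials = 2, 100_000
    print(sum(steps_to_lose_one(n) for _ in range(trials)) / trials)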

So we just need to evaluate the expected value $E_k$ of $1-y_k$ for each $k$. This is given by
$$
E_k=\frac{\int_{y_0=0}^1 \int_{y_1=0}^{y_0} \dots \int_{y_k=0}^{y_{k-1}} (1-y_k) y_k^{n-1}y_{k-1}^{n-1} \dots y_1^{n-1}y_0^{n-1} \, dy_k \, dy_{k-1} \dots dy_1 \, dy_0}{\int_{y_0=0}^1 \int_{y_1=0}^{y_0} \dots \int_{y_k=0}^{y_{k-1}} y_k^{n-1}y_{k-1}^{n-1} \dots y_1^{n-1}y_0^{n-1} \, dy_k \, dy_{k-1} \dots dy_1 \, dy_0}
$$
which ends up giving $1-\prod_{i=1}^{k} \frac{in}{in+1}$. So, if we've reached the $k^{\rm th}$ step, we'll continue to the $(k+1)^{\rm st}$ step with probability $\prod_{i=1}^k \frac{in}{in+1}$; you can use this to find the expected time it takes to eliminate a number from consideration.

In particular, the odds of continuing are always smaller than $\frac{n}{n+1}$, so the expected number of steps before we lose a number is definitely less than $n+1$.

(Unfortunately, once we lose a number this analysis doesn't quite apply. Intuitively, though, it should not-apply by some constant factor, and so in particular the number of steps required to finish should be finite. Making this rigorous is left as an exercise to the reader.)
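
Taking the product formula at face value, the expected step count it implies is easy to tabulate - a quick sketch (names are just for illustration), which could be checked against the chain simulation above:

    from math import prod

    def continue_prob(n, k):
        """Claimed probability of continuing from step k to step k+1."""
        return prod(i * n / (i * n + 1) for i in range(1, k + 1))

    def expected_steps(n, tol=1e-12):
        """Sum P(reach step k) over k, under the product formula."""
        total, reach, k = 0.0, 1.0, 1
        while reach > tol:
            total += reach                 # we perform step k
            reach *= continue_prob(n, k)   # P(reach step k+1)
            k += 1
        return total

    for n in (1, 2, 3, 5):
        print(n, expected_steps(n))        # each bounded by n + 1 as argued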

Re: almost...

[identity profile] tortoise.livejournal.com 2007-10-10 05:03 am (UTC)
Two things:

1) Due to an obiwan (off-by-one) error, that bound should be $n+2$ (since we don't start culling until step 2).
2) Numerically, the expected amount of time it takes to go from 2 relevant things down to 1 appears to be about 3.3 steps (3.279, counting only runs of at most 9 steps). This is probably a good way to sanity-test my analysis against your Monte Carlo guy.

[identity profile] camlost.livejournal.com 2007-10-10 04:50 am (UTC)

I don't believe Dan. Suppose we set n=1: then it's pretty clear that we expect to collect the first element, there's a 50% chance (I'm guessing) of replacing it with something smaller, and continuing this argument ad infinitum, we expect to collect 2 elements on average.

So, the probability of adding an additional useful element is x (assuming we populate S with elements between 0 and 1). Unfortunately, I don't know how fast we go through x, so I can't really do the integral. I think you expect to go through x at a rate of around x * (S-1)/S. This should allow for a diffeq to be solved or somesuch, but this is all just hacked-together thoughts near midnight.

[identity profile] tortoise.livejournal.com 2007-10-10 05:05 am (UTC)
Unless I'm confused, if $n=1$, you take the largest element of $S$ and it is the smallest element of $S$, so you're immediately done.

[identity profile] zane314.livejournal.com 2007-10-10 08:03 am (UTC)
With the problem as defined, you're right (immediately done). With the version that "stylistically feels right" - where you stop only when there are no elements in S smaller than the item you just took, so the check happens after the replacement - the other approach is appropriate.
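
For what it's worth, that alternative rule is a small change to a simulator - a quick sketch, again assuming uniform draws on $[0,1)$:

    import random

    def run_once_alt(n):
        """Alternative rule: stop only when, after the replacement draw,
        no element of S is smaller than the item just taken."""
        s = [random.random() for _ in range(n)]
        d = []
        pick = max(s)
        while True:
            s.remove(pick)
            d.append(pick)
            s.append(random.random())      # replace first, then check
            smaller = [x for x in s if x < pick]
            if not smaller:
                return len(d)
            pick = max(smaller)

    # Under this rule, n = 1 no longer stops immediately; empirically the
    # mean size of D comes out near e - 1.
    print(sum(run_once_alt(1) for _ in range(100_000)) / 100_000)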