Apparently, this is considered a “classic” and “hard” interview problem. I don’t really care about that. But I can say I’ve definitely been asked this question in interviews (yes, plural).
Outside of that context, I actually find the problem itself very interesting. So let’s explore it.
Task
The task definition is straightforward. We have to design a data structure that supports two operations:
Add a new [integer] number, and
Return the median of all numbers added so far.
In other words, the task can be summarized to implementing the following:
class MedianStream {
public:
// Constructor
MedianStream();
// Adds integer value to the stream
void addNumber(int x);
// Returns median of all numbers seen so far
double getMedian();
};
For fun’s sake, let’s assume that every call to addNumber
is immediately followed by a call to getMedian
. This makes the problem get closer to its worst-case, and also makes our runtime analysis a bit cleaner.
Example
MedianStream* medianStream = new MedianStream();
// List is empty
medianStream->addNumber(10);
// Sorted list is now [10]
medianStream->getMedian();
// Returns 10
medianStream->addNumber(20);
// Sorted list is now [10, 20]
medianStream->getMedian()
// Returns 15
medianStream->addNumber(0);
// Sorted list is now [0, 10, 20];
medianStream->getMedian()
// Returns 10
I’ll talk about four different approaches to this problem here. The first two are unequally bad. The last two are “equally” good. If you face this problem in an interview (as I have multiple times before), I strongly advise to go with the better ones.
Lazy brute force
I don’t like calling approaches lazy, but this one actually deserves it. It is a lazy brute force because it consists of the laziest (in terms of brain-power) implementations of both functions that interest the problem.
void addNumber(int x)
: Simply append the valuex
to some listvalues
.double getMedian()
: Find and return the median ofvalues
. How? Sort the whole list and get the median!
class MedianStream {
public:
// Constructor
MedianStream() {
numValues = 0;
}
// Adds integer value to the stream
void addNumber(int x) {
numValues++;
values.push_back(x);
}
// Returns median of all numbers seen so far
// Assuming this function is not called on empty lists
double getMedian() {
// Sort values
sort(values.begin(), values.end());
// Calculate and return the median
if (numValues % 2 == 0) {
return (values[(numValues-1)/2] +
values[numValues/2])/2.0;
} else {
return values[(numValues-1)/2];
}
}
private:
int numValues;
vector<int> values;
};
Runtime
The runtime of addNumber
is great. We’re just adding a number to the end of a list, so it’s O(1)
.
The runtime of getMedian
sucks. That little call to the sort
function is the bottleneck. Remember that sorting a list of N
numbers costs O(N log N)
. And that’s per call of getMedian
!
Hence, calling addNumber
followed by getMedian
N
times will lead to a combined runtime of O(N * 1 + N * N log N)
, totaling at an unfortunate O(N^2 log N)
.
Not a good way to pass an interview.
Brute force (but not lazy)
The great Yogi Berra said “you can observe a lot by just watching.” Indeed, if you look into what the current implementation of getMedian
is doing, you’ll observe that we only care about one or two values of the list.
With that in mind, sorting the entire list before accessing one or two positions is not only overkill, but it’s also costly. If all we want to do is find the k
-th smallest element of a list, the Quickselect algorithm will give you that answer in linear time. In C++, you could call the function nth_element.
I’ll not go into detail for the idea and implementation of those algorithms since they are not the main part of this note. However, I really recommend checking them out. They might come in handy (and they certainly have to me.)
Runtime
Nothing changed in addNumber
from the previous implementation. It still sits at a wonderful O(1)
.
But now the runtime o getMedian
sucks less! The one or two runs of the Quickselect algorithm will take linear time (O(N)
), which is an improvement from the previous O(N log N)
. Again, that’s O(N)
per call of getMedian
.
In the end, N
calls of each function (alternating) will total O(N*1 + N*N)
, or simply O(N^2)
.
We can do better.
This is a picture of my dog, Fox, to serve as break before heading to the next ideas.
The actual solution
Let’s make Yogi Berra proud and watch a little more.
Every time we query for the one or two elements needed to find the median, we always query for the “same” position in the list. That is, we are always trying to figure out the middle of the list (we are interested in the median, after all.)
Also, first or second-year CS students learn about basic data structures that let us do insertions while preserving some type of order. The most popular ones are ordered sets and heaps. Those structures can only efficiently give you the smallest/largest element in a list. If you try to find out the middle element of a set, you’re gonna have a bad [run]time. Let’s assume then we can only know the smallest or largest element of a list, and solve the problem with that tool.
Finding the middle
Forget about the problem setup for a second.
Say we have a list S
. Let’s split the list in two halves, and put the smaller elements on a list L
and the largest at R
. If S
originally had N
elements and N
is even, L
and R
each have N/2
. If N
is odd, let’s say L
has the smallest (N+1)/2
and R
has the largest (N-1)/2
elements.
How do we find the median of S?
If N
is odd, the median of S
is the smallest (N+1)/2
-th element (1-indexed). This is the largest element of L
!
If N
is even, the median will be the average of the smallest N/2
-th and (N+2)/2
-th elements (1-indexed again). In other words, this will be the average of the largest element of L
and the smallest element of R
.
Thus, what we need to efficiently find a median is for L
to provide us with its largest element quickly, and for R
to give us its smallest element. We can simply make L
be a max-heap and R
be a min-heap.
Updating L
and R
Since L
and R
were defined as the left and right halves of the original list, there are two properties in them we actually care about:
No element in
L
is greater than an element inR
, and0 <= size(L) - size(R) <= 1
. This is equivalent of the previous definition we used of allocating the elements as a function ofN
. However, this doesn’t make us depend onN
, which is nice.
Assume those two properties are currently being respected. When a new element X
arrives, we want to add it to either L
or R
, and make sure those properties continue being respected. This is an overview of the algorithm:
If
X > max(L)
, add it toR
. Otherwise, add it toL
.There’s a corner case to watch out here: if
L
is empty (as it will be in the beginning of the algorithm),max(L)
would be undefined. We handle this case by simply addingX
toL
when it is empty.
If
size(R) > size(L)
, addmin(R)
toL
, and then remove it fromR
.If
size(L) > size(R) + 1
, addmax(L)
toR
and remove it fromL
.
The first bullet point guarantees property 1 is respected. The second and third bullet points are updating the lists in order to enforce property 2 (without breaking 1).
Note that, just as we expected, we only need to interact with the largest element in L
and the smallest element in R
. This makes our use of min and max-heaps appropriate.
Ok, where’s the code?
Here:
class MedianStream {
public:
// Constructor
MedianStream() {
numValues = 0;
}
// Adds integer value to the stream
void addNumber(int x) {
numValues++;
// The algorithm described above
if (!leftHalf.empty() && x > leftHalf.top()) {
rightHalf.push(x);
} else {
leftHalf.push(x);
}
if (rightHalf.size() > leftHalf.size()) {
leftHalf.push(rightHalf.top());
rightHalf.pop();
}
if (leftHalf.size() > rightHalf.size() + 1) {
rightHalf.push(leftHalf.top());
leftHalf.pop();
}
}
// Returns median of all numbers seen so far
// Assuming this function is not called on empty lists
double getMedian() {
// Handling cases for odd and even numValues
if (numValues % 2 == 0) {
return (leftHalf.top() + rightHalf.top())/2.0;
} else {
return leftHalf.top();
}
}
private:
int numValues;
// This is a max-heap in C++:
priority_queue<int> leftHalf;
// This is a min-heap in C++:
priority_queue<int, vector<int>, greater<int>> rightHalf;
};
I heavily used the priority_queue structure from C++. Most languages have some default min/max-heap. Coding your own heap is not hard, but why would you do that?
Runtime
It may feel like we took a step backwards with addNumber
. Before, without any thinking whatsoever, we had an O(1)
implementation for it. Now, we have these [constant number] of heap pushes and pops. Remember that each of those operations takes O(log N)
time. Thus, our new version of addNumber costs us O(log N)
.
We improved getMedian
by as much as we possibly could. Its very short implementation is not deceiving. All we do is query for one or two roots of priority queues. Those happen in constant time. Then, this amazing version of getMedian
costs… O(1)
.
After N
operations of each of them, we’ll settle at O(N log N + N * 1)
, or simply O(N log N)
.
This is how you pass an interview.
Less code and less elegance
Don’t get me wrong, I love elegant solutions like the one above. It’s nice solving “complex” problems using only basic tools. Any introductory course to data structures will teach heaps, and that is literally all you need to solve this “hard” problem.
But I’m also a competitive programmer. When push comes to shove, I’d gladly throw away all my finesse if can save some lines or thinking time. Especially if the runtime is unaffected in the process.
Basically, what I’m saying is that this problem could be solved by importing a data structure. Yup. Take a look at the Wikipedia page on the Ordered Statistic Tree. It supports:
Insertions of numbers in
O(log N)
.Finding the
k
-th smallest number inO(log N)
time.
It even feels like it was tailor-made for this problem. In fact, the ordered statistic tree supports even more operations than those two.
Another data structure that would immediately solve this problem (by supporting the two previous operations in O(log N)
time) is the Treap. In my previous experience, it is much easier to understand a Treap than to actually implement one. Be my guest and have fun coding one.
Overall
This problem can be solved in O(N log N)
time for N
operations of each kind (alternating). There is a solution I find beautiful, and one that is less so. There are also brute-force solutions, but they are for losers. You’re not one.
I believe beauty in problem-solving has a time and a place. This blog is the place for it. I love the fact that this problem can be solved with primitive tools, rather than using complicated ones. When it’s your turn to solve this problem, feel free to choose the solution you want. I’m not your dad.
Just make sure it is fast.
amazing post dude. thanks for sharing :D pls solve all my future problems