Classic Problem: Streaming Median

For some reason, this is considered hard on LeetCode

Aug 18, 2020

Apparently, this is considered a “classic” and “hard” interview problem. I don’t really care about that. But I can say I’ve definitely been asked this question in interviews (yes, plural).

Outside of that context, I actually find the problem itself very interesting. So let’s explore it.

Task

The task definition is straightforward. We have to design a data structure that supports two operations:

Add a new [integer] number, and
Return the median of all numbers added so far.

In other words, the task can be summarized to implementing the following:

class MedianStream {
 public:
  // Constructor
  MedianStream();
  
  // Adds integer value to the stream
  void addNumber(int x);
  
  // Returns median of all numbers seen so far
  double getMedian();
};

[Github gist link]

For fun’s sake, let’s assume that every call to addNumber is immediately followed by a call to getMedian. This makes the problem get closer to its worst-case, and also makes our runtime analysis a bit cleaner.

Example

MedianStream* medianStream = new MedianStream();
// List is empty

medianStream->addNumber(10);
// Sorted list is now [10]

medianStream->getMedian();
// Returns 10

medianStream->addNumber(20);
// Sorted list is now [10, 20]

medianStream->getMedian()
// Returns 15

medianStream->addNumber(0);
// Sorted list is now [0, 10, 20];

medianStream->getMedian()
// Returns 10

I’ll talk about four different approaches to this problem here. The first two are unequally bad. The last two are “equally” good. If you face this problem in an interview (as I have multiple times before), I strongly advise to go with the better ones.

Lazy brute force

I don’t like calling approaches lazy, but this one actually deserves it. It is a lazy brute force because it consists of the laziest (in terms of brain-power) implementations of both functions that interest the problem.

void addNumber(int x): Simply append the value x to some list values.
double getMedian(): Find and return the median of values. How? Sort the whole list and get the median!

class MedianStream {
 public:
  // Constructor
  MedianStream() {
    numValues = 0;
  }
  
  // Adds integer value to the stream
  void addNumber(int x) {
    numValues++;
    values.push_back(x);
  }
  
  // Returns median of all numbers seen so far
  // Assuming this function is not called on empty lists
  double getMedian() {
    // Sort values
    sort(values.begin(), values.end());
    
    // Calculate and return the median
    if (numValues % 2 == 0) {
      return (values[(numValues-1)/2] +
              values[numValues/2])/2.0;
    } else {
      return values[(numValues-1)/2];
    }
  }

 private:
  int numValues;
  vector<int> values;
};

[Github gist link]

Runtime

The runtime of addNumber is great. We’re just adding a number to the end of a list, so it’s O(1).

The runtime of getMedian sucks. That little call to the sort function is the bottleneck. Remember that sorting a list of N numbers costs O(N log N). And that’s per call of getMedian!

Hence, calling addNumber followed by getMedian N times will lead to a combined runtime of O(N * 1 + N * N log N), totaling at an unfortunate O(N^2 log N).

Not a good way to pass an interview.

Brute force (but not lazy)

The great Yogi Berra said “you can observe a lot by just watching.” Indeed, if you look into what the current implementation of getMedian is doing, you’ll observe that we only care about one or two values of the list.

With that in mind, sorting the entire list before accessing one or two positions is not only overkill, but it’s also costly. If all we want to do is find the k-th smallest element of a list, the Quickselect algorithm will give you that answer in linear time. In C++, you could call the function nth_element.

I’ll not go into detail for the idea and implementation of those algorithms since they are not the main part of this note. However, I really recommend checking them out. They might come in handy (and they certainly have to me.)

Runtime

Nothing changed in addNumber from the previous implementation. It still sits at a wonderful O(1).

But now the runtime o getMedian sucks less! The one or two runs of the Quickselect algorithm will take linear time (O(N)), which is an improvement from the previous O(N log N). Again, that’s O(N) per call of getMedian.

In the end, N calls of each function (alternating) will total O(N*1 + N*N), or simply O(N^2).

We can do better.

This is a picture of my dog, Fox, to serve as break before heading to the next ideas.

The actual solution

Let’s make Yogi Berra proud and watch a little more.

Every time we query for the one or two elements needed to find the median, we always query for the “same” position in the list. That is, we are always trying to figure out the middle of the list (we are interested in the median, after all.)

Also, first or second-year CS students learn about basic data structures that let us do insertions while preserving some type of order. The most popular ones are ordered sets and heaps. Those structures can only efficiently give you the smallest/largest element in a list. If you try to find out the middle element of a set, you’re gonna have a bad [run]time. Let’s assume then we can only know the smallest or largest element of a list, and solve the problem with that tool.

Finding the middle

Forget about the problem setup for a second.

Say we have a list S. Let’s split the list in two halves, and put the smaller elements on a list L and the largest at R. If S originally had N elements and N is even, L and R each have N/2. If N is odd, let’s say L has the smallest (N+1)/2 and R has the largest (N-1)/2 elements.

How do we find the median of S?

If N is odd, the median of S is the smallest (N+1)/2-th element (1-indexed). This is the largest element of L!

If N is even, the median will be the average of the smallest N/2-th and (N+2)/2-th elements (1-indexed again). In other words, this will be the average of the largest element of L and the smallest element of R.

Thus, what we need to efficiently find a median is for L to provide us with its largest element quickly, and for R to give us its smallest element. We can simply make L be a max-heap and R be a min-heap.

Updating `L` and `R`

Since L and R were defined as the left and right halves of the original list, there are two properties in them we actually care about:

No element in L is greater than an element in R, and
0 <= size(L) - size(R) <= 1. This is equivalent of the previous definition we used of allocating the elements as a function of N. However, this doesn’t make us depend on N, which is nice.

Assume those two properties are currently being respected. When a new element X arrives, we want to add it to either L or R, and make sure those properties continue being respected. This is an overview of the algorithm:

If X > max(L), add it to R. Otherwise, add it to L.
- There’s a corner case to watch out here: if L is empty (as it will be in the beginning of the algorithm), max(L) would be undefined. We handle this case by simply adding X to L when it is empty.
If size(R) > size(L), add min(R) to L, and then remove it from R.
If size(L) > size(R) + 1, add max(L) to R and remove it from L.

The first bullet point guarantees property 1 is respected. The second and third bullet points are updating the lists in order to enforce property 2 (without breaking 1).

Note that, just as we expected, we only need to interact with the largest element in L and the smallest element in R. This makes our use of min and max-heaps appropriate.

Ok, where’s the code?

Here:

class MedianStream {
 public:
  // Constructor
  MedianStream() {
    numValues = 0;
  }
  
  // Adds integer value to the stream
  void addNumber(int x) {
    numValues++;
    
    // The algorithm described above
    if (!leftHalf.empty() && x > leftHalf.top()) {
      rightHalf.push(x);
    } else {
      leftHalf.push(x);
    }
    
    if (rightHalf.size() > leftHalf.size()) {
      leftHalf.push(rightHalf.top());
      rightHalf.pop();
    }
    
    if (leftHalf.size() > rightHalf.size() + 1) {
      rightHalf.push(leftHalf.top());
      leftHalf.pop();
    }
  }
  
  // Returns median of all numbers seen so far
  // Assuming this function is not called on empty lists
  double getMedian() {
    // Handling cases for odd and even numValues
    if (numValues % 2 == 0) {
      return (leftHalf.top() + rightHalf.top())/2.0;
    } else {
      return leftHalf.top();
    }
  }

 private:
  int numValues;
  // This is a max-heap in C++:
  priority_queue<int> leftHalf;
  // This is a min-heap in C++:
  priority_queue<int, vector<int>, greater<int>> rightHalf;
};

[Github gist link]

I heavily used the priority_queue structure from C++. Most languages have some default min/max-heap. Coding your own heap is not hard, but why would you do that?

Runtime

It may feel like we took a step backwards with addNumber. Before, without any thinking whatsoever, we had an O(1) implementation for it. Now, we have these [constant number] of heap pushes and pops. Remember that each of those operations takes O(log N) time. Thus, our new version of addNumber costs us O(log N).

We improved getMedian by as much as we possibly could. Its very short implementation is not deceiving. All we do is query for one or two roots of priority queues. Those happen in constant time. Then, this amazing version of getMedian costs… O(1).

After N operations of each of them, we’ll settle at O(N log N + N * 1), or simply O(N log N).

This is how you pass an interview.

Less code and less elegance

Don’t get me wrong, I love elegant solutions like the one above. It’s nice solving “complex” problems using only basic tools. Any introductory course to data structures will teach heaps, and that is literally all you need to solve this “hard” problem.

But I’m also a competitive programmer. When push comes to shove, I’d gladly throw away all my finesse if can save some lines or thinking time. Especially if the runtime is unaffected in the process.

Basically, what I’m saying is that this problem could be solved by importing a data structure. Yup. Take a look at the Wikipedia page on the Ordered Statistic Tree. It supports:

Insertions of numbers in O(log N).
Finding the k-th smallest number in O(log N) time.

It even feels like it was tailor-made for this problem. In fact, the ordered statistic tree supports even more operations than those two.

Another data structure that would immediately solve this problem (by supporting the two previous operations in O(log N) time) is the Treap. In my previous experience, it is much easier to understand a Treap than to actually implement one. Be my guest and have fun coding one.

Overall

This problem can be solved in O(N log N) time for N operations of each kind (alternating). There is a solution I find beautiful, and one that is less so. There are also brute-force solutions, but they are for losers. You’re not one.

I believe beauty in problem-solving has a time and a place. This blog is the place for it. I love the fact that this problem can be solved with primitive tools, rather than using complicated ones. When it’s your turn to solve this problem, feel free to choose the solution you want. I’m not your dad.

Just make sure it is fast.

Lucca's Blog

Discussion about this post

Lucca's Blog

Classic Problem: Streaming Median

For some reason, this is considered hard on LeetCode

Task

Example

Lazy brute force

Runtime

Brute force (but not lazy)

Runtime

The actual solution

Finding the middle

Updating L and R

Ok, where’s the code?

Runtime

Less code and less elegance

Overall

Discussion about this post

Updating `L` and `R`