Friday, May 5, 2017

A puzzle about game review scoring in Haskell

Everyone knows that video game reviewers tend to hype up new releases, leading to score inflation over time. To fight that, we could ask reviewers to estimate the ordinal rank of each new game (e.g. "this game will be my GOTY" or "this game will be in my top 5 this year") instead of meaningless scores like "8/10". That makes sense because people want to know how a game compares to the best, not to the worst, and whether they should spend money now rather than next month. That way, reviewers can't rank too many games too highly, scores can be compared across years, and everyone's happy.

That leads to an interesting algorithmic problem. Let's say Bob the reviewer played five games in a year, and had this to say about them at the time:

  • "Game A will be in my top 3 games released this year."
  • "Game B will be in my top 2 games released this year."
  • "Game C will be my GOTY."
  • "Game D will be in my top 3 games released this year."
  • "Game E will be in my top 5 games released this year."

That translates to giving games A,B,C,D,E scores 3,2,1,3,5 (lower means better). But somewhere around game D, it seems like Bob's reviews lost credibility, because there are four games in his top 3!

Our problem will be taking a list of such scores, and finding the length of its longest "credible" prefix. For example, for a list like [3,2,1,3,5] above, the answer should be 3.

(Try solving the puzzle yourself if you like. Spoilers below.)

First of all, what does it mean for a list to be "credible"? [1,2,3] is clearly credible, and so is [3,3,3]. But [1,2,2] isn't, because there are three games in the top 2. How do we make that precise?

After thinking for a minute, you realize that that a list is credible if its smallest element is at least 1, the next smallest is at least 2, and so on. In other words, if its sorted version is elementwise greater than [0,1,2,...]. So we can easily check credibility in O(n log n):

import Data.List (sort)

credible :: [Int] -> Bool
credible xs = and (zipWith (<) [0..] (sort xs))

That might be optimal for a pure language, though proving that seems like a research problem. If mutation is allowed, we can build a histogram instead of sorting, leading to O(n):

import Control.Monad (forM_)
import Data.Array (elems)
import Data.Array.ST (runSTArray, newArray, readArray, writeArray)

credible2 :: [Int] -> Bool
credible2 xs = and $ zipWith (>=) [0..] $ scanl1 (+) $ elems $ runSTArray $ do
    let n = length xs
    a <- newArray (0, n) 0
    forM_ xs $ \x -> do
      let i = max 0 (min n x)
      y <- readArray a i
      writeArray a i (y + 1)
    return a

To find the length of the longest credible prefix, we could just call the previous function for each prefix, leading to O(n^2 log n) without mutation and O(n^2) with mutation:

credibleLength :: [Int] -> Int
credibleLength xs = maximum [n | n <- [0..length xs], credible (take n xs)]

But that does a lot of wasted work, processing the same data over and over again. It's better to find the right prefix length using binary search, leading to O(n log^2 n) without mutation and O(n log n) with mutation:

credibleLength2 :: [Int] -> Int
credibleLength2 xs = grow 1 where
  check n = credible (take n (xs ++ repeat 0))
  grow n = if check n then grow (n * 2) else shrink (div n 2) n
  shrink m n = if m + 1 == n then m else
    let k = div (m + n) 2 in if check k then shrink k n else shrink m k

Note how the binary search has "grow" and "shrink" phases, to keep the complexity dependent on the length of the longest credible prefix instead of the whole list.

But even that's still not optimal. Without mutation, the grow phase takes O(n log n) time, but the shrink phase takes O(n log^2 n) because we're sorting subsets of the same data over and over. Instead we could sort it all once and then ignore certain elements on each pass. That requires a more complicated algorithm, pairing the elements with their indices before sorting, but lets us achieve O(n log n) without mutation:

credibleLength3 :: [Int] -> Int
credibleLength3 xs = grow 1 where
  grow n = if check n then grow (2 * n) else shrink (div n 2) n where
    cache = sort (take n (zipWith (,) (xs ++ repeat 0) [0..]))
    check k = and (zipWith (<) [0..] (map fst (filter ((<k) . snd) cache)))
    shrink m n = if m + 1 == n then m else
      let k = div (m + n) 2 in if check k then shrink k n else shrink m k

But even that can be improved! If mutation is allowed, we can get very close to O(n) by using a completely different approach, keeping track of occupied intervals in a union-find data structure. The only problem is that it won't work as well with streaming, because we need enough memory for the whole list. But otherwise this ugly imperative-looking code should have the best asymptotic time so far:

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE FlexibleContexts #-}

import Control.Monad (when)
import Control.Monad.ST (runST)
import Data.Array.ST (STArray, newArray, readArray, writeArray)
import Data.UnionFind.ST (Point, descriptor, fresh, repr, union)

credibleLength4 :: [Int] -> Int
credibleLength4 xs = runST $ do
  let n = length xs
  (arr :: STArray s Int (Maybe (Point s Int))) <- newArray (1, n) Nothing
  let
    update x nothing just = do
      val <- readArray arr x
      maybe nothing just val
    insert x = do
      point <- fresh x
      writeArray arr x (Just point)
      when (x > 1) (update (x - 1) (return ()) (\p -> union point p))
      when (x < n) (update (x + 1) (return ()) (\p -> union p point))
      return True
    check x = if x < 1 then return False else update y (insert y) just where
      y = min x n
      just point = do
        z <- descriptor point
        if z == 1 then return False else insert (z - 1)
  ys <- mapM check xs
  return (length (takeWhile id ys))

To summarize, without mutation both problems seem to require O(n log n) time, but mutation allows us to get linear time on one and quasilinear on the other. Fun!

UPDATE: in the Reddit discussion users djfletch and foBrowsing have proposed using IntSet to track unoccupied positions, leading to another O(n log n) solution that's very fast in practice. Here's the code, slightly adapted from their proposals:

import Data.IntSet (empty, delete, lookupLE, union, fromList)

credibleLength5 xs = go xs 1 empty where
  go [] _ _ = 0
  go (x:xs) nextHole holes = if x < nextHole
    then maybe 0 (\y -> 1 + go xs nextHole (delete y holes)) (lookupLE x holes)
    else 1 + go xs (x + 1) (union holes (fromList [nextHole .. x - 1]))

It's impressively short, and I still need to learn a lot about Haskell to write like that!