My question is simple, but I find it difficult to get the point across concisely, so please allow me to explain step by step.
Suppose I have N items and N corresponding indices. Each item can be loaded using the corresponding index.
def load_item(index: int) -> ItemType:
    # Mostly just reading, but very slow.
    return item
I also have a function that takes two (loaded) items and calculates a score.
def calc_score(item_a: ItemType, item_b: ItemType) -> ScoreType:
    # Much faster than the load function.
    return score
Note that calc_score(a, b) == calc_score(b, a).
What I want to do is calculate the score for all 2-item combinations and find (at least) one combination that gives the maximum score.
This can be implemented as follows:
def dumb_solution(n: int) -> Tuple[int, int]:
    best_score = 0
    best_combination = None
    for index_a, index_b in itertools.combinations(range(n), 2):
        item_a = load_item(index_a)
        item_b = load_item(index_b)
        score = calc_score(item_a, item_b)
        if score > best_score:
            best_score = score
            best_combination = (index_a, index_b)
    return best_combination
However, this solution calls the load_item function 2*C(N,2) = N*(N-1) times, which is the bottleneck (for N = 1000, that is 999,000 loads for only 1,000 distinct items).
This can be resolved by using a cache. Unfortunately, the items are so large that it is impossible to keep them all in memory, so we need a size-limited cache.
from functools import lru_cache

@lru_cache(maxsize=M)
def load_item(index: int) -> ItemType:
    # Very slow process.
    return item
Note that M (the cache size) is much smaller than N (approximately N // 10 to N // 2).
The problem is that the typical sequence of combinations is not ideal for the LRU cache.
For instance, when N=6 and M=3, itertools.combinations generates the following sequence, and the number of calls of the load_item function is 17. (In the lists below, the comment after each pair shows the running load count for each element of the pair, with - meaning a cache hit.)
[
    (0, 1),  # 1, 2
    (0, 2),  # -, 3
    (0, 3),  # -, 4
    (0, 4),  # -, 5
    (0, 5),  # -, 6
    (1, 2),  # 7, 8
    (1, 3),  # -, 9
    (1, 4),  # -, 10
    (1, 5),  # -, 11
    (2, 3),  # 12, 13
    (2, 4),  # -, 14
    (2, 5),  # -, 15
    (3, 4),  # 16, 17
    (3, 5),  # -, -
    (4, 5),  # -, -
]
However, if I rearrange the above sequence as follows, the number of calls will be 10.
[
    (0, 1),  # 1, 2
    (0, 2),  # -, 3
    (1, 2),  # -, -
    (0, 3),  # -, 4
    (2, 3),  # -, -
    (0, 4),  # -, 5
    (3, 4),  # -, -
    (0, 5),  # -, 6
    (4, 5),  # -, -
    (1, 4),  # 7, -
    (1, 5),  # -, -
    (1, 3),  # -, 8
    (3, 5),  # -, -
    (2, 5),  # 9, -
    (2, 4),  # -, 10
]
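To double-check these counts, here is a minimal sketch that replays an ordering through functools.lru_cache and counts the loads. count_loads is just an illustrative helper, and the stub load_item returns the index itself instead of a real item:

import functools
import itertools

def count_loads(pairs, cache_size: int) -> int:
    # Replay an ordering of index pairs and count cache misses (loads).
    loads = 0

    @functools.lru_cache(maxsize=cache_size)
    def load_item(index: int) -> int:
        nonlocal loads
        loads += 1
        return index  # stub; the real loader returns a large item

    for a, b in pairs:
        load_item(a)
        load_item(b)
    return loads

print(count_loads(itertools.combinations(range(6), 2), 3))  # -> 17
reordered = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3), (0, 4), (3, 4),
             (0, 5), (4, 5), (1, 4), (1, 5), (1, 3), (3, 5), (2, 5), (2, 4)]
print(count_loads(reordered, 3))  # -> 10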
Question:
How can I generate a sequence of 2-item combinations that maximizes the cache hit rate?
What I tried:
The solution I came up with is to prioritize items that are already in the cache.
from collections import OrderedDict

def prioritizes_item_already_in_cache(n, cache_size):
    items = list(itertools.combinations(range(n), 2))
    cache = OrderedDict()
    reordered = []

    def update_cache(x, y):
        cache[x] = cache[y] = None
        cache.move_to_end(x)
        cache.move_to_end(y)
        while len(cache) > cache_size:
            cache.popitem(last=False)

    while items:
        # Find a pair where both are cached.
        for i, (a, b) in enumerate(items):
            if a in cache and b in cache:
                reordered.append((a, b))
                update_cache(a, b)
                del items[i]
                break
        else:
            # Find a pair where one of them is cached.
            for i, (a, b) in enumerate(items):
                if a in cache or b in cache:
                    reordered.append((a, b))
                    update_cache(a, b)
                    del items[i]
                    break
            else:
                # Cannot find item in cache.
                a, b = items.pop(0)
                reordered.append((a, b))
                update_cache(a, b)
    return reordered
For N=100, M=10, this sequence resulted in 1660 calls, which is about 1/3 of the typical sequence. For N=100, M=50, there are only 155 calls. So this looks like a promising approach.
Unfortunately, this function is too slow to be useful for large N: it rescans the remaining list of up to C(N,2) pairs at every step, so it is roughly O(N^4) overall. I was not able to finish for N=1000, but the actual data is in the tens of thousands. Also, it does not take into account how to select a pair when no cached item is found. Therefore, even if it were fast, it is doubtful that it is theoretically the best solution (so please note that my question is not how to make the above function faster).
(Edited) Here is the complete code including everyone's answers and the test and benchmark code.
import functools
import itertools
import math
import time
from collections import Counter, OrderedDict
from itertools import chain, combinations, product
from pathlib import Path
from typing import Callable, Iterable, Tuple

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw

ItemType = int
ScoreType = int

def load_item(index: int) -> ItemType:
    return int(index)

def calc_score(item_a: ItemType, item_b: ItemType) -> ScoreType:
    return abs(item_a - item_b)

class LRUCacheWithCounter:
    def __init__(self, maxsize: int):
        def wrapped_func(key):
            self.load_count += 1
            return load_item(key)

        self.__cache = functools.lru_cache(maxsize=maxsize)(wrapped_func)
        self.load_count = 0

    def __call__(self, key: int) -> int:
        return self.__cache(key)

def basic_loop(iterator: Iterable[Tuple[int, int]], cached_load: Callable[[int], int]):
    best_score = 0
    best_combination = None
    for i, j in iterator:
        a = cached_load(i)
        b = cached_load(j)
        score = calc_score(a, b)
        if score > best_score:
            best_score = score
            best_combination = (i, j)
    return best_score, best_combination

def baseline(n, _):
    return itertools.combinations(range(n), 2)

def prioritizes(n, cache_size):
    items = list(itertools.combinations(range(n), 2))
    cache = OrderedDict()
    reordered = []

    def update_cache(x, y):
        cache[x] = cache[y] = None
        cache.move_to_end(x)
        cache.move_to_end(y)
        while len(cache) > cache_size:
            cache.popitem(last=False)

    while items:
        # Find a pair where both are cached.
        for i, (a, b) in enumerate(items):
            if a in cache and b in cache:
                reordered.append((a, b))
                update_cache(a, b)
                del items[i]
                break
        else:
            # Find a pair where one of them is cached.
            for i, (a, b) in enumerate(items):
                if a in cache or b in cache:
                    reordered.append((a, b))
                    update_cache(a, b)
                    del items[i]
                    break
            else:
                # Cannot find item in cache.
                a, b = items.pop(0)
                reordered.append((a, b))
                update_cache(a, b)
    return reordered

def Matt_solution(n: int, cache_size: int) -> Iterable[Tuple[int, int]]:
    dest = []

    def findPairs(lo1: int, n1: int, lo2: int, n2: int):
        if n1 < 1 or n2 < 1:
            return
        if n1 == 1:
            for i in range(max(lo1 + 1, lo2), lo2 + n2):
                dest.append((lo1, i))
        elif n2 == 1:
            for i in range(lo1, min(lo1 + n1, lo2)):
                dest.append((i, lo2))
        elif n1 >= n2:
            half = n1 // 2
            findPairs(lo1, half, lo2, n2)
            findPairs(lo1 + half, n1 - half, lo2, n2)
        else:
            half = n2 // 2
            findPairs(lo1, n1, lo2, half)
            findPairs(lo1, n1, lo2 + half, n2 - half)

    findPairs(0, n, 0, n)
    return dest

def Kelly_solution(n: int, cache_size: int) -> Iterable[Tuple[int, int]]:
    k = cache_size // 2
    r = range(n)
    return chain.from_iterable(
        combinations(r[i : i + k], 2) if i == j else product(r[i : i + k], r[j : j + k])
        for i in r[::k]
        for j in r[i::k]
    )

def Kelly_solution2(n: int, cache_size: int) -> Iterable[Tuple[int, int]]:
    k = cache_size - 2
    r = range(n)
    return chain.from_iterable(
        combinations(r[i : i + k], 2) if i == j else product(r[i : i + k], r[j : j + k])
        for i in r[::k]
        for j in r[i::k]
    )

def diagonal_block(lower, upper):
    for i in range(lower, upper + 1):
        for j in range(i + 1, upper + 1):
            yield i, j

def strip(i_lower, i_upper, j_lower, j_upper):
    for i in range(i_lower, i_upper + 1):
        for j in range(j_lower, j_upper + 1):
            yield i, j

def btilly_solution(n: int, cache_size: int):
    i_lower = 0
    i_upper = n - 1
    k = cache_size - 2
    is_asc = True
    while i_lower <= i_upper:
        # Handle a k*k block first. At the end that is likely loaded.
        if is_asc:
            upper = min(i_lower + k - 1, i_upper)
            yield from diagonal_block(i_lower, upper)
            j_lower = i_lower
            j_upper = upper
            i_lower = upper + 1
        else:
            lower = max(i_lower, i_upper - k + 1)
            yield from diagonal_block(lower, i_upper)
            j_lower = lower
            j_upper = i_upper
            i_upper = lower - 1
        yield from strip(i_lower, i_upper, j_lower, j_upper)
        is_asc = not is_asc

def btilly_solution2(n: int, cache_size: int):
    k = cache_size - 2
    for top in range(0, n, k):
        bottom = top + k
        # Diagonal part.
        for y in range(top, min(bottom, n)):  # Y-axis top to bottom
            for x in range(y + 1, min(bottom, n)):  # X-axis left to right
                yield y, x
        # Strip part.
        # Stripping right to left works well when cache_size is very small,
        # but makes little difference when it is not.
        for x in range(n - 1, bottom - 1, -1):  # X-axis right to left
            for y in range(top, min(bottom, n)):  # Y-axis top to bottom
                yield y, x

def btilly_solution3(n: int, cache_size: int):
    k = cache_size - 2
    r = range(n)
    for i in r[::k]:
        yield from combinations(r[i : i + k], 2)
        yield from product(r[i + k :], r[i : i + k])

def btilly_solution4(n: int, cache_size: int):
    def parts():
        k = cache_size - 2
        r = range(n)
        for i in r[::k]:
            yield combinations(r[i : i + k], 2)
            yield product(r[i + k :], r[i : i + k])

    return chain.from_iterable(parts())

def plot(df, series, ignore, y, label, title):
    df = df[df["name"].isin(series)]
    # plt.figure(figsize=(10, 10))
    for name, group in df.groupby("name"):
        plt.plot(group["n"], group[y], label=name)
    y_max = df[~df["name"].isin(ignore)][y].max()
    plt.ylim(0, y_max * 1.1)
    plt.xlabel("n")
    plt.ylabel(label)
    plt.title(title)
    plt.legend(loc="upper left")
    plt.tight_layout()
    plt.grid()
    plt.show()

def run(func, n, cache_ratio, output_dir: Path):
    cache_size = int(n * cache_ratio / 100)
    output_path = output_dir / f"{n}_{cache_ratio}_{func.__name__}.csv"
    if output_path.exists():
        return
    started = time.perf_counter()
    for a, b in func(n, cache_size):
        pass
    elapsed_iterate = time.perf_counter() - started
    # test_combinations(func(n, cache_size), n)
    started = time.perf_counter()
    cache = LRUCacheWithCounter(cache_size)
    basic_loop(iterator=func(n, cache_size), cached_load=cache)
    elapsed_cache = time.perf_counter() - started
    output_path.write_text(f"{func.__name__},{n},{cache_ratio},{cache_size},{cache.load_count},{elapsed_iterate},{elapsed_cache}")

def add_lower_bound(df):
    def calc_lower_bound(ni, mi):
        n = ni
        m = n * mi // 100
        return m + math.ceil((math.comb(n, 2) - math.comb(m, 2)) / (m - 1))

    return pd.concat(
        [
            df,
            pd.DataFrame(
                [
                    {"name": "lower_bound", "n": ni, "m": mi, "count": calc_lower_bound(ni, mi)}
                    for ni, mi in itertools.product(df["n"].unique(), df["m"].unique())
                ]
            ),
        ]
    )

def benchmark(output_dir: Path):
    log_dir = output_dir / "log"
    log_dir.mkdir(parents=True, exist_ok=True)
    candidates = [
        baseline,
        prioritizes,
        Matt_solution,
        Kelly_solution,
        Kelly_solution2,
        btilly_solution,
        btilly_solution2,
        btilly_solution3,
        btilly_solution4,
    ]
    nc = np.linspace(100, 500, num=9).astype(int)
    # nc = np.linspace(500, 10000, num=9).astype(int)[1:]
    # nc = np.linspace(10000, 100000, num=9).astype(int).tolist()[1:]
    print(nc)
    mc = np.linspace(10, 50, num=2).astype(int)
    print(mc)
    joblib.Parallel(n_jobs=1, verbose=5, batch_size=1)(
        [joblib.delayed(run)(func, ni, mi, log_dir) for ni in nc for mi in mc for func in candidates]
    )

def plot_graphs(output_dir: Path):
    log_dir = output_dir / "log"
    results = []
    for path in log_dir.glob("*.csv"):
        results.append(path.read_text().strip())
    (output_dir / "stat.csv").write_text("\n".join(results))
    df = pd.read_csv(output_dir / "stat.csv", header=None, names=["name", "n", "m", "size", "count", "time", "time_full"])
    df = add_lower_bound(df)
    df = df.sort_values(["name", "n", "m"])
    for m in [10, 50]:
        plot(
            df[df["m"] == m],
            series=[
                baseline.__name__,
                prioritizes.__name__,
                Matt_solution.__name__,
                Kelly_solution.__name__,
                Kelly_solution2.__name__,
                btilly_solution.__name__,
                "lower_bound",
            ],
            ignore=[
                baseline.__name__,
                prioritizes.__name__,
            ],
            y="count",
            label="load count",
            title=f"cache_size = {m}% of N",
        )
    plot(
        df[df["m"] == 10],
        series=[
            baseline.__name__,
            prioritizes.__name__,
            Matt_solution.__name__,
            Kelly_solution.__name__,
            Kelly_solution2.__name__,
            btilly_solution.__name__,
            btilly_solution2.__name__,
            btilly_solution3.__name__,
            btilly_solution4.__name__,
        ],
        ignore=[
            prioritizes.__name__,
            Matt_solution.__name__,
        ],
        y="time",
        label="time (sec)",
        title=f"cache_size = {10}% of N",
    )

class LRUCacheForTest:
    def __init__(self, maxsize: int):
        self.cache = OrderedDict()
        self.maxsize = maxsize
        self.load_count = 0

    def __call__(self, key: int) -> int:
        if key in self.cache:
            value = self.cache[key]
            self.cache.move_to_end(key)
        else:
            if len(self.cache) == self.maxsize:
                self.cache.popitem(last=False)
            value = load_item(key)
            self.cache[key] = value
            self.load_count += 1
        return value

    def hit(self, i, j):
        count = int(i in self.cache)
        self(i)
        count += int(j in self.cache)
        self(j)
        return count

def visualize():
    # Taken from https://mcmap.net/q/697569/-how-to-maximize-the-cache-hit-rate-of-the-2-element-combinations and modified.
    n, m = 100, 30
    func = btilly_solution2
    pairs = func(n, m)
    cache = LRUCacheForTest(m)
    # Create the images, save as an animated GIF.
    images = []
    s = 5
    img = Image.new("RGB", (s * n, s * n), (255, 255, 255))
    draw = ImageDraw.Draw(img)
    colors = [(255, 0, 0), (255, 255, 0), (0, 255, 0)]
    for step, (i, j) in enumerate(pairs):
        draw.rectangle((s * j, s * i, s * j + s - 2, s * i + s - 2), colors[cache.hit(i, j)])
        if not step % 17:
            images.append(img.copy())
    images += [img] * 40
    images[0].save(f"{func.__name__}_{m}.gif", save_all=True, append_images=images[1:], optimize=False, duration=30, loop=0)

def test_combinations(iterator: Iterable[Tuple[int, int]], n: int):
    # Note that this function is not suitable for large N.
    expected = set(frozenset(pair) for pair in itertools.combinations(range(n), 2))
    items = list(iterator)
    actual = set(frozenset(pair) for pair in items)
    assert len(actual) == len(items), f"{[item for item, count in Counter(items).items() if count > 1]}"
    assert actual == expected, f"dup={actual - expected}, missing={expected - actual}"

def test():
    n = 100  # N
    cache_size = 30  # M

    def run(func):
        func(n, cache_size)
        # Measure generation performance.
        started = time.perf_counter()
        for a, b in func(n, cache_size):
            pass
        elapsed = time.perf_counter() - started
        # Test generated combinations.
        test_combinations(func(n, cache_size), n)
        # Measure cache hit (load count) performance.
        cache = LRUCacheWithCounter(cache_size)
        _ = basic_loop(iterator=func(n, cache_size), cached_load=cache)
        print(f"{func.__name__}: {cache.load_count=}, {elapsed=}")

    candidates = [
        baseline,
        prioritizes,
        Matt_solution,
        Kelly_solution,
        Kelly_solution2,
        btilly_solution,
        btilly_solution2,
        btilly_solution3,
        btilly_solution4,
    ]
    for f in candidates:
        run(f)

def main():
    test()
    visualize()
    output_dir = Path("./temp2")
    benchmark(output_dir)
    plot_graphs(output_dir)

if __name__ == "__main__":
    main()
I have no problem with you not using the above test code or changing the behavior of basic_loop or LRUCacheWithCounter.
Additional notes:
- The score calculation cannot be pruned using neighbor scores.
- The score calculation cannot be pruned using only a portion of the item.
- It is impossible to guess where the best combination will be.
- Using faster media is one option, but I'm already at my limit, so I'm looking for a software solution.
Thank you for reading this long post to the end.
Edit:
Thanks to btilly's answer and the help of Kelly's visualization, I have come to the conclusion that btilly's solution is the best and possibly the optimal one. In short, btilly's approach processes the indexes in blocks of k = M - 2: it first generates all pairs within a block, then pairs every remaining index against that block, so that each loaded block is reused as much as possible.
Here is a theoretical explanation (although I am not very good at math, so it could be wrong).
Let N represent the number of indexes, M the cache size, and C the number of combinations (same as math.comb).
Consider a situation where the cache is full and no further combinations can be generated without loading. If we load a new index at this point, the only new combinations that can be generated are those between the newly loaded index and the M - 1 indexes remaining in the cache. This pattern holds for each subsequent load, so while the cache is full, at most M - 1 combinations can be generated per load.
The same logic holds while the cache is filling up. If M' indexes are currently in the cache, then the next index can generate at most M' combinations, the index after that at most M' + 1 combinations, and so forth. In total, at most C(M, 2) combinations can be generated before the cache is full.
Thus, to generate all C(N, 2) combinations, at least M loads are required to fill the cache, and at least (C(N,2) - C(M,2)) / (M - 1) loads are required after the cache is filled. From the above, the load-count complexity of this problem is Ω(N^2 / M).
I have plotted this formula as lower_bound in the graphs below. Note that it is only a lower bound; there is no guarantee that it can actually be achieved.
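For concreteness, the bound is easy to evaluate directly. Here is a minimal sketch (the same formula as calc_lower_bound in the code above), checked against the prioritizes results reported earlier:

import math

def lower_bound(n: int, m: int) -> int:
    # M loads to fill the cache, then at most M - 1 new combinations per load.
    return m + math.ceil((math.comb(n, 2) - math.comb(m, 2)) / (m - 1))

print(lower_bound(100, 10))  # -> 555 (prioritizes needed 1660 loads)
print(lower_bound(100, 50))  # -> 127 (prioritizes needed 155 loads)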
As an aside, Kelly's solution needs k to be tuned to maximize its performance. For M = 50% of N, the best k is about M * 2/3; for M = 30% of N, it's about M * 5/6. I couldn't figure out how to calculate the optimal value, though. As a general configuration, I use k = M - 2 (which is not the best, but relatively good) in Kelly_solution2 in the graphs below.
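For what it's worth, since iterating a candidate ordering is cheap compared to the actual loads, one pragmatic (if brute-force) way to pick k is to simulate every block size with the counting cache. A minimal sketch; kelly_k and best_k are hypothetical helpers built on LRUCacheWithCounter from the code above:

from itertools import chain, combinations, product

def kelly_k(n: int, k: int):
    # Kelly's block scheme with an explicit block size k.
    r = range(n)
    return chain.from_iterable(
        combinations(r[i : i + k], 2) if i == j else product(r[i : i + k], r[j : j + k])
        for i in r[::k]
        for j in r[i::k]
    )

def best_k(n: int, cache_size: int) -> int:
    # Simulate each block size and keep the one with the fewest loads.
    def loads(k: int) -> int:
        cache = LRUCacheWithCounter(cache_size)
        for a, b in kelly_k(n, k):
            cache(a)
            cache(b)
        return cache.load_count

    return min(range(2, cache_size + 1), key=loads)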
For M = 10% of N: [graph]
For M = 50% of N: [graph]
Note that, in these graphs, the load count looks like O(N), but this is only because I set M proportional to N; when M is fixed, it is O(N^2) as described above.
Here is an animation visualizing the cache hit rate of btilly_solution2, created with a modified version of Kelly's code. Each pixel represents a combination, with red representing combinations where both indexes are loaded, yellow where one index is loaded, and green where neither index is loaded. [animation]
In addition, since I'm looking for the optimal sequence, execution time doesn't matter much. But just in case anyone is curious, here is a comparison of execution times (iteration only). [graph]
btilly_solution4 (btilly's solution as modified by Kelly) is almost as fast as itertools.combinations, which should be optimal in this case. Note, however, that even without the modification, it took only 112 nanoseconds per combination.
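If you want to measure iteration time yourself, here is a minimal sketch using the deque trick suggested in the comments below to consume a generator with minimal Python-level overhead (absolute numbers will of course vary by machine):

import time
from collections import deque

def time_iteration(func, n: int, cache_size: int) -> float:
    # deque(iterable, maxlen=0) exhausts an iterator at C speed, discarding items.
    started = time.perf_counter()
    deque(func(n, cache_size), maxlen=0)
    return time.perf_counter() - started

# Example: time_iteration(btilly_solution4, 10_000, 1_000)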
That's it. Thanks to everyone involved.
Comments:
- […] calc_score function which is relatively fast. I'm guessing this calc_score function only needs some features of the items, and not the items as a whole? In this case, load the items, but instead of keeping them all in memory, only keep the relevant information which is needed by calc_score. – Nevil
- […] calc_score is fast enough. – Nevil
- […] calc_score? You said that calc_score(a, b) == calc_score(b, a), which is very good. Perhaps calc_score has other cool properties. Can calc_score represent something akin to geographic proximity, so that if (a, b) has a good score and (b, c) has a bad score, then (a, c) has a bad score? Or geographic distance, so that if (a, b) has a bad score and (b, c) has a bad score, then (a, c) has a bad score? Maybe the triangular inequality calc_score(a, b) + calc_score(b, c) >= calc_score(a, c) or calc_score(a, c) >= abs(calc_score(a, b) - calc_score(b, c))? – Nevil
- […] N(N+1)/2*2 = N(N+1). calc_score is user-defined (by me, to be exact), and it should be possible to change it in the future. So, I like to avoid adding constraints whenever possible. Currently, only commutativity can be guaranteed. calc_score is fast because the machine has powerful CPUs (and weak IO). – Manumission
- […] for index_a, index_b in ...:, i.e., you don't keep a reference to each pair. But your benchmark's for _ in func(n, cache_size): does, and that prevents optimizations, making some solutions look slower. You could do deque(func(n, cache_size), 0) instead of your loop; that would allow those optimizations again and also would be less overhead than a Python-level loop. Btw I'm curious how you changed our solutions to generators. That probably made them slower. Maybe your edit would better be an answer, with tested codes included. – Ketchan
- […] for a, b in func(...) should be the closest to my actual code, I will re-run the benchmark with it. However, I believe the difference in execution speed is mainly caused by something else, so I think it will remain the same in the big picture. – Manumission
- […] for a, b in func(n, cache_size): pass took me 15 seconds while deque(func(n, cache_size), 0) only took 2.4 seconds. So that's making it look over six times slower than it actually is. (Granted, its speed and the speeds of most of the solutions are likely insignificant compared to the actual work you do with the pairs. But I still think inflating the times with consumer-time is misleading.) – Ketchan
- […] baseline and Kelly_solution with deque(func(n, cache_size), 0), at N=100k, on my PC (not the same one I used above), I got 21 and 38 seconds, 17 seconds difference. So (ignoring the difference in cache hit rates) changing baseline to Kelly_solution will impact my program by 17 seconds? No. The comparison using a for-loop was 66 and 74 seconds, 8 seconds difference. Since I use a for-loop, 8 seconds difference would be closer to the actual impact on my program. – Manumission