Optimization with NumPy and Rust

This is a quick post about optimizing algorithms written in Python with NumPy, and implementing the same code in Rust. We’re going to take a look at one of the Advent of Code challenges, solve it, optimize it with NumPy, and then write both of our solutions in Rust.

If you’re unfamiliar with Advent of Code, it’s an advent calendar of programming related puzzles which has run each December since 2015. Every day from December 1st until the 25th, a new challenge is posted and you’re encouraged to submit a solution.

Analyzing the Challenge

Let’s take a look at AoC’s Day 3 challenge.

This problem requires us to count the number of overlapping claims in a given area. That area, which is the fabric in the challenge, can be represented by a 2-dimensional array of zeros in a naive solution to this problem.

Every item in the 2-dimensional array will keep a count of how many claims take ownership of that item.

Each claim can be thought of as another smaller 2-dimensional array of ones that we add to the larger 2-dimensional array. When we are finished, we can count how many items in the larger 2-dimensional array are greater than 1. Those items will represent the square inches of fabric that belong to more than one claim.

Pure Python Solution

Here’s a naive solution written in Python. While in the above analysis I said that we can think of each claim as being a 2-dimensional array of ones that we’ll add to the greater 2-dimensional array, our implementation will differ.

We’ll iterate over the Cartesian coordinates of each claim and add 1 to the corresponding item in the greater 2-dimensional array.

from __future__ import annotations
from typing import NamedTuple
from itertools import product


NO_CLAIM: int = 0
INCREMENT: int = 1
MIN_OVERLAP: int = 2
SIDE_LENGTH: int = 1050


class Claim(NamedTuple):
  x: int
  y: int
  width: int
  height: int


Claims = list[Claim]
Fabric = list[list[int]]


def get_fabric(claims: Claims, length: int = SIDE_LENGTH) -> Fabric:
  fabric = [[NO_CLAIM] * length] * length

  for x, y, width, height in claims:
    x_indices = range(x, x + width)
    y_indices = range(y, y + height)

    for x, y in product(x_indices, y_indices):
      fabric[x][y] += INCREMENT

  return fabric


def count_overlapping(fabric: Fabric) -> int:
  return sum(
    INCREMENT 
    for row in fabric 
    for item in row 
    if item >= MIN_OVERLAP
  )

# generate list of claims from challenge input file
claims: Claims = ...

# let's profile our solution
%timeit get_fabric(claims)
fabric = get_fabric(claims)
%timeit count_overlapping(fabric)

Here’s our output on an Intel processor:

287 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It took nearly half of a second to process roughly one thousand claims. We can do better than that.

Optimization with NumPy

Vecorization

NumPy has a concept called Universal functions, or ufuncs, which are vectorized functions over ndarrays. Vectorized functions work over entire arrays and, if the underlying hardware supports it, can take advantage of SIMD instructions.

SIMD stands for single instruction, mutiple data. SIMD instructions allow us to apply an instruction over multiple points of data at the same time. They implement parallelism at the data-level.

SIMD diagram created by Colin M.L. Burnett for Wikipedia

Examples of SIMD instructions are MMX, SSE/SSE2/SSE3/SSE4 and AVX instruction sets on x86 processors and NEON instruction sets on ARM processors.

When all that’s said and done, there’s no guarantee that NumPy will use SIMD instructions in our program. However, we’ll still benefit from a performance boost, because NumPy is built upon natively compiled and architectually efficient data structures and operations.

I have several clustered ARMv7 machines. In the past, NumPy wasn’t optimized for ARM’s NEON instruction set in the same way that it’s optimized for x86 processors with SSE and AVX instructions. GCC’s ability to optimize NumPy at compile time will take advantage of NEON, although in a more generalized fashion.

Then, in late 2019, NumPy received some new NEON optimizations.

Even when NumPy wasn’t optimized for NEON, I still saw a performance increase when using NumPy instead of pure Python.

Optimizing Our Solution

In our previous solution, get_fabric() and count_overlapping() are sequential functions. That is to say that we apply an operation to each item in the array sequentially: we increment the first element in the array, then the next element and so on.

To optimize our solution, we’ll need to use something beyond just pure Python.

NumPy is a Python wrapper over lower-level C and Fortran code that implements optimized and natively compiled routines, so we’ll use that.

Here, we’ll use NumPy’s add() vectorized ufunc instead of sequentially adding one to each item in the fabric array.

import numpy as np

Fabric = np.ndarray

def get_fabric(claims: Claims, length: int = SIDE_LENGTH) -> Fabric:
  fabric = np.zeros((length, length), dtype=np.int8)

  for x, y, width, height in claims:
    x_indices = slice(x, x + width)
    y_indices = slice(y, y + height)
    indices = x_indices, y_indices

    fabric[indices] += INCREMENT

  return fabric

Our above solution will benefit from NumPy’s memory efficient arrays, compared to our previous solution using Python’s list type.

Next, we’ll use NumPy’s sum() routine to locate coordinates that specify values that are greater than 1.

def count_overlapping(fabric: Fabric) -> int:
  return np.sum(fabric >= MIN_OVERLAP)

Let’s see what kind of performance boost we’ll get using our new solution.

%timeit get_fabric(claims)
fabric = get_fabric(claims)
%timeit count_overlapping(fabric)

Here’s our output on a relatively recent Intel processor:

12.8 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.76 ms ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Compared to our previous solution, our new get_fabric() implementation is 22 times faster and count_overlapping() is 63 times faster.

Not bad.

Writing It in Rust

First, we can write the naive solution in Rust and cut the run time in half compared to our NumPy solution. Then, we can use the ndarray crate to further optimize our Rust code along the lines of our NumPy code.

Here’s the naive solution in Rust.

use itertools::Itertools;

const NO_CLAIM: u8 = 0;
const INCREMENT: u8 = 1;
const MIN_OVERLAP: u8 = 2;
const SIDE_LENGTH: usize = 1000;
const START: usize = 0;

struct Claim {
  x: usize,
  y: usize,
  width: usize,
  height: usize,
}

type Claims = [Claim];
type Fabric = Vec<Vec<u8>>;

fn get_fabric(claims: &Claims) -> Fabric {
  let mut fabric = 
    vec![vec![NO_CLAIM; SIDE_LENGTH]; SIDE_LENGTH];

  for &Claim { x, y, width, height } in claims {
    let x_indices = x..(x + width);
    let y_indices = y..(y + height);

    for (x, y) in x_indices.cartesian_product(y_indices)  {
      fabric[x][y] += INCREMENT;
    }
  }

  fabric
}

fn count_overlapping(fabric: &Fabric) -> usize {
  fabric.iter()
    .flatten()
    .filter(|&&item| item >= MIN_OVERLAP)
    .count()
}

fn main() {
  let claims: Claims; //snip
  let fabric = get_fabric(&claims);
  let overlapping = count_overlapping(&fabric);
}

We can cut the run time in half again with the rough translation of the NumPy code below, for a total run time of roughly 3ms.

use ndarray::prelude::*;
use ndarray::{Array, Ix2};

type Fabric = Array<u8, Ix2>;

fn get_fabric(claims: &Claims) -> Fabric { 
  let mut fabric = 
    Array::zeros((SIDE_LENGTH, SIDE_LENGTH).f());

  for &Claim { x, y, width, height } in claims {
    let x_indices = x..(x + width);
    let y_indices = y..(y + height);

    for pt in x_indices.cartesian_product(y_indices) {
      fabric[pt] += INCREMENT;
    }
  }

  fabric
}

fn count_overlapping(fabric: &Fabric) -> usize {
  fabric.iter()
    .filter(|&&item| item >= MIN_OVERLAP)
    .count()
}