
# Optimization with NumPy and Rust

This is a quick post about optimizing algorithms written in Python with NumPy, and implementing the same code in Rust. We're going to take a look at one of the Advent of Code challenges, solve it, optimize it with NumPy, and then write both of our solutions in Rust.

If you're unfamiliar with Advent of Code, it's an advent calendar of programming-related puzzles that has run each December since 2015. Every day from December 1st until the 25th, a new challenge is posted, and you're encouraged to submit a solution.

## Analyzing the Challenge

Let's take a look at AoC 2018's Day 3 challenge.

This problem requires us to count the number of overlapping claims in a given area. That area, which is the fabric in the challenge, can be represented by a 2-dimensional array of zeros in a naive solution to this problem.

Every item in the 2-dimensional array will keep a count of how many claims take ownership of that item.

Each claim can be thought of as another smaller 2-dimensional array of ones that we add to the larger 2-dimensional array. When we are finished, we can count how many items in the larger 2-dimensional array are greater than 1. Those items will represent the square inches of fabric that belong to more than one claim.
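As a minimal illustration (using a tiny fabric rather than the challenge input), consider a 4×4 fabric with two 2×2 claims that share a single square inch:

```python
# a 4x4 fabric where every square inch starts with zero claims
fabric = [[0] * 4 for _ in range(4)]

# claim A covers rows 0-1, columns 0-1; claim B covers rows 1-2, columns 1-2
for x in range(0, 2):
    for y in range(0, 2):
        fabric[x][y] += 1
for x in range(1, 3):
    for y in range(1, 3):
        fabric[x][y] += 1

# count the square inches claimed more than once
overlapping = sum(1 for row in fabric for item in row if item >= 2)
print(overlapping)  # only the square at (1, 1) belongs to both claims -> 1
```

The two claims intersect only at position (1, 1), which ends up with a count of 2, so exactly one square inch overlaps.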

## Pure Python Solution

Here's a naive solution written in Python. While in the above analysis I said that we can think of each claim as being a 2-dimensional array of ones that we'll add to the greater 2-dimensional array, our implementation will differ.

We'll iterate over the Cartesian coordinates of each claim and add `1` to the corresponding item in the greater 2-dimensional array.

```
from __future__ import annotations

from itertools import product
from typing import NamedTuple

NO_CLAIM: int = 0
INCREMENT: int = 1
MIN_OVERLAP: int = 2
SIDE_LENGTH: int = 1050


class Claim(NamedTuple):
    x: int
    y: int
    width: int
    height: int


Claims = list[Claim]
Fabric = list[list[int]]


def get_fabric(claims: Claims, length: int = SIDE_LENGTH) -> Fabric:
    # build each row separately; `[[NO_CLAIM] * length] * length` would
    # alias one row object `length` times, corrupting the counts
    fabric = [[NO_CLAIM] * length for _ in range(length)]
    for x, y, width, height in claims:
        x_indices = range(x, x + width)
        y_indices = range(y, y + height)
        for i, j in product(x_indices, y_indices):
            fabric[i][j] += INCREMENT
    return fabric


def count_overlapping(fabric: Fabric) -> int:
    return sum(
        INCREMENT
        for row in fabric
        for item in row
        if item >= MIN_OVERLAP
    )


# generate list of claims from challenge input file
claims: Claims = ...

# let's profile our solution
%timeit get_fabric(claims)
fabric = get_fabric(claims)
%timeit count_overlapping(fabric)
```
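For completeness, each line of the challenge input looks like `#123 @ 3,2: 5x4` (claim id, left offset, top offset, width, height). The original post elides the parsing step, but one possible sketch (the regular expression and `parse_claims` helper here are my own, not from the post) looks like this:

```python
import re
from typing import NamedTuple

class Claim(NamedTuple):
    x: int
    y: int
    width: int
    height: int

# matches lines such as "#123 @ 3,2: 5x4"; the claim id is not needed
CLAIM_PATTERN = re.compile(r"#\d+ @ (\d+),(\d+): (\d+)x(\d+)")

def parse_claims(text: str) -> list[Claim]:
    return [
        Claim(*map(int, match.groups()))
        for match in CLAIM_PATTERN.finditer(text)
    ]

claims = parse_claims("#1 @ 3,2: 5x4\n#2 @ 1,1: 2x2")
print(claims[0])  # Claim(x=3, y=2, width=5, height=4)
```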

Here's our output on an Intel processor:

```
287 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

It took over 400 ms in total to process roughly one thousand claims. We can do better than that.

## Optimization with NumPy

### Vectorization

NumPy has a concept called universal functions, or `ufuncs`, which are vectorized functions over `ndarrays`. Vectorized functions work over entire arrays and, if the underlying hardware supports it, can take advantage of SIMD instructions.
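As a small illustration, a `ufunc` like `np.add()` applies its operation to every element in a single call, where plain Python would need an explicit loop:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# one vectorized call instead of a Python-level loop over every element
total = np.add(a, b)
print(total)  # [11 22 33 44]
```

The same `ufunc` is what runs under the hood when we write `a + b` or `a += b` on NumPy arrays.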

SIMD stands for **s**ingle **i**nstruction, **m**ultiple **d**ata. SIMD instructions allow us to apply an instruction over multiple points of data at the same time. They implement parallelism at the data level.

##### SIMD diagram created by Colin M.L. Burnett for Wikipedia

Examples of SIMD instruction sets are MMX, SSE/SSE2/SSE3/SSE4, and AVX on x86 processors, and NEON on ARM processors.

When all that's said and done, there's no guarantee that NumPy will use SIMD instructions in our program. However, we'll still benefit from a performance boost, because NumPy is built upon natively compiled and architecturally efficient data structures and operations.

I have several clustered ARMv7 machines. In the past, NumPy wasn't optimized for ARM's NEON instruction set in the same way that it's optimized for x86 processors with SSE and AVX instructions. GCC's auto-vectorization could still take advantage of NEON when compiling NumPy, although in a more generalized fashion.

Then, in late 2019, NumPy received some new NEON optimizations.

Even when NumPy wasn't optimized for NEON, I still saw a performance increase when using NumPy instead of pure Python.

### Optimizing Our Solution

In our previous solution, `get_fabric()` and `count_overlapping()` are sequential functions. That is to say, we apply an operation to each item in the array sequentially: we increment the first element in the array, then the next element, and so on.

To optimize our solution, we'll need to use something beyond just pure Python.

NumPy is a Python wrapper over lower-level C and Fortran code that implements optimized and natively compiled routines, so we'll use that.

Here, we'll use NumPy's `add()` vectorized `ufunc` (via the `+=` operator on a slice) instead of sequentially adding one to each item in the `fabric` array.

```
import numpy as np

Fabric = np.ndarray


def get_fabric(claims: Claims, length: int = SIDE_LENGTH) -> Fabric:
    # np.int8 assumes no square inch is claimed more than 127 times
    fabric = np.zeros((length, length), dtype=np.int8)
    for x, y, width, height in claims:
        x_indices = slice(x, x + width)
        y_indices = slice(y, y + height)
        indices = x_indices, y_indices
        fabric[indices] += INCREMENT
    return fabric
```

Our above solution will benefit from NumPy's memory-efficient arrays, compared to our previous solution using Python's `list` type.
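Slice indexing is what lets one vectorized `+=` touch a whole rectangular region per claim, instead of one Python-level loop iteration per square inch. A small sketch of the idea, reusing the tiny two-claim example rather than the challenge input:

```python
import numpy as np

fabric = np.zeros((4, 4), dtype=np.int8)
fabric[0:2, 0:2] += 1  # first 2x2 claim
fabric[1:3, 1:3] += 1  # second 2x2 claim, overlapping at (1, 1)

overlap = int((fabric >= 2).sum())
print(overlap)  # 1
```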

Next, we'll use NumPy's `sum()` routine to count the positions whose values are greater than 1.

```
def count_overlapping(fabric: Fabric) -> int:
    # the boolean comparison yields an array of 0s and 1s to sum;
    # the int() cast converts NumPy's scalar back to a Python int
    return int(np.sum(fabric >= MIN_OVERLAP))
```
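This works because the comparison produces a boolean array, and `np.sum()` treats `True` as 1 and `False` as 0, so the sum is exactly the count of overlapping squares:

```python
import numpy as np

values = np.array([[0, 1], [2, 3]])
mask = values >= 2  # boolean array: two elements (2 and 3) meet the threshold
print(int(mask.sum()))  # 2
```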

Let's see what kind of performance boost we'll get using our new solution.

```
%timeit get_fabric(claims)
fabric = get_fabric(claims)
%timeit count_overlapping(fabric)
```

Here's our output on a relatively recent Intel processor:

```
12.8 ms ± 883 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.76 ms ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Compared to our previous solution, our new `get_fabric()` implementation is *22 times faster* (287 ms / 12.8 ms) and `count_overlapping()` is roughly *71 times faster* (125 ms / 1.76 ms).

Not bad.

## Writing It in Rust

First, we can write the naive solution in Rust and cut the run time in half compared to our NumPy solution. Then, we can use the `ndarray` crate to further optimize our Rust code along the lines of our NumPy code.

Here's the naive solution in Rust.

```
use itertools::Itertools;

const NO_CLAIM: u8 = 0;
const INCREMENT: u8 = 1;
const MIN_OVERLAP: u8 = 2;
const SIDE_LENGTH: usize = 1000;

struct Claim {
    x: usize,
    y: usize,
    width: usize,
    height: usize,
}

type Claims = [Claim];
type Fabric = Vec<Vec<u8>>;

fn get_fabric(claims: &Claims) -> Fabric {
    let mut fabric =
        vec![vec![NO_CLAIM; SIDE_LENGTH]; SIDE_LENGTH];
    for &Claim { x, y, width, height } in claims {
        let x_indices = x..(x + width);
        let y_indices = y..(y + height);
        for (x, y) in x_indices.cartesian_product(y_indices) {
            fabric[x][y] += INCREMENT;
        }
    }
    fabric
}

fn count_overlapping(fabric: &Fabric) -> usize {
    fabric.iter()
        .flatten()
        .filter(|&&item| item >= MIN_OVERLAP)
        .count()
}

fn main() {
    // `Claims` is an unsized slice type, so the local needs an owned Vec
    let claims: Vec<Claim> = todo!(); // snip
    let fabric = get_fabric(&claims);
    let overlapping = count_overlapping(&fabric);
    println!("{overlapping}");
}
```

We can cut the run time in half *again* with the rough translation of the NumPy code below, for a total run time of roughly 3 ms.

```
use ndarray::prelude::*;
use ndarray::{Array, Ix2};

type Fabric = Array<u8, Ix2>;

fn get_fabric(claims: &Claims) -> Fabric {
    let mut fabric =
        Array::zeros((SIDE_LENGTH, SIDE_LENGTH).f());
    for &Claim { x, y, width, height } in claims {
        let x_indices = x..(x + width);
        let y_indices = y..(y + height);
        for pt in x_indices.cartesian_product(y_indices) {
            fabric[pt] += INCREMENT;
        }
    }
    fabric
}

fn count_overlapping(fabric: &Fabric) -> usize {
    fabric.iter()
        .filter(|&&item| item >= MIN_OVERLAP)
        .count()
}
```