If you use Numpy for your everyday number crunching and you’ve never heard about views, you are not as effective as you could be. With this post, I want to help you take your Numpy skills to the next level.
Your mental model of what arrays are in Numpy is probably quite different from the underlying reality. To get you to the paradigm shift, we need a problem that can bring us down the rabbit hole. This is the windowing problem:
I have an input time series of observations. I want to rearrange it into windows of some size and apply a function to each window, obtaining an output array of values. How can I do this fast and with a small memory footprint using Numpy?
Or – more visually – how to use Numpy to do something like the following, while keeping fast and small?
Two naive solutions to windowing end up being either too slow or too memory-hungry.
Specifically, iterating window-by-window over the original array with a pure Python for loop is going to be slow, while creating a new windowed array from the old one requires up to window-size times the amount of memory used by the old array. Both of these approaches could work for a one-off script on a small dataset, but they become unfeasible as soon as the data size increases, or our data transformation needs to be called repeatedly in some data processing pipeline.
Data views in numpy
A numpy array is a pointer to a contiguous chunk of memory, bundled with information on how to interpret that memory. It turns out that you can modify this information to reinterpret the same contiguous chunk of memory, and see it in a different way. When you do that, you are creating a view. 1 The original array is then akin to a “canonical” interpretation of the underlying memory, and it owns that memory. A view reinterprets the canon, but it doesn’t own its memory – it shares it with the original array.
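A minimal sketch of this sharing, using a slice (one common way to create a view):

```python
import numpy as np

arr = np.arange(4)       # owns its memory
view = arr[1:3]          # shares arr's memory, owns nothing

view[0] = 99             # writing through the view...
print(arr)               # ...changes the original: [ 0 99  2  3]
print(view.base is arr)  # True: the view's memory belongs to arr
```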
The interpretation information is contained in a few key attributes:
- data: a pointer to the chunk of memory
- dtype: the data type. Is each byte in the memory chunk representing a boolean, an 8-bit integer, or is it perhaps part of the 8 bytes forming a 64-bit floating point number?
- itemsize: number of bytes needed to represent a single element, e.g. 8 for a 64-bit float. This is included for clarity, as it is actually completely defined by dtype.
- shape: the size of the array in each of the dimensions. For example, a 2D array with 2 rows and 4 columns would have shape (2, 4).
- strides: for each dimension, the number of bytes to jump ahead to get to the next element in that dimension. More about this later.
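All of these attributes can be inspected directly on any array; for instance:

```python
import numpy as np

arr = np.zeros((2, 4), dtype=np.int8)
print(arr.dtype)     # int8
print(arr.itemsize)  # 1 byte per element
print(arr.shape)     # (2, 4)
print(arr.strides)   # (4, 1): 4 bytes to the next row, 1 to the next column
```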
Here’s an example of how this information comes together to form an interpretation of some memory chunk:
By modifying the interpretation information bundled with every numpy array, we can keep the same underlying memory contents, yet
- see the memory as containing one type of data or another (e.g. integers, floats, booleans, or arbitrary python objects);
- decide how to move through this memory, ignoring some parts of it and only accessing others.
Numpy’s view approach is extremely powerful. To illustrate why, say I have a 1GB array big_arr, and I want to create another array smaller_arr from it, containing all elements in the second half of big_arr. By creating a view of the original big_arr, numpy will just reuse the memory of big_arr instead of copying half of its contents to some newly allocated 500MB memory chunk.
>>> bytes_in_gigabyte = int(1e9)
>>> # This increases RAM usage by 1GB (check with a RAM monitor)
>>> big_arr = np.ones(bytes_in_gigabyte, dtype=np.int8)
>>> # No big increase in RAM for smaller_arr
>>> smaller_arr = big_arr[big_arr.size // 2:]
This is interesting and all, but – you may be wondering – what does this have to do with windowing? Let’s talk about
Moving through memory
As I wrote before, a numpy array contains information about how to move through the memory chunk it points to. The strides tuple answers the question: “If I am moving along dimension i, how many bytes of memory do I have to skip to get to the next element?”. For example, for the array of 1-byte integers in the figure, with 2 rows and 4 columns, strides would be equal to (4, 1): if I am moving along dimension 0 (i.e. along rows), I need to skip 4 bytes of data to get to the next element, and 1 byte if I am moving along dimension 1 (i.e. along columns). An animation should help clarify the concept:
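One easy way to see strides at work is transposition, which simply swaps the strides instead of moving any data:

```python
import numpy as np

arr = np.arange(8, dtype=np.int8).reshape(2, 4)
print(arr.strides)                   # (4, 1)
print(arr.T.strides)                 # (1, 4): same bytes, walked column-first
print(np.shares_memory(arr, arr.T))  # True: no data was copied
```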
We can modify strides to get a memory-efficient windowed view of an array, and be able to apply fast numerical transformations to it. This is because we can rely on Numpy’s fast vectorization to apply functions to our windowed array, rather than having to use slow pure Python.
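For example, once an array is organized in windows (built explicitly here for clarity), a per-window reduction becomes a single vectorized call:

```python
import numpy as np

# A small windowed array: one row per window
windows = np.array([[1, 3], [3, 3], [3, 7]], dtype=np.int8)

print(windows.max(axis=1))  # one max per window: [3 3 7]
print(windows.sum(axis=1))  # one sum per window: [ 4  6 10]
```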
Windowing with strides
Let’s take the chunk of memory in the previous figure, interpreted as a sequence of 1-byte integers:
[1, 3, 3, 7, 8, 0, 0, 8]
Say we want to window this such that each window contains 2 elements, with the windowing advancing by one element at a time; this would result in the following array:
[[1, 3], [3, 3], [3, 7], [7, 8], [8, 0], [0, 0], [0, 8]]
If we were to create the windowed array from scratch, we would end up with an array roughly 2 times as big as the original one. As I wrote before, this is not practical when one has a big array to begin with, and the problem only grows worse as the windowing size increases.
Luckily, we can manipulate the strides and shape information to move intelligently through the original chunk of memory, simulating windowing without the memory cost. Remember, strides contains, for each dimension, the number of bytes to jump ahead to get to the next element in that dimension.
For the chunk of memory in the example, we can iterate over it as if it were windowed by setting its shape to (7, 2) and specifying that the first dimension always advances by one element at a time (one byte, in the case of an int8). This way, every time we finish looking at a row (a window), we advance by one byte from the beginning of the current row, and see the next pair of elements as the next row.
Interestingly, Numpy won’t just let us change strides and shape without putting up a fight, so we need to use the as_strided function from numpy.lib.stride_tricks, which circumvents numpy’s safety checks to create an array on the original data with our desired shape and strides.
>>> import numpy as np
>>> arr = np.array([1, 3, 3, 7, 8, 0, 0, 8], dtype=np.int8)
>>> np.lib.stride_tricks.as_strided(arr, strides=(1, 1), shape=(7, 2))
array([[1, 3],
       [3, 3],
       [3, 7],
       [7, 8],
       [8, 0],
       [0, 0],
       [0, 8]], dtype=int8)
A word of warning: by bypassing numpy’s checks on shape and strides, we can do some serious damage. For example, what happens when the new shape makes the view “spill over” the original memory chunk?
>>> np.lib.stride_tricks.as_strided(arr, strides=(1, 1), shape=(10, 2))
array([[   1,    3],
       [   3,    3],
       [   3,    7],
       [   7,    8],
       [   8,    0],
       [   0,    0],
       [   0,    8],
       [   8,  -16],
       [ -16, -108],
       [-108,   30]], dtype=int8)
That’s strange. What are those weird values starting from the 8th row? We never defined those. Those values come from memory we never allocated for our original array, and that we were not supposed to access. Who knows who is using it and what it actually contains.
To avoid getting garbage in our arrays and messing with memory that is not ours to mess with, we must be sure to compute shapes and strides properly while windowing.
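As a sanity check, the "no spilling" condition can be stated in a few lines of code. The helper below is purely illustrative (it is not part of numpy): it checks that the last byte a strided view would touch still lies inside the original buffer.

```python
import numpy as np

def view_fits(arr, shape, strides):
    # Hypothetical helper: byte offset of the last element the view can reach
    last_byte = sum((dim - 1) * stride for dim, stride in zip(shape, strides))
    # The last element's bytes must end within the original buffer
    return last_byte + arr.itemsize <= arr.nbytes

arr = np.array([1, 3, 3, 7, 8, 0, 0, 8], dtype=np.int8)
print(view_fits(arr, shape=(7, 2), strides=(1, 1)))   # True: safe
print(view_fits(arr, shape=(10, 2), strides=(1, 1)))  # False: spills over
```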
Efficient time series windowing in Numpy
The following function contains all the needed logic to make sure we can window efficiently and safely:
from __future__ import division

import numpy as np


def sliding_window(arr, window_size, step=0):
    """Assuming a time series with time advancing along dimension 0,
    window the time series with the given size and step.

    :param arr: input array.
    :type arr: numpy.ndarray
    :param window_size: size of sliding window.
    :type window_size: int
    :param step: step size of sliding window. If 0, step size is set to
        obtain non-overlapping contiguous windows (that is,
        step=window_size). Defaults to 0.
    :type step: int

    :return: array
    :rtype: numpy.ndarray
    """
    n_obs = arr.shape[0]

    # validate arguments
    if window_size > n_obs:
        raise ValueError(
            "Window size must be less than or equal to "
            "the size of the array in its first dimension."
        )
    if step < 0:
        raise ValueError("Step must be positive.")
    if step == 0:
        step = window_size

    n_windows = 1 + int(np.floor((n_obs - window_size) / step))

    obs_stride = arr.strides[0]
    windowed_row_stride = obs_stride * step

    new_shape = (n_windows, window_size) + arr.shape[1:]
    new_strides = (windowed_row_stride,) + arr.strides

    strided = np.lib.stride_tricks.as_strided(
        arr,
        shape=new_shape,
        strides=new_strides,
        writeable=False,
    )
    return strided
Let’s go through this bit by bit.
The first few lines are not particularly interesting: we just validate the passed arguments to make sure they make sense, and resolve the step=0 default to non-overlapping windows.

n_obs = arr.shape[0]

# validate arguments
if window_size > n_obs:
    raise ValueError(
        "Window size must be less than or equal to "
        "the size of the array in its first dimension."
    )
if step < 0:
    raise ValueError("Step must be positive.")
if step == 0:
    step = window_size
Subsequently, we count the number of windows that the windowed array will end up containing.
n_windows = 1 + int(np.floor((n_obs - window_size) / step))
To understand the equation, think about it this way: we place a window at the beginning of the array. How many times can we slide this window forward by step before it goes over the array bounds? That’s equal to the number of times we can fit a step in n_obs - window_size, i.e. floor((n_obs - window_size) / step); adding 1 accounts for the initial window position. For a more visual intuition:
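Plugging in the numbers from the running example (8 observations, windows of 2, step of 1):

```python
n_obs, window_size, step = 8, 2, 1

# floor division implements floor((n_obs - window_size) / step)
n_windows = 1 + (n_obs - window_size) // step
print(n_windows)  # 7, matching the 7 rows of the windowed array above
```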
We are finally ready to create the windowed view. Whereas the old array was organized as a sequence of observations, the view is organized as a sequence of windows, each containing window_size observations. We need to update the shape metadata to reflect this change in dimensionality, as well as strides, to let Numpy know how many bytes it needs to skip to go from one window to the next.
obs_stride = arr.strides[0]
windowed_row_stride = obs_stride * step

new_shape = (n_windows, window_size) + arr.shape[1:]
new_strides = (windowed_row_stride,) + arr.strides

strided = np.lib.stride_tricks.as_strided(
    arr,
    shape=new_shape,
    strides=new_strides,
    writeable=False,
)
return strided
And we are done. Let’s see it in action.
>>> big_arr = np.ones(int(1e9), dtype=np.int8)
big_arr is an array of size ~1GB.
>>> sl_arr = sliding_window(big_arr, window_size=1000, step=1)
>>> out = np.empty(sl_arr.shape[0], dtype=np.int8)
>>> np.min(sl_arr, axis=1, out=out)
If we were to actually create the windowed array from scratch, we would need ~1TB worth of memory to accommodate it. Thanks to the power of views, the new array only needs enough memory to hold the interpretation information (roughly ten orders of magnitude less), and is created instantly.
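We can confirm that a strided view costs (almost) nothing extra by comparing the bookkeeping on a smaller example:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

arr = np.zeros(1000, dtype=np.int8)
# 991 windows of 10 bytes each, advancing one byte at a time
windowed = as_strided(arr, shape=(991, 10), strides=(1, 1))

print(np.shares_memory(arr, windowed))  # True: no new data buffer
print(windowed.nbytes)                  # 9910 "logical" bytes...
print(arr.nbytes)                       # ...backed by just 1000 real ones
```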
Numpy is great at hiding unnecessary complexity (such as strides or memory bounds) from day-to-day work. Yet, from time to time, it is good to open the clock and see what makes it tick. Windowing is the problem that had me open the clock. In this post, I used the windowing problem to abuse Numpy’s views a little and show you their inner workings, the same way I learned about them. Hope you enjoyed it.
In case you came here through a specific search for efficient windowing in Numpy, I can point you to some other useful resources.
Throughout this post, windowing was only applied to the first dimension of an array. This is because I assumed a time series as input, where iteration over subsequent observations is normally done on the first array dimension. That is, array[i] contains the value of some measurement(s) at time i. If time series are not where you spend most of your number crunching, then your windowing needs might be different. You might want to have a look at more general windowing implementations, such as this one or this one.
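It is also worth noting that recent NumPy versions (1.20 and later) ship a safe, general-purpose implementation, numpy.lib.stride_tricks.sliding_window_view, which handles the shape and stride bookkeeping for you:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.array([1, 3, 3, 7, 8, 0, 0, 8], dtype=np.int8)
windows = sliding_window_view(arr, window_shape=2)  # step is always 1
print(windows.shape)  # (7, 2)
print(windows[0])     # [1 3]
```

To window with a step larger than 1, slice the result afterwards, e.g. `windows[::step]`.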
If you enjoyed this post, I’d be grateful if you’d help it spread by sharing it with your fellow data crunchers and numeric Python enthusiasts. I’d also love to read what you think, so go on and make a blogger happy – drop me a comment!
PYTHON · NUMPY · DATA WRANGLING · PROGRAMMING