Python Lists Vs. NumPy Arrays: A Deep Dive into Memory Layout and Performance Benefits

Exploring allocation differences and efficiency gains

Python Lists Vs. NumPy Arrays: A Deep Dive into Memory Layout and Performance Benefits
Data in NumPy arrays are arranged as compactly as books on a shelf. Photo by Eliabe Costa on Unsplash

In this article, we will delve into the memory design differences between native Python lists and NumPy arrays, revealing why NumPy can provide better performance in many cases.

We will compare data structures, memory allocation, and access methods, showcasing the power of NumPy arrays.


Introduction

Imagine you are preparing to go to the library to find a book. Now, you discover that the library has two shelves:

The first shelf is filled with various exquisite boxes, some containing CDs, some containing pictures, and others containing books. Only the name of the item is attached to the box.

This represents native Python lists, where each element has its memory space and type information.

However, this approach has a problem: many empty spaces in the boxes, wasting shelf space. Moreover, when you want to find a specific book, you must look inside each box, which takes extra time.

Now let’s look at the second shelf. This time there are no boxes; books, CDs, and pictures are all compactly placed together according to their categories.

This is NumPy arrays, which store data in memory in a continuous fashion, improving space utilization.

Since the items are all grouped by category, you can quickly find a book without having to search through many boxes. This is why NumPy arrays are faster than native Python lists in many operations.


Python Lists: A Flexible but Less Efficient Solution

Everything in Python is an object

Let’s start with the Python interpreter: although CPython is written in C, Python variables are not basic data types in C, but rather C structures that contain values and additional information.

Take a Python integer x = 10_000 as an example, x is not a basic type on the stack. Instead, it is a pointer to a memory heap object.

If you delve into the source code of Python 3.10, you’ll find that the C structure that x points to is as shown in the following figure:

Python integer vs. C native integer.
Python integer vs. C native integer. Image by Author.

The PyObject_HEAD contains information such as reference count, type information, and object size.

Python lists are objects containing a series of objects

From this, we can deduce that a Python list is also an object, except that it contains pointers to other objects.

We can create a list that contains only integers:

integer_list = [1, 2, 3, 4, 5]

We can also create a list that includes multiple object types:

mixed_list = [1, "hello", 3.14, [1, 2, 3]]

Pros and cons of Python lists

As we can see, Python lists contain a series of pointer objects. These pointers, in turn, point to other objects in memory.

The advantage of this approach is flexibility. You can put any object in a Python list without worrying about type errors.

However, the downside is also evident:

Python lists contain a series of pointer objects.
Python lists contain a series of pointer objects. Image by Author

The objects pointed to by each pointer are scattered in memory. When you traverse a Python list, you need to look up the memory location of each object based on the pointer, resulting in lower performance.


NumPy Arrays: A Contiguous Memory Layout for Enhanced Performance

Next, let’s explore the components and arrangement of NumPy arrays, and how it benefits cache locality and vectorization.

NumPy arrays: structure and memory layout

According to NumPy’s internal description, NumPy arrays consist of two parts:

  1. One part stores the metadata of the NumPy array, which describes the data type, array shape, etc.
  2. The other part is the data buffer, which stores the values of array elements in a tightly packed arrangement in memory.
NumPy arrays: structure and memory layout.
NumPy arrays: structure and memory layout. Image by Author

Memory layout of NumPy arrays

When we observe the .flags attribute of a ndarray, we find that it includes:

 In 1:  np_array = np.arange(6).reshape(2, 3, order='C')
        np_array.flags

Out 1:  C_CONTIGUOUS : True
        F_CONTIGUOUS : False
        OWNDATA : False
        WRITEABLE : True
        ALIGNED : True
        WRITEBACKIFCOPY : False
  • C_CONTIGUOUS, which indicates whether the data can be read using row-major order.
  • F_CONTIGUOUS, which indicates whether the data can be read using column-major order.

Row-major order is the data arrangement used by the C language, denoted by order=’C’. It means that data is stored row by row.

Column-major order, on the other hand, is used by FORTRAN, denoted by order=’F’, and stores data column by column.

Memory layout of NumPy arrays.
Memory layout of NumPy arrays. Image by Author

Advantages of NumPy’s memory layout

Since ndarray is designed for matrix operations, all its data types are identical, with the same byte size and interpretation.

This allows data to be tightly packed together, bringing advantages in cache locality and vectorized computation.


Cache Locality: How NumPy’s Memory Layout Improves Cache Utilization

What is CPU cache

NumPy’s contiguous memory layout helps improve cache hit rates because it matches how CPU caches work. To better explain this, let’s first understand the basic concept of CPU cache.

A CPU cache is a small, high-speed storage area between the CPU and main memory (RAM). The purpose of the CPU cache is to speed up data access in memory.

A CPU cache is a small, high-speed storage area between the CPU and main memory (RAM).
A CPU cache is a small, high-speed storage area between the CPU and main memory (RAM). Image by Author

When the CPU needs to read or write data, it first checks if it is already in the cache.

The CPU can read directly from the cache if the required data is in the cache (cache hit). If the data is not present (cache miss), the CPU loads the data from RAM and stores it in the cache for future use.

CPU caches are usually organized in cache lines, which are contiguous memory addresses. When the CPU accesses RAM, the cache loads the entire cache line into the high-speed cache.

This means that if the CPU accesses neighboring memory addresses, subsequent accesses are more likely to hit the cache after loading a cache line, thus improving performance.

When the CPU accesses RAM, the cache loads the entire cache line into the high-speed cache.
When the CPU accesses RAM, the cache loads the entire cache line into the high-speed cache. Image by Author

How NumPy utilizes cache

NumPy’s contiguous memory layout takes advantage of this fact.

NumPy arrays store data in continuous memory addresses, which helps improve cache locality.

When accessing an element in the array, the entire cache line (containing neighboring array elements) is loaded into the cache.

As you traverse the array, you access each element in a sequence. Because array elements are stored contiguously in memory, cache hits are more likely during the traversal, improving performance.

This is similar to going to the library to read a book. Instead of just grabbing the book you need, you also take out related books and place them on the table.

This way, when you need to consult associated materials, they’re easily accessible and more efficient than searching the shelves.


Vectorization: Unleashing the Power of NumPy’s Memory Layout

What is vectorization

Vectorization is a technique that leverages the Single Instruction Multiple Data (SIMD) features of CPUs or GPUs to perform multiple data operations simultaneously.

Vectorized operations can significantly improve code execution efficiency by simultaneously processing multiple data items.

NumPy’s contiguous memory layout facilitates vectorized operations.

Why is vectorization suitable

Consider yourself a delivery person who must deliver packages to various households daily.

Suppose the packages are arranged sequentially in the vehicle, and the houses are numbered along the street. In that case, the delivery person can efficiently deliver packages along the street in order.

This efficient method is analogous to NumPy’s memory layout, which brings the following benefits in vectorization:

  • Data alignment: NumPy arrays’ contiguous memory layout ensures that data is aligned in memory in a vectorization-friendly manner. This allows the CPU to efficiently load and process data in NumPy.
  • Sequential access pattern: NumPy’s tightly packed data in memory helps improve vectorization performance. The sequential access pattern also takes full advantage of CPU cache and prefetching, reducing memory access latency.
  • Simplified code: NumPy offers a range of functions (e.g., np.add, np.multiply) and operations (e.g., array slicing) that automatically handle vectorized operations. You can write concise and efficient code without worrying about the underlying implementation.

Copies and Views: NumPy’s Excellent Design for Performance Optimization

In the earlier discussion, we discussed how NumPy leverages its contiguous memory layout to achieve performance advantages.

Now, let’s discuss how NumPy gains performance benefits through copies and views.

What are copies and views

Copies and views are two options that define the relationship between existing data and the original array. Based on the characteristics of these two options, they can be summarized as follows:

  • Copy: Uses a different memory space than the original array, but the data content is the same.
  • View: References the same memory address as the original array.
A copy may have multiple views.
A copy may have multiple views. Image by Author

If we compare this to a book, a view is like a bookmark placed in the book, without creating a copy of the book.

On the other hand, a copy is a duplicate of the original book, containing a complete replica, including text and images. When you add notes to this copy, the original book remains unaffected.

Making good use of both features

Utilizing the characteristics of views and copies can help us write concise and efficient code.

Let’s take arithmetic operations as an example. A new copy will be created if you use a = a + 1. However, if you use a += 1 or np.add, broadcasting is applied, and the addition is performed directly on the original array.

Consider the following code, which measures the execution time:

import numpy as np
import time


def test():
    x = np.ones(100_000_000, dtype='int8')
    y = np.ones(100_000_000, dtype='int8')

    start = time.monotonic()
    for _ in range(100):
        x = x * 4 + y * 3
        x = np.ones(100_000_000, dtype='int8')
    print(f'x = x * 4 + y * 3: needs {time.monotonic() - start:.2f} seconds.')

    start = time.monotonic()
    for _ in range(100):
        x *= 4
        x += y * 3
        x = np.ones(100_000_000, dtype='int8')
    print(f'x *= 4; x += y * 3: needs {time.monotonic() - start:.2f} seconds.')

    start = time.monotonic()
    for _ in range(100):
        np.multiply(x, 4, out=x)
        np.multiply(y, 3, out=y)
        np.add(x, y, out=x)
        x = np.ones(100_000_000, dtype='int8')
        y = np.ones(100_000_000, dtype='int8')
    print(f'use functions needs {time.monotonic() - start:.2f} seconds.')


if __name__ == "__main__":
    test()

Executing the above code will yield a result similar to the following:

Using views for calculations takes less time.
Using views for calculations takes less time. Screenshot by Author

As we can see from the results, using views for calculations takes less time.

Distinguishing between copies and views

Confirming whether the result is a view or a copy every time a calculation is performed would require much effort.

However, there’s a more straightforward way to verify this:

  • Use may_share_memory to determine if the two arrays in the argument refer to the same memory space. This judgment may not be strict. True doesn’t necessarily mean the arrays are shared, but False confirms that the arrays are definitely not shared.
  • If you need a more accurate answer, you can use the share_memory function. However, this function takes longer to execute than may_share_memory.

Conclusion

In summary, we have learned about the differences in memory arrangement between NumPy arrays and native Python lists.

Due to the contiguous arrangement of the same data type in NumPy’s array, significant performance advantages are achieved in both Cache Locality and Vectorization.

Separating views and copies in NumPy’s design provides greater flexibility for code execution performance and memory management.

In the upcoming series of articles, I will start from the basics and reiterate the best practices for data science at work. If you have any suggestions or questions, please feel free to comment, and I will address them individually.


🎉
Enjoyed this read? Subscribe now to get more cutting-edge data science tips straight to your inbox! Your feedback and questions are welcome—let's discuss in the comments below!