Arrow Integration: The Game-Changing Feature of Pandas 2.0

Are you ready to take your data analysis experience to the next level? The revolutionary Pandas 2.0 is coming! Read this article to learn about its heavyweight new features, such as Arrow integration and the introduction of Copy-on-Write, gain a deeper understanding of Pandas 2.0, and unlock the door to a new world of data analysis.

Kaisheng He
  30 March 2023

Pandas, the popular data manipulation library for Python, is about to receive a major update: version 2.0. Looking back at its history, it took more than ten years for Pandas to go from version 0.1 to version 1.0, which was released in 2020. Three years later, another major release has arrived.

If the release of version 1.0 signaled that the Pandas DataFrame API had become stable, then the latest release notes make it clear that Pandas 2.0 is a revolution in performance. This is reminiscent of the plans of Python's creator, Guido van Rossum, which include a series of PEPs focused on performance optimization; the entire Python community is striving toward the same goal.

Arrow backend

One of the most notable features of Pandas 2.0 is its integration with Apache Arrow, a unified in-memory columnar format. Before the Arrow integration, Pandas used NumPy as its memory layout: each column of data was stored as a NumPy array, and these arrays were managed internally by the BlockManager. However, NumPy itself was not designed for data structures like the DataFrame, and its support for certain data types, such as strings and missing values, is limited. In 2013, Pandas creator Wes McKinney gave a famous talk called “10 Things I Hate About Pandas”; most of the complaints were about performance, and some remain difficult to solve. Three years later, in 2016, McKinney co-created Apache Arrow, which is why the Arrow integration is the most noteworthy feature of this release: Arrow was designed to work seamlessly with tools like Pandas. Let's take a look at the improvements it brings.

Missing values

Many Pandas users have seen a column's data type change from integer to float because missing values appeared in that column, either in the original data or as the result of a calculation: Pandas automatically converts such a column to float.

In [1]: pd.Series([1, 2, 3, None])
Out[1]:
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

Prior to version 2.0, Pandas represented missing values differently for different types: np.nan for floating-point numbers, None or np.nan for object types, and pd.NaT for date-related types. Pandas 1.0 introduced pd.NA to represent missing values for integer and boolean types without forcing a type conversion, but it had to be specified manually by the user. Clearly, Pandas has long wanted to improve in this area but has struggled to do so. The introduction of Arrow solves this problem cleanly: missing values no longer need a separate implementation for each data type, and the more suitable in-memory layout removes a lot of overhead. In Pandas 2.0, an Arrow-backed type can be requested explicitly:

In [1]: df2 = pd.DataFrame({'a': [1, 2, 3, None]}, dtype='int64[pyarrow]')

In [2]: df2.dtypes
Out[2]:
a    int64[pyarrow]
dtype: object

In [3]: df2
Out[3]:
      a
0     1
1     2
2     3
3  <NA>
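
The benefit carries over to computation: with an Arrow-backed dtype, arithmetic that meets a missing value keeps the integer type instead of upcasting to float. A quick check, continuing the session above:

In [4]: df2['a'] + 1
Out[4]:
0       2
1       3
2       4
3    <NA>
Name: a, dtype: int64[pyarrow]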

String type

The inefficient handling of string data has been a long-standing criticism of Pandas. As mentioned earlier, Pandas previously used NumPy to represent data internally, but NumPy was designed primarily for numerical computation, not string processing. As a result, a column of strings in Pandas is actually an array of PyObject pointers, with the actual string data scattered across the heap. This inflates memory consumption and makes it unpredictable; a common rule of thumb is that Pandas needs about 5 to 10 times as much memory as the raw data to work smoothly. The problem only gets worse as data volumes grow.

Of course, Pandas attempted to address this issue in version 1.0 with the experimental StringDtype extension type, which later gained an Arrow-backed string implementation, paving the way for the full support in version 2.0. Arrow, as a columnar format, stores data contiguously in memory, strings included. Reading an entire column of strings no longer requires chasing pointers, which avoids all kinds of cache misses. These improvements bring significant gains in both memory usage and computation.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '2.0.0rc1'

In [3]: df = pd.read_csv('pd_test.csv')

In [4]: df.dtypes
Out[4]:
name       object
address    object
number      int64
dtype: object

In [5]: df.memory_usage(deep=True).sum()
Out[5]: 17898876

In [6]: df_arrow = pd.read_csv('pd_test.csv', dtype_backend="pyarrow", engine="pyarrow")

In [7]: df_arrow.dtypes
Out[7]:
name       string[pyarrow]
address    string[pyarrow]
number      int64[pyarrow]
dtype: object

In [8]: df_arrow.memory_usage(deep=True).sum()
Out[8]: 7298876

As we can see, even for this relatively small DataFrame, reading with the default behavior takes around 17 MB of memory, while specifying the Arrow backend reduces usage to under 7 MB. The advantage becomes even more significant on larger datasets. Memory aside, let's also look at computational performance.

In [9]: %time df.name.str.startswith('Mark').sum()
CPU times: user 21.1 ms, sys: 1.1 ms, total: 22.2 ms
Wall time: 21.3 ms
Out[9]: 687

In [10]: %time df_arrow.name.str.startswith('Mark').sum()
CPU times: user 2.56 ms, sys: 1.13 ms, total: 3.68 ms
Wall time: 2.5 ms
Out[10]: 687

Here we count the people whose names start with “Mark”: the default backend is almost ten times slower than the Arrow backend. This is an exciting result, since the more efficient memory layout in Pandas 2.0 translates directly into faster computation, and it gives us plenty of reason to look forward to the release.
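
If a DataFrame has already been loaded with the default NumPy backend, it does not need to be re-read from disk: in Pandas 2.0, convert_dtypes can convert it to Arrow-backed dtypes after the fact. A quick sketch, continuing the session above (the exact dtypes you get depend on the data):

In [11]: df_converted = df.convert_dtypes(dtype_backend="pyarrow")

In [12]: df_converted.dtypes
Out[12]:
name       string[pyarrow]
address    string[pyarrow]
number      int64[pyarrow]
dtype: object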

Copy-on-Write

Copy-on-Write (CoW) is an optimization technique commonly used in computer science. Essentially, when multiple callers request the same resource, CoW avoids making a separate copy for each one; instead, every caller holds a pointer to the same resource, and a private copy is made only when one of them modifies it. The other callers therefore continue to see the original resource unchanged.

So what does CoW have to do with Pandas? The introduction of this mechanism is not only about performance, but also about usability. Users who work with Pandas regularly have probably seen the copy function in their code, used to copy an intermediate result. In some codebases, nearly every step calls copy, and this is a consequence of how Pandas itself behaves. Pandas functions return one of two kinds of result: a copy or a view. A copy is a new DataFrame with its own memory, not shared with the original; a view shares the same data with the original DataFrame, so changes to the view also affect the original. Generally, indexing operations return views, but there are exceptions, and even a self-described Pandas expert can get this wrong, which is why calling copy manually has become the safer choice.

In [1]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [2]: subset = df["foo"]

In [3]: subset.iloc[0] = 100

In [4]: df
Out[4]:
   foo  bar
0  100    4
1    2    5
2    3    6

In the code above, subset is a view, so assigning a new value to subset changes the original df as well. If you're not aware of this, every subsequent calculation involving df could be silently wrong. To avoid problems caused by views, several Pandas functions, such as set_index, reset_index, and add_prefix, force an internal copy of the data, which in turn hurts performance. Let's look at the same operation under the CoW strategy:

In [5]: pd.options.mode.copy_on_write = True

In [6]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [7]: subset = df["foo"]

In [8]: subset.iloc[0] = 100

In [9]: df
Out[9]:
   foo  bar
0    1    4
1    2    5
2    3    6

With copy_on_write enabled, writing to subset triggers a copy, so the modification affects only subset and df is left unchanged. This behavior is more intuitive, and it also lets many functions operate in place more often, avoiding the overhead of defensive copying. The Pandas documentation explains CoW in detail, but in short: users can index into data freely without worrying about affecting the original. This systematically resolves Pandas' somewhat confusing indexing behavior while bringing significant performance improvements to many operators.
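
CoW also makes previously copy-heavy methods much cheaper: the result of a method like reset_index now shares its buffers with the original, and a real copy happens lazily only when one side is written to. A minimal sketch of the idea, with the option still enabled:

In [10]: df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

In [11]: df2 = df.reset_index(drop=True)   # no data is copied here

In [12]: df2.iloc[0, 0] = 100              # the copy happens lazily, at this write

In [13]: df                                # the original is untouched
Out[13]:
   foo  bar
0    1    4
1    2    5
2    3    6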

One more thing

When we take a closer look at Wes McKinney’s talk, “10 Things I Hate About Pandas”, we’ll find that there were actually 11 things, and the last one was “No multicore/distributed algos.”

The Pandas community has chosen to focus on improving single-machine performance for now, and from what we have seen so far, Pandas is entirely trustworthy: the Arrow integration means that competitors like Polars will no longer hold an advantage. On the other hand, the Python community has also been advancing the field of distributed computing. Xorbits Pandas, for example, has reimplemented most Pandas functions in a parallel manner, allowing Pandas to use multiple cores, multiple machines, and even GPUs to accelerate DataFrame computation. With this capability, even data at the scale of 1 terabyte can be handled with ease.

[Figure: TPC-H benchmark results at the 1 TB scale.]
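
For those who want to try it, Xorbits is designed to be a drop-in replacement at the import level. A minimal sketch based on the Xorbits documentation (the CSV path is a placeholder):

import xorbits
import xorbits.pandas as pd

# Start a local runtime; xorbits.init() can also point at an existing cluster.
xorbits.init()

# The familiar Pandas API, now executed in parallel across available cores.
df = pd.read_csv('pd_test.csv')
print(df.groupby('name').count())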

Pandas 2.0 gives us great confidence. As a framework that adopted Arrow as a storage format early on, Xorbits is well positioned to cooperate with Pandas 2.0, and we will work together to build a better DataFrame ecosystem. As a next step, we will try using Pandas with the Arrow backend to speed up Xorbits Pandas, and we will share the latest results in the next article.

Conclusion

Pandas 2.0 is an exciting update that addresses many of the most commonly cited issues with Pandas. This article has focused on the Arrow integration and the introduction of Copy-on-Write. Beyond the library itself, the emergence of distributed Pandas libraries such as Xorbits is making the whole Pandas ecosystem more complete.

