Revolutionize Text Deduplication in Large Language Models with Xorbits

This article highlights the importance of text deduplication and introduces Xorbits' scalable solution to this problem.

Peiyuan Liu
  19 July 2023


In the realm of Large Language Models (LLMs), text deduplication is integral. It enhances training efficiency while mitigating privacy issues [1][2]. LLMs rely heavily on extensive, diverse datasets, many of which are rife with duplicated sequences scraped from the internet, exacerbating data redundancy and privacy risks [3][4]. Text deduplication, by speeding up LLM training and lowering the risk of data memorization and privacy breaches [1][2][5], offers a powerful solution to these issues. Our innovation centers around the distributed text deduplication facilitated by the Xorbits framework. Much like Dask, Ray, and Modin, Xorbits allows large-scale parallel processing across multiple machines, thereby offering enhanced speed and capacity. With Xorbits for text deduplication, we can process extensive text datasets beyond the capabilities of single-machine processing at a higher efficiency, marking a significant advancement in LLM data handling.

It’s worth noting that Xorbits streamlines the scalability of data science and machine learning workloads, covering everything from data preprocessing to model serving. Moreover, it leverages multiple cores or GPUs and scales out across thousands of machines, facilitating the processing of terabytes of data or the training of large models. Xorbits’ efficient deduplication interface, dedup, fills a vital gap by offering a first-class API for text deduplication. With strong performance in both single and multi-machine environments, Xorbits greatly eases large-scale text deduplication, making it a tool of choice for those dealing with extensive language datasets.

Large Language Datasets and The Challenge of Deduplication

There are numerous substantial language datasets available in the field, including Oscar and BigCode, among others. These datasets are massive, with billions of records. The sheer size of these datasets makes the deduplication process a daunting task, especially if we are trying to do it on a single machine.

The challenge with such a large dataset is twofold: first, loading the entire dataset into memory for deduplication is not feasible due to the hardware limitations of a single machine; second, even if the dataset is small enough to fit into memory, the computational cost for deduplication could be extremely high and the process could take a prohibitively long time.

Xorbits Deduplication Interface

To tackle this problem, we developed the Xorbits deduplication interface, which uses a distributed multi-node approach for text deduplication. With this interface, we can split the dataset across multiple nodes and process them in parallel. This not only allows us to handle larger datasets, but also significantly accelerates the deduplication process.

Here’s the rough flowchart for text de-duplication:

The application of the Xorbits text de-duplication interface is straightforward, as shown below:

# To install xorbits, run `pip install xorbits`

import xorbits.pandas as pd
from xorbits.experimental import dedup

df = pd.DataFrame(...) # Assume df is your text dataset, with text content in column "text"

res = dedup(df, col="text")

For additional function parameters, please refer to the dedup API documentation.
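To see what fuzzy deduplication adds over a naive pass, compare with an exact-match baseline in plain pandas (a baseline sketch with made-up data, not the Xorbits API): only byte-identical rows are removed, while near-duplicates survive.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "hello world",
    "hello world",
    "hello, world!",            # near-duplicate: survives an exact-match pass
    "something else entirely",
]})

# Exact-match baseline: only byte-identical rows are dropped.
exact = df.drop_duplicates(subset="text")
```

A fuzzy MinHash-based dedup would additionally cluster "hello world" with "hello, world!", which no exact-match comparison can do.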

Here are the details of our algorithm [6]:

  1. To de-duplicate a dataframe, we append a unique identifier column, __dedup_id.

  2. We employ the embed_func function to generate hash values. This function breaks the text content into N-grams, applies a hash function to produce a series of signatures __signatures, and maps these signatures to their corresponding unique identifiers. Each signature is a tuple of a band index and the byte-encoded slice of that row's minimum hash values for the corresponding band; the total number of hash values per row is controlled by the num_perm parameter.
     def embed_func(
         row: pd.Series,
         text: str,
         num_perm: int,
         ngram_size: int,
         min_length: int,
         hashranges: List[Tuple[int, int]],
         permutations: np.ndarray,
     ) -> pd.Series:
         content, idx = row[text], row["__dedup_id"]
         a, b = permutations
         masks: np.ndarray = np.full(shape=num_perm, dtype=np.uint64, fill_value=MAX_HASH)
         # Tokenize the content into N-grams over non-alphanumeric splits
         tokens: Set[str] = {
             " ".join(t) for t in ngrams(NON_ALPHA.split(content), ngram_size, min_length)
         }
         hashvalues: np.ndarray = np.array(
             [sha1_hash(token.lower().encode("utf-8")) for token in tokens], dtype=np.uint64
         )
         # Apply num_perm universal-hash permutations, then keep the per-permutation minimum
         permuted_hashvalues = np.bitwise_and(
             ((hashvalues * np.tile(a, (len(hashvalues), 1)).T).T + b) % MERSENNE_PRIME,
             MAX_HASH,
         )
         hashvalues = np.vstack([permuted_hashvalues, masks]).min(axis=0)
         # Slice the signature into LSH bands, one entry per hash range
         Hs = [
             (i, bytes(hashvalues[start:end].byteswap().data))
             for i, (start, end) in enumerate(hashranges)
         ]
         return pd.Series({"__signatures": Hs, "__id": idx})
  3. Leveraging this embedding function, we compute the embedded dataframe using the apply operand.
     embedded = in_df_with_id.apply(
         partial(embed_func, text=text, num_perm=num_perm, ngram_size=ngram_size,
                 min_length=min_length, hashranges=hashranges, permutations=permutations),
         axis=1,
         dtypes=pd.Series(["object", "bytes"], index=["__signatures", "__id"]),
     )
  4. We then apply Locality Sensitive Hashing (LSH), a technique that enables approximate nearest-neighbor search by mapping high-dimensional data into a lower-dimensional space. This requires us to "explode" the __signatures column so that each row holds a single band signature, after which we group by signature and collect the ids into clusters of candidate duplicates.
     clusters = (
         embedded.explode("__signatures")
         .groupby("__signatures", sort=False)["__id"]
         .apply(set)
     )
  5. Next, we construct the union-find (disjoint-set) structure. All preceding dataframe operations can be distributed, with each chunk computing its own local disjoint-set. However, a global disjoint-set is necessary for deduplication, so we must merge all local sets after computation. As this is the only non-distributable step in our deduplication algorithm, we use a Cython-based disjoint-set to expedite computation.
     for cluster in clusters:
         if len(cluster) <= 1:
             continue
         idx = min(cluster)
         for x in cluster:
             op.union_find.union_(x, idx)
  6. Once the global union-find is constructed, we keep only the rows whose identifier is its own cluster root, and finally drop the identifier column __dedup_id.
     res = input_data[
         input_data["__dedup_id"].map(lambda x: op.union_find.find(x) == x)
     ].drop(columns=["__dedup_id"])
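The steps above can be sketched end-to-end on a single machine in plain NumPy, with no Xorbits involved. This is a simplified illustration, not the Xorbits implementation: the helper names (minhash_signature, dedup_indices), the 32-bit SHA-1-derived token hash, and the default parameters are all choices made for this sketch.

```python
import hashlib
import re
from collections import defaultdict

import numpy as np

MERSENNE_PRIME = np.uint64((1 << 61) - 1)
MAX_HASH = np.uint64((1 << 32) - 1)
NON_ALPHA = re.compile(r"\W+")

def sha1_hash(data: bytes) -> int:
    # First 4 bytes of SHA-1 as an unsigned 32-bit integer
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "little")

def minhash_signature(text, num_perm, ngram_size, a, b):
    # Tokenize into word N-grams, hash each token, then take the
    # minimum under each of the num_perm universal-hash permutations.
    words = [w for w in NON_ALPHA.split(text.lower()) if w]
    n = max(len(words) - ngram_size + 1, 1)
    tokens = {" ".join(words[i:i + ngram_size]) for i in range(n)}
    hashvalues = np.array(
        [sha1_hash(t.encode("utf-8")) for t in tokens], dtype=np.uint64
    )
    permuted = ((hashvalues[:, None] * a + b) % MERSENNE_PRIME) & MAX_HASH
    return permuted.min(axis=0)

def dedup_indices(texts, num_perm=16, ngram_size=2, bands=4, seed=42):
    rng = np.random.default_rng(seed)
    a = rng.integers(1, MERSENNE_PRIME, size=num_perm, dtype=np.uint64)
    b = rng.integers(0, MERSENNE_PRIME, size=num_perm, dtype=np.uint64)
    rows = num_perm // bands
    parent = list(range(len(texts)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # LSH banding: rows sharing any band key become duplicate candidates
    buckets = defaultdict(list)
    for i, t in enumerate(texts):
        sig = minhash_signature(t, num_perm, ngram_size, a, b)
        for band in range(bands):
            key = (band, sig[band * rows:(band + 1) * rows].tobytes())
            buckets[key].append(i)

    # Union-find over each candidate cluster, smallest index as the root
    for cluster in buckets.values():
        if len(cluster) <= 1:
            continue
        root = find(min(cluster))
        for x in cluster:
            parent[find(x)] = root

    # Keep only rows that are their own cluster root
    return [i for i in range(len(texts)) if find(i) == i]
```

Because the per-permutation minimum is order-invariant, identical texts always receive identical signatures and collapse into one cluster, while texts sharing no N-grams almost never collide in any band.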

Here is an animation that neatly summarizes the above steps:


We subjected our algorithm to rigorous testing on oscar2201’s Chinese dataset, utilizing the formidable computational power of AWS’s r5a.16xlarge machine, which boasts 64 cores and 512GB memory. The results are enumerated below:

| Size  | Pre-Deduplication Rows | Post-Deduplication Rows | Time   |
|-------|------------------------|-------------------------|--------|
| 1GB   | 68,000                 | 62,250                  | 50s    |
| 10GB  | 666,000                | 559,420                 | 172s   |
| 50GB  | 3,120,337              | 2,387,421               | 572s   |
| 100GB | 6,292,027              | 4,603,959               | 1,073s |

We further pushed the boundaries by running distributed tests on the same dataset. This time, we employed AWS’s r6i.8xlarge machine, which features 32 cores and 256GB memory. The figures recorded were as follows:

| Size  | Pre/Post-Deduplication Rows | Node Num | Time   |
|-------|-----------------------------|----------|--------|
| 500GB | 31,385,092 / 20,380,655     | 5        | 1,174s |
| 500GB | same                        | 7        | 938s   |
| 500GB | same                        | 10       | 683s   |
| 500GB | same                        | 15       | 550s   |

The results unambiguously demonstrate a robust deduplication speed that scales adeptly with increasing node counts. However, we do observe a tapering in performance improvement as we move from 10 to 15 nodes. This marginal slowdown stems from the need to merge the local disjoint-sets produced by each chunk: while this merge is necessary for global deduplication, it cannot itself be distributed, thereby introducing a ceiling on time reduction at larger node counts.
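To illustrate why that final merge serializes, here is a minimal dict-based union-find sketch with path compression (the names and the edge-replay representation are hypothetical, not the Cython implementation): every chunk's local structure must be replayed into a single global structure, one edge at a time.

```python
class UnionFind:
    """Minimal union-find (disjoint-set) with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk at the root
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union_(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Smaller id wins, matching the min(cluster) convention above
            self.parent[max(rx, ry)] = min(rx, ry)

def merge_union_finds(global_uf, local_ufs):
    # Replaying each local chunk's edges into one global structure is
    # inherently sequential: this is the non-distributable step.
    for uf in local_ufs:
        for child, parent in uf.parent.items():
            global_uf.union_(child, parent)
    return global_uf
```

Clusters that span chunks, such as {1, 2} in one chunk and {2, 3} in another, are only connected once their edges meet in the global structure.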

It’s worth noting that harnessing the power of multiple nodes with Xorbits is a relatively straightforward task. For those who are interested in implementing such a solution, a comprehensive guide is available at the following link: cluster deployment. We hope that this will serve as a valuable resource in your deduplication journey.

Future Work

In our existing architecture, deduplication requires the dataset to be stored locally or on a cloud server, accessed via xorbits.pandas.read_csv or read_parquet. While proficient, we continually aspire to improve user experience and performance. We’re excited to reveal that we’re integrating our system with Hugging Face’s datasets - a robust library for large-scale machine learning datasets, streamlining and accelerating data reading procedures. Not stopping there, we aim to diversify our text feature extraction techniques beyond n-grams, to include methods like TF-IDF. Additionally, we plan to introduce alternative hashing methodologies, such as SimHash, further enhancing the flexibility and effectiveness of our deduplication workflow.
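SimHash, mentioned above as a planned alternative, can be sketched in a few lines (a toy illustration, not the planned Xorbits implementation; the token hashing and weighting scheme here are simplified assumptions): each token votes on every bit of a fingerprint, so near-duplicate texts tend to land at a small Hamming distance.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: weighted bit-voting over whitespace tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        # 64-bit token hash derived from SHA-1
        h = int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set where the positive votes outweigh the negative ones
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")
```

Unlike MinHash, which estimates set similarity via signatures, SimHash produces a single fingerprint per document, and candidate pairs are found by thresholding the Hamming distance between fingerprints.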

If you’re as thrilled as we are about this upcoming improvement, stay tuned!


The importance of scalable text deduplication cannot be overstated, given the expanding size of language datasets. Xorbits fills a vital gap with its easy-to-use pandas DataFrame deduplication interface, enabling both single and multi-node deduplication for large datasets, thus resolving performance issues of standalone machines. We encourage users seeking efficient and effective text deduplication to adopt Xorbits, an indispensable solution for data scientists and ML engineers.


1: Nikhil Kandpal, Eric Wallace, Colin Raffel, Deduplicating Training Data Mitigates Privacy Risks in Language Models, 2022

2: Katherine Lee, Daphne Ippolito, et al., Deduplicating Training Data Makes Language Models Better, 2022

3: Abadji Julien, Ortiz Suarez Pedro, et al. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus, 2022

4: Bharat Ramanathan, Processing Data for Large Language Models, 2022

5: Gowthami Somepalli, et al., Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models, 2022

6: Some code adapted from

© 2022-2023 Xprobe Inc. All Rights Reserved.