This article shows how to build an article recommendation engine, using Xorbits to generate embedding vectors and Pinecone as the vector database.

Introduction
The emergence of AI tools such as ChatGPT, Midjourney, and Copilot has clearly demonstrated the vast potential of AI programs and services across various fields. But how does a machine, which fundamentally understands only binary code (0 and 1), appear to possess such extensive knowledge, albeit with its fair share of flaws?
The answer lies in the way we humans feed information to these machines. Although machines lack the human ability to perceive the world, they can process and understand data that we provide in the form of numerical vectors. This is where embeddings come into play.
Image source: An Introduction to the Power of Vector Search for Beginners
An embedding is essentially a representation of an entity, such as a word, as a vector in a multi-dimensional space. In other words, it is a sequence of numbers like [.89, .65, .45, ...]. This is an intelligent way to convert categorical entities into numeric representations, allowing us to interpret them in various ways, such as assessing their similarity by positioning them in that space.
Consider article recommendation as an example. Each article is encoded as an embedding vector, and the distance between two vectors signifies how similar the articles are. These embeddings are stored in a vector database, which can be searched based on specific requirements, such as an article title. Conceptually, this search resembles executing a SQL WHERE clause with a LIKE operator, except that it matches by vector similarity rather than by text pattern. The database then returns the embeddings that most closely match the query, providing relevant article recommendations.
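To make this concrete, here is a minimal sketch (plain numpy, separate from the pipeline below) of how cosine similarity ranks a few toy embeddings against a query vector; the titles and vector values are made up for illustration:
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction, 0.0 means orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" for three articles (illustrative values only)
articles = {
    'tennis recap': np.array([0.9, 0.1, 0.0]),
    'stock market': np.array([0.1, 0.9, 0.2]),
    'tennis preview': np.array([0.8, 0.2, 0.1]),
}
query = np.array([0.85, 0.15, 0.05])  # stand-in for a user's interest vector

# Rank the articles by similarity to the query, highest first
ranked = sorted(articles.items(), key=lambda kv: -cosine_similarity(query, kv[1]))
for title, _ in ranked:
    print(title)
Both tennis articles rank above the finance one, which is exactly the behavior a vector database scales up to millions of articles.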
In the subsequent steps, we’ll employ Xorbits to generate vector embeddings from articles. Following that, we’ll utilize Pinecone to store these embeddings and retrieve the results. Xorbits is a high-performance distributed framework designed for data science, and it is fully compatible with existing libraries like pandas: to use it, simply substitute import pandas with import xorbits.pandas. Pinecone, on the other hand, is a vector database specifically designed for storing and querying high-dimensional vectors.
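As a quick illustration of that drop-in compatibility, the following sketch only changes the import line relative to an equivalent pandas script (any CSV file would do here):
# The only change from a pandas script is the import line
import xorbits.pandas as pd

df = pd.read_csv('all-the-news-2-1.csv', nrows=1000)
print(df.shape)
print(df.columns)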
Install Packages
!pip install --quiet wordcloud xorbits
!pip install --quiet sentence-transformers --no-cache-dir
!pip install --quiet pinecone-client==2.2.1
Download Sources
The dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020.
!rm -f all-the-news-2-1.zip
!rm -f all-the-news-2-1.csv
!wget https://www.dropbox.com/s/cn2utnr5ipathhh/all-the-news-2-1.zip -q --show-progress
!unzip -q all-the-news-2-1.zip
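Before preprocessing, it can help to peek at a few rows and confirm the columns the later steps rely on (title, article, section, and publication). A quick sketch:
import xorbits.pandas as pd

# Load only a handful of rows to inspect the schema
sample = pd.read_csv('all-the-news-2-1.csv', nrows=5)
print(sample.columns)
print(sample[['title', 'section', 'publication']])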
Prepare Data & Create Vector Embeddings with Xorbits
We opt for Xorbits as a substitute for pandas to assist us in text preprocessing and data encoding, generating vector embeddings with a sentence-transformers model. The article data is imported as a CSV-formatted data frame, with an additional column created to hold the vector of each article. The embedding vector for each article is generated from the information in both the title and article columns. The model used in this example is the average word embeddings model average_word_embeddings_komninos, shown below.
from sentence_transformers import SentenceTransformer
import xorbits.pandas as pd
import re
import torch
# Set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('average_word_embeddings_komninos', device=device)
def prepare_data(data) -> pd.DataFrame:
    'Preprocesses data and prepares it for upsert.'
    # Add an id column
    print("Preparing data...")
    data["id"] = range(len(data))
    # Extract only the first few sentences of each article for quicker vector calculations.
    # We made a slight optimization when generating data['article'].
    # Original version:
    # data['article'] = data.article.apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:4]))
    # Optimized version: stop scanning as soon as the fourth sentence boundary is found
    def helper(x):
        count = 0
        x = str(x)
        idx = len(x)  # fall back to the full text if there are fewer than 4 boundaries
        for match in re.finditer(r'(?<=[.:;])\s', x):
            count += 1
            if count == 4:
                idx = match.start()
                break
        return x[:idx]
    data['article'] = data['article'].fillna('')
    data['article1'] = data.article.apply(helper)
    # Concatenate the title and the truncated article text (a space keeps the
    # last title word and the first article word from fusing into one token)
    s2 = data['title'].str.cat(list(data['article1']), sep=" ")
    print('Encoding articles...')
    encoded_articles = s2.apply(lambda x: list(map(float, model.encode(x))))
    data['article_vector'] = encoded_articles
    return data
# Use part of the dataset because using the complete dataset may require more time
data = pd.read_csv('all-the-news-2-1.csv', nrows=200000)
prepared_data = prepare_data(data)
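Before moving on, a quick sanity check (a sketch) confirms that the model produces 300-dimensional vectors, which must match the dimension of the Pinecone index created in the next step:
# The average_word_embeddings_komninos model outputs 300-dimensional vectors
sample_vector = model.encode('a quick sanity check sentence')
print(len(sample_vector))  # expected: 300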
In the following comparison, we evaluate the performance of Xorbits and pandas in reading CSV files, processing the text, and creating embeddings. The following chart illustrates these comparisons across data frames with different row counts (lower is better). It becomes clear that Xorbits has a distinct advantage over pandas in terms of data preprocessing speed, especially when handling data frames with over 500k rows.
Therefore, when there is a substantial number of articles to vectorize, opting for Xorbits generally leads to better performance and time efficiency.
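If you want to reproduce such a comparison on your own machine, a minimal timing sketch like the one below works; the absolute numbers will depend on your hardware and the row count you choose. Note that Xorbits defers some work, so we force materialization with repr before stopping the clock:
import time
import pandas
import xorbits.pandas

for lib in (pandas, xorbits.pandas):
    start = time.time()
    df = lib.read_csv('all-the-news-2-1.csv', nrows=200000)
    repr(df)  # force any deferred Xorbits operations to execute
    print(f'{lib.__name__}: {time.time() - start:.2f}s')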
Insert Articles into the Pinecone Index
Next, we’ll proceed to index the encoded data into Pinecone. We’ll need to initialize Pinecone and create the index in Pinecone first. You can get a Pinecone API key here.
import pinecone

pinecone.init(
    api_key="YOUR_API_KEY",
    environment="YOUR_ENV"
)

# The index dimension must match the 300-dimensional model output
pinecone.create_index('articles-recommendation', dimension=300)
index = pinecone.Index('articles-recommendation')
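Note that create_index raises an error if the index already exists, so rerunning the notebook fails at this step. A small guard (a sketch against the pinecone-client 2.x API used here) makes it idempotent:
# Only create the index if it does not already exist (e.g., on notebook reruns)
if 'articles-recommendation' not in pinecone.list_indexes():
    pinecone.create_index('articles-recommendation', dimension=300)
index = pinecone.Index('articles-recommendation')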
BATCH_SIZE = 200  # batch size for upserting

def upload_items(data):
    'Uploads data in batches.'
    print("Uploading items...")
    data = data.astype({'id': 'str'})
    items_to_upload = list(zip(list(data['id']), list(data['article_vector'])))
    for i in range(0, len(items_to_upload), BATCH_SIZE):
        index.upsert(vectors=items_to_upload[i:i + BATCH_SIZE])

upload_items(prepared_data)
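After the upserts finish, you can verify that everything landed in the index; describe_index_stats reports the total vector count (a quick check, assuming all 200,000 rows were uploaded):
# Confirm that the number of stored vectors matches what we uploaded
print(index.describe_index_stats())  # expect total_vector_count == 200000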
Query the Index for Similar Articles
We will now query the index on behalf of specific users. A user is defined by the set of articles they previously read. More specifically, we select 10 articles for a user and, based on those article embeddings, derive a single embedding for the user. For example, we use 10 articles of tennis-related sports news to define the user, averaging their article vectors to get the user embedding. We then use the user embedding to query Pinecone and get recommendations for that user.
from statistics import mean

# Select 10 tennis-related sports articles from the prepared data frame
sport_user = prepared_data.loc[((prepared_data['section'] == 'Sports News') |
                                (prepared_data['section'] == 'Sports')) &
                               (prepared_data['article'].str.contains('Tennis'))][:10]
display(sport_user[['title', 'article', 'section', 'publication']])

# Average the 10 article vectors element-wise to get the user embedding
v = sport_user['article_vector']
sport_user_vector = [*map(mean, zip(*v))]
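Averaging with statistics.mean over zipped Python lists works, but for longer reading histories a vectorized numpy version (a sketch, assuming the article_vector column holds plain Python lists of floats) is typically faster:
import numpy as np

# Stack the 10 article vectors into a (10, 300) matrix, then average over rows
vectors = np.stack(sport_user['article_vector'].to_numpy())
sport_user_vector = np.mean(vectors, axis=0).tolist()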
def show_results(res):
    'Displays the query matches together with article metadata.'
    ids = [match.id for match in res.matches]
    scores = [match.score for match in res.matches]
    df = pd.DataFrame({'id': ids,
                       'score': scores,
                       'title': [prepared_data['title'][int(_id)] for _id in ids],
                       'section': [prepared_data['section'][int(_id)] for _id in ids],
                       'publication': [prepared_data['publication'][int(_id)] for _id in ids]
                       })
    display(df)
    return df

results = index.query(sport_user_vector, top_k=10)
results_df = show_results(results)
Here is an example of the articles previously read by this user:
We use a word cloud to visualize the recommended articles for the user.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def get_wordcloud_for_user(recommendations):
    wordcloud = WordCloud(
        max_words=50000,
        min_font_size=12,
        max_font_size=50,
        relative_scaling=0.9,
        stopwords=STOPWORDS,
        normalize_plurals=True
    )
    # Join the recommended titles into one string; WordCloud itself filters stopwords
    clean_titles = [str(title) for title in recommendations['title'].to_numpy()]
    title_wordcloud = wordcloud.generate(' '.join(clean_titles))
    plt.imshow(title_wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

get_wordcloud_for_user(results_df)
A word-cloud representing the results:
The query results demonstrate the effectiveness of our recommendation system, as it suggests articles that closely align with the user’s reading history.
Conclusion
Given its ability to process large volumes of data quickly in a distributed fashion, Xorbits is a great tool for generating vector embeddings, building recommender systems, and performing analysis and machine learning, and it can be applied in many more fields.
Reference
This article recommender tutorial originates from Pinecone’s example. Check out its other examples as well!
Like this content? Give us a star @xprobe-inc/xorbits.