We are thrilled to announce Xorbits Inference (Xinference), a library designed for large-scale model inference tasks. Xinference supports a variety of models, including large language models (LLMs) and multimodal models. Whether you have a personal computer or a cluster, you can leverage cutting-edge models with Xinference. Through the APIs and third-party integration interfaces provided by Xinference, you can effortlessly build AI-based applications such as conversation, writing, code completion, image generation, and more.
Background
In the generative AI era, models like OpenAI’s GPT-4 and Google’s PaLM 2 have demonstrated extraordinary potential in various industries. Moreover, advanced multimodal models have shown the limitless possibilities of human-AI collaboration. However, for many companies and individuals, services provided by companies like OpenAI may not be the best solution, due to:
- Data privacy and security issues: AI services could lead to data leaks for businesses and individuals. Individuals’ identity and behavior information might be exposed during their interactions with AI. This issue is even more severe for companies, as any AI service usage in the production process could pose data security risks.
- Customization needs: Generic AI services sometimes aren’t the optimal solution for specific tasks. Large enterprises often need to fine-tune models based on their datasets to meet specific task requirements.
- Cost: Generic AI services are quite expensive. In contrast, smaller models fine-tuned for a specific domain can often meet business needs quite well, significantly reducing model deployment and inference costs.
Meanwhile, the open-source community is evolving rapidly. Commercially licensed open-source foundation models such as OpenLLaMA have emerged, and fine-tuned models built on them continue to appear.
Therefore, Xinference aims to provide production-grade private model deployment at scale through simple APIs, from personal computers to server clusters, making full use of heterogeneous hardware to maximize throughput and minimize inference latency.
Feature Overview
🌟 One-click model deployment: Simplify the deployment process for models, including large language models, multimodal models, and speech recognition models.
⚡️ Built-in cutting-edge models: You can easily download and deploy a wide range of state-of-the-art models built into Xinference, including chatglm2, vicuna, wizardlm, and more. Our official Hugging Face account will keep the list of supported models up to date.
🖥 Heterogeneous hardware utilization: Xinference can run inference on CPUs as well as GPUs. It can also offload part of the computation to the CPU when the GPU is busy, increasing the cluster's throughput.
⚙️ Flexible APIs: Xinference offers multiple interfaces, including RPC and a RESTful API (compatible with OpenAI protocols). You can choose either to integrate with existing systems. Additionally, Xinference provides a command-line interface and a web UI for easy system management and monitoring.
🌐 Distributed architecture: Xinference uses a distributed architecture, making cross-device and cross-server model deployment possible. The distributed architecture also allows high-concurrency inference and makes scaling up and down simpler.
🔌 Third-party integration: Xinference can seamlessly integrate with third-party libraries, including LangChain, to assist in quickly building AI-based applications.
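As a concrete taste of the LangChain integration, here is a minimal sketch using LangChain's `Xinference` LLM wrapper. The `server_url` and `model_uid` values are placeholders you would obtain from your own deployment, and the import path reflects the LangChain integration at the time of writing; check the current documentation for the exact names.

```python
# A minimal sketch of calling a Xinference-served model from LangChain.
# Assumes Xinference is running locally and a model has already been
# launched; the model UID below is a placeholder.
from langchain.llms import Xinference

llm = Xinference(
    server_url="http://localhost:9997",  # the endpoint Xinference prints on startup
    model_uid="YOUR-MODEL-UID",          # placeholder: returned when you launch a model
)

print(llm("Q: What is the capital of France? A:"))
```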
Try Xinference
You can install Xinference via PyPI. We strongly recommend using a new virtual environment to avoid potential dependency conflicts:
$ pip install "xinference[all]"
xinference[all] will automatically install all default dependencies, including the model runtime. We also recommend installing the runtime manually according to your hardware to improve inference efficiency.
Local deployment
$ xinference
After Xinference starts, it prints the service's endpoint. Through this endpoint, you can use the CLI or the client to launch, view, and terminate models. The endpoint also serves a web UI where users can interact with any deployed model. You can even chat with two LLMs side by side to compare their performance!
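Since the RESTful API is compatible with OpenAI protocols, plain HTTP is enough to chat with a deployed model. The sketch below assumes the default local endpoint and an OpenAI-style `/v1/chat/completions` route; the model UID is a placeholder for the one returned when you launch a model.

```python
# A hedged sketch of calling the OpenAI-compatible chat endpoint.
import requests

ENDPOINT = "http://localhost:9997"   # replace with the endpoint Xinference prints
MODEL_UID = "YOUR-MODEL-UID"         # placeholder: returned when you launch a model

resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    json={
        "model": MODEL_UID,
        "messages": [{"role": "user", "content": "What is the largest animal?"}],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```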
For more information on distributed deployment, please refer to the README.
Built-In Models
Below is a list of models currently supported by Xinference:
| Name | Type | Language | Format | Size (in billions) | Quantization |
| --- | --- | --- | --- | --- | --- |
| baichuan | Foundation Model | en, zh | ggmlv3 | 7 | q2_K, q3_K_L, …, q6_K, q8_0 |
| chatglm | SFT Model | en, zh | ggmlv3 | 6 | q4_0, q4_1, q5_0, q5_1, q8_0 |
| chatglm2 | SFT Model | en, zh | ggmlv3 | 6 | q4_0, q4_1, q5_0, q5_1, q8_0 |
| wizardlm-v1.0 | SFT Model | en | ggmlv3 | 7, 13, 33 | q2_K, q3_K_L, …, q6_K, q8_0 |
| vicuna-v1.3 | SFT Model | en | ggmlv3 | 7, 13 | q2_K, q3_K_L, …, q6_K, q8_0 |
| orca | SFT Model | en | ggmlv3 | 3, 7, 13 | q4_0, q4_1, q5_0, q5_1, q8_0 |
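To illustrate how the table's columns map onto a launch request, here is a minimal sketch using the Python client. The `Client`, `launch_model`, and `get_model` names follow the project's README at the time of writing; treat the exact argument names as assumptions and check the current documentation.

```python
# A minimal sketch of launching a built-in model via the Python client.
# Keyword arguments mirror the table columns above.
from xinference.client import Client

client = Client("http://localhost:9997")  # the endpoint Xinference prints on startup

model_uid = client.launch_model(
    model_name="chatglm2",       # Name
    model_format="ggmlv3",       # Format
    model_size_in_billions=6,    # Size (in billions)
    quantization="q4_0",         # Quantization
)

model = client.get_model(model_uid)
reply = model.chat("What is the largest animal?")  # returns an OpenAI-style response dict
print(reply["choices"][0]["message"]["content"])
```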
We will update the list of supported models regularly, and you can also request support for specific models.
Roadmap
Xinference is still in rapid iteration. In the near future, we will focus on three aspects:
- Support more runtimes like PyTorch to embrace a broader ecosystem.
- Optimize core functions like model scheduling and inference task scheduling to improve throughput further and reduce latency.
- Strengthen integration with existing AI infrastructure to grow the ecosystem.
We value and look forward to community feedback on these plans, and we welcome opinions and suggestions from users and developers.
Summary
For individual users, Xinference lets you experience state-of-the-art open-source models on your personal computer. For enterprise users, it helps you easily deploy and manage models on a computing cluster, with the security, customizability, and low cost of private deployment.
To get started with Xinference:
$ pip install "xinference[all]"
Try using Xinference now!
Resources
- GitHub: https://github.com/xorbitsai/inference
- Community: https://xorbits.cn/community