switchboard

High-throughput LLM inference server with OpenAI-compatible API. Serves Llama, Mistral, Qwen, and 30+ model families with PagedAttention. pip install vllm.

Skills: 3
Auth: None
Streaming: Yes
Push: No

Skills

OpenAI-Compatible Serving

Serve any HuggingFace model as an OpenAI-compatible API endpoint with full streaming and function calling.
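As a sketch of what OpenAI compatibility means in practice, the snippet below builds the standard Chat Completions request a client would POST to a locally running server. The model name is an assumption for illustration; port 8000 is vLLM's default, and any off-the-shelf OpenAI client produces this same payload shape.

```python
import json
from urllib import request

# Standard OpenAI Chat Completions payload. A vLLM server accepts
# this shape unchanged at /v1/chat/completions.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # token-by-token streaming, as with the OpenAI API
}

def build_request(base_url: str = "http://localhost:8000/v1") -> request.Request:
    """Build the POST a client would send to the chat completions endpoint."""
    return request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request()
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Because the wire format matches OpenAI's, existing OpenAI SDKs work against a vLLM endpoint by pointing their base URL at the local server.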

PagedAttention Engine

Handle thousands of concurrent requests via the PagedAttention KV cache, with up to 24x the throughput of naive HuggingFace Transformers inference.
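A toy sketch of the idea behind PagedAttention (illustrative only, not vLLM's implementation): the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical blocks to physical ones, so memory is allocated on demand as the sequence grows instead of being reserved for the maximum length.

```python
# Toy paged KV-cache allocator. Memory grows with actual sequence
# length, and finished sequences return their blocks to the pool,
# which is what lets many requests share one GPU's cache.
BLOCK_SIZE = 16  # tokens per block (16 is vLLM's default block size)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))           # physical block pool
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks

    def append_token(self, seq_id: int, pos: int) -> int:
        """Return the physical block holding token `pos`, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(table):  # crossed a block boundary
            table.append(self.free.pop())    # grab one free block
        return table[pos // BLOCK_SIZE]

    def release(self, seq_id: int) -> None:
        """Free a finished sequence's blocks for reuse by other requests."""
        self.free.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                # a 40-token sequence...
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))    # 3 blocks used, not a max-length reservation
```

The throughput win comes from this on-demand allocation: with near-zero wasted cache memory, far more sequences fit in GPU memory at once and can be batched together.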

Multi-Model Support

Deploy 30+ model architectures including Llama, Mistral, Qwen, Falcon, Phi, and Mixtral from one server.

Infrastructure & Ops · llm-serving · inference-server · openai-compatible · paged-attention · high-throughput · self-hosted · gpu-inference
vllm
High-throughput LLM inference server with OpenAI-compatible API. Serves Llama, Mistral, Qwen, and 30+ model families with PagedAttention. pip install vllm.
fields
name: vLLM
provider: vLLM Project
url: https://docs.vllm.ai
categories: infrastructure
access: api · cli
auth: none
streaming: true
push: false
verified: true
tags: llm-serving, inference-server, openai-compatible, paged-attention, high-throughput, self-hosted, gpu-inference
skills
openai-serving · OpenAI-Compatible Serving · Serve any HuggingFace model as an OpenAI-compatible API endpoint with full streaming and function calling.
paged-attention · PagedAttention Engine · Handle thousands of concurrent requests via PagedAttention KV cache — 24x throughput over naive HuggingFace inference.
multi-model · Multi-Model Support · Deploy 30+ model architectures including Llama, Mistral, Qwen, Falcon, Phi, and Mixtral from one server.