Language models
The power and economy of open-source LLMs with the ease of serverless.
Bleeding-edge engines
With Dibtrun, you no longer have to choose between ease of use and the latest developments in language model research—you can have both!
All state-of-the-art LLM serving frameworks work out of the box, including:
vLLM
text-generation-inference
MLC
CTranslate2
Optimal usage
Dibtrun helps you squeeze the last bit of utilization out of your GPUs. If your LLM framework supports continuous batching for greater token throughput, you can take advantage of it with a single configuration change.
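To see why continuous batching raises throughput, here is a minimal, framework-agnostic sketch: a toy scheduler that admits new requests into the running batch the moment finished ones free a slot, instead of waiting for the whole batch to drain. (This is an illustration of the technique only, not Dibtrun's or any framework's API; `continuous_batch` and its counters are hypothetical names.)

```python
from collections import deque

def continuous_batch(requests, max_batch_size):
    """Toy continuous-batching scheduler.

    `requests` is a list of (request_id, tokens_to_generate) pairs.
    Each loop iteration is one decode step in which every running
    request emits one token; finished requests free their batch slot
    immediately, so waiting requests join mid-batch.
    """
    pending = deque(requests)   # requests not yet admitted
    running = {}                # request_id -> tokens still to generate
    completed = []
    steps = 0
    while pending or running:
        # Admit new requests into any free batch slots.
        while pending and len(running) < max_batch_size:
            rid, n = pending.popleft()
            running[rid] = n
        # One decode step across the whole running batch.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                completed.append(rid)
                del running[rid]  # slot freed without draining the batch
        steps += 1
    return completed, steps

# Three requests, batch size 2: continuous batching finishes in 3 steps,
# whereas static batching (drain the batch, then start the next) would
# take 5 steps for the same workload.
done, steps = continuous_batch([("a", 3), ("b", 1), ("c", 2)], max_batch_size=2)
```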
Token streaming
Activate token streaming in your language model by turning your Python function into a generator. This integrates seamlessly with HTTPS endpoints, allowing direct stream subscription from your Node.js backend!
@method()
def generate(self, prompt: str):
yield from pipeline(self.model, self.tokenizer, {"prompt": prompt})
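The snippet above relies on a standard Python property: generators yield values lazily, so each token can be forwarded to the client as soon as it is produced. A self-contained sketch of the same pattern, with a toy whitespace "tokenizer" standing in for the real model pipeline (`generate` here is illustrative, not the snippet's actual method):

```python
def generate(prompt: str):
    # Stand-in for the model pipeline: yield tokens one at a time so a
    # caller can render (or forward over HTTP) partial output immediately.
    for token in prompt.split():
        yield token + " "

# A caller consumes the stream incrementally; joining the chunks
# reconstructs the full response.
streamed = "".join(generate("hello streaming world"))
```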
LoRA simplified
Low-rank adaptation (LoRA) is a method for developing fine-tuned models as small adapters on top of the original model.
Dibtrun's parametrized functions make it easy to build applications that serve inference across a variable set of LoRA adapters: drop your adapter weights into Volumes and they are immediately ready to go for inference.
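The reason LoRA adapters stay small is the underlying math: instead of fine-tuning a full weight matrix W, you learn a low-rank update B·A (B is d_out×r, A is r×d_in, with rank r much smaller than either dimension) and merge it as W' = W + α·(B·A). A plain-Python sketch of that merge, using toy matrices rather than real model weights (`apply_lora` and `matmul` are illustrative names, not Dibtrun APIs):

```python
def matmul(a, b):
    # Plain-Python matrix multiply, sufficient for this small sketch.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def apply_lora(W, A, B, alpha=1.0):
    """Merge a LoRA adapter into a weight matrix: W' = W + alpha * (B @ A).

    B has shape (d_out, r) and A has shape (r, d_in), so the adapter
    stores d_out*r + r*d_in parameters instead of d_out*d_in.
    """
    BA = matmul(B, A)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Rank-1 example: a 2x2 base matrix updated by a (2x1)·(1x2) adapter.
W = [[1, 0], [0, 1]]
B = [[1], [2]]
A = [[3, 4]]
merged = apply_lora(W, A, B)  # W + B@A, where B@A = [[3, 4], [6, 8]]
```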