Language models
The power and economy of open-source LLMs with the ease of serverless.
Bleeding-edge engines
With Dibtrun, you no longer have to choose between ease of use and the latest developments in language model research—you can have both!
All state-of-the-art LLM serving frameworks work out of the box, including:
vLLM
text-generation-inference
MLC
CTranslate2
Optimal usage
Dibtrun helps you squeeze the last bit of utilization out of your GPUs. If your LLM framework supports continuous batching for greater token throughput, you can take advantage of it with a single configuration change.
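To see why continuous batching raises throughput, here is a minimal, framework-agnostic sketch: a toy scheduler that admits new requests into the running batch the moment finished ones free a slot, instead of waiting for the whole batch to drain. (This is an illustration of the technique only, not Dibtrun's or any framework's API; `continuous_batch` and its counters are hypothetical names.)

```python
from collections import deque

def continuous_batch(requests, max_batch_size):
    """Toy continuous-batching scheduler.

    `requests` is a list of (request_id, tokens_to_generate) pairs.
    Each loop iteration is one decode step in which every running
    request emits one token; finished requests free their batch slot
    immediately, so waiting requests join mid-batch.
    """
    pending = deque(requests)   # requests not yet admitted
    running = {}                # request_id -> tokens still to generate
    completed = []
    steps = 0
    while pending or running:
        # Admit new requests into any free batch slots.
        while pending and len(running) < max_batch_size:
            rid, n = pending.popleft()
            running[rid] = n
        # One decode step across the whole running batch.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                completed.append(rid)
                del running[rid]  # slot freed without draining the batch
        steps += 1
    return completed, steps

# Three requests, batch size 2: continuous batching finishes in 3 steps,
# whereas static batching (drain the batch, then start the next) would
# take 5 steps for the same workload.
done, steps = continuous_batch([("a", 3), ("b", 1), ("c", 2)], max_batch_size=2)
```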
Token streaming
Activate token streaming in your language model by turning your Python function into a generator. This integrates seamlessly with HTTPS endpoints, allowing direct stream subscription from your Node.js backend!
@method()
def generate(self, prompt: str):
yield from pipeline(self.model, self.tokenizer, {"prompt": prompt})
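The snippet above relies on a standard Python property: generators yield values lazily, so each token can be forwarded to the client as soon as it is produced. A self-contained sketch of the same pattern, with a toy whitespace "tokenizer" standing in for the real model pipeline (`generate` here is illustrative, not the snippet's actual method):

```python
def generate(prompt: str):
    # Stand-in for the model pipeline: yield tokens one at a time so a
    # caller can render (or forward over HTTP) partial output immediately.
    for token in prompt.split():
        yield token + " "

# A caller consumes the stream incrementally; joining the chunks
# reconstructs the full response.
streamed = "".join(generate("hello streaming world"))
```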
LoRA simplified
Low-rank adaptation (LoRA) is a method for developing fine-tuned models as small adapters on top of the original model.
Dibtrun's parametrized functions make it easy to build applications that serve inference across a variable set of LoRA adapters: drop your adapter weights into Volumes and they are immediately ready to go for inference.
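The reason LoRA adapters stay small is the underlying math: instead of fine-tuning a full weight matrix W, you learn a low-rank update B·A (B is d_out×r, A is r×d_in, with rank r much smaller than either dimension) and merge it as W' = W + α·(B·A). A plain-Python sketch of that merge, using toy matrices rather than real model weights (`apply_lora` and `matmul` are illustrative names, not Dibtrun APIs):

```python
def matmul(a, b):
    # Plain-Python matrix multiply, sufficient for this small sketch.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def apply_lora(W, A, B, alpha=1.0):
    """Merge a LoRA adapter into a weight matrix: W' = W + alpha * (B @ A).

    B has shape (d_out, r) and A has shape (r, d_in), so the adapter
    stores d_out*r + r*d_in parameters instead of d_out*d_in.
    """
    BA = matmul(B, A)
    return [[w + alpha * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Rank-1 example: a 2x2 base matrix updated by a (2x1)·(1x2) adapter.
W = [[1, 0], [0, 1]]
B = [[1], [2]]
A = [[3, 4]]
merged = apply_lora(W, A, B)  # W + B@A, where B@A = [[3, 4], [6, 8]]
```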