Papers, experiments, and perspectives from the Bud research team on making AI radically simpler, more efficient, and accessible to everyone.
In this paper, we explore the utilization of CPUs for accelerating the inference of large language models.
Read the paper
We introduce Resource Aware Attention (RAA), a new attention mechanism designed from the ground up against the resource envelope of the deployment…
Read research
We present GPU-Virt-Bench, a comprehensive benchmarking framework that evaluates GPU virtualization systems across 56 performance metrics organized into 10 categories.
Read research
This method not only reduces the traffic to the cloud LLM, thereby lowering costs, but also allows for flexible control over response…
Read research
Comprising 11.53 billion tokens, which combine 8.01 billion tokens of synthetic data with 3.52 billion tokens of rich textbook data,…
Read research