An experiment has developed a Rust-focused language model from scratch, using a custom byte-level architecture and a hybrid attention mechanism to balance generation quality against training and inference cost.
The architecture is a byte-level GPT-style decoder with 8 layers, 8 attention heads, and a vocabulary of 256 (one token per byte). Its distinguishing component is a HybridAttention mechanism that combines local windowed causal attention, a GRU-like recurrent state path, and a learned gate: the windowed path captures short-range syntax, while the recurrent path carries long-range dependencies without the quadratic cost of full attention.
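The article does not publish the implementation, but the description above maps onto a fairly direct PyTorch module. Below is a minimal sketch under my own assumptions: the window size, the use of `nn.GRU` for the recurrent path, and the sigmoid gating form are all illustrative choices, not the experiment's actual code.

```python
# Hedged sketch of the HybridAttention idea: local windowed causal attention
# blended with a GRU-like recurrent path via a learned gate. Window size,
# GRU choice, and gate form are assumptions, not the published design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, window: int = 128):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.window = n_heads, window
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)  # long-range path
        self.gate = nn.Linear(2 * d_model, d_model)             # learned mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        H, D = self.n_heads, C // self.n_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, H, D).transpose(1, 2)  # (B, H, T, D)
        k = k.view(B, T, H, D).transpose(1, 2)
        v = v.view(B, T, H, D).transpose(1, 2)

        # Banded causal mask: position t attends only to [t - window + 1, t].
        i = torch.arange(T, device=x.device)
        band = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < self.window)
        local = F.scaled_dot_product_attention(q, k, v, attn_mask=band)
        local = local.transpose(1, 2).reshape(B, T, C)

        # Recurrent summary of the full prefix; linear in sequence length.
        recurrent, _ = self.gru(x)

        # Per-position, per-channel sigmoid gate mixes the two paths.
        g = torch.sigmoid(self.gate(torch.cat([local, recurrent], dim=-1)))
        return self.proj(g * local + (1 - g) * recurrent)
```

With a window of width w, the attention cost scales as O(T·w) per layer rather than O(T²), which is where the claimed speedup over full attention would come from.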
To train the model, a corpus of Rust code was compiled from official documentation, compiler and standard-library code, and a curated selection of popular crates. Expanding the corpus to 173.5 million bytes by adding the top 500 crates yielded a significant improvement in results. Training used AdamW with a learning rate of 2e-4, weight decay of 0.1, and betas of (0.9, 0.95), over 30,000 steps with 1,000 warmup steps.
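Those hyperparameters translate directly into a PyTorch optimizer and schedule. The sketch below uses the reported numbers; the decay shape after warmup (cosine here) is my assumption, since the article only states the step counts.

```python
# Optimizer and LR schedule matching the reported settings. The cosine decay
# after linear warmup is an assumed (common) choice, not stated in the article.
import math
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    return torch.optim.AdamW(
        model.parameters(),
        lr=2e-4,            # peak learning rate from the write-up
        weight_decay=0.1,
        betas=(0.9, 0.95),
    )

def lr_scale(step: int, warmup: int = 1_000, total: int = 30_000) -> float:
    """Multiplier on the peak LR: linear warmup, then cosine decay (assumed)."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Usage:
# optimizer = build_optimizer(model)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)
```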
The model reached a final training loss of 0.5834, a validation loss of 0.8217, and a perplexity of 2.15. The HybridAttention mechanism also delivered a 51.47x speedup over full attention with no reported loss in quality. Sample generations produce plausible Rust syntax, imports, and function signatures.
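Because the vocabulary is just the 256 byte values, sampling from such a model needs no tokenizer: prompts are raw bytes in, and generated bytes decode straight to source text. A minimal loop might look like the following, where `model` returning per-byte logits of shape (1, T, 256) and the temperature value are placeholder assumptions, not the experiment's actual interface.

```python
# Byte-level sampling sketch; `model(ids)` is assumed to return logits of
# shape (batch, seq, 256). Temperature and max_new are illustrative values.
import torch

@torch.no_grad()
def generate(model, prompt: bytes, max_new: int = 256,
             temperature: float = 0.8) -> bytes:
    ids = torch.tensor(list(prompt), dtype=torch.long)[None, :]  # (1, T)
    for _ in range(max_new):
        logits = model(ids)[:, -1, :]                  # next-byte logits
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return bytes(ids[0].tolist())

# e.g. print(generate(model, b"fn main() {").decode("utf-8", errors="replace"))
```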
