Title

Benchmarking Large Language Models for Automated Verilog RTL Code Generation

Authors

Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt, Siddharth Garg

Abstract

Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.
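As a rough illustration of the evaluation flow the abstract describes, namely a syntax check followed by a testbench-based functional check, the sketch below compiles one hypothetical generated completion with the open-source Icarus Verilog toolchain (`iverilog`/`vvp`) and runs it against a toy testbench. The AND-gate module, the testbench, and the pass/fail conventions are illustrative assumptions only; the paper's actual problem set, testbenches, and scoring scripts are in the linked VGen repository.

```python
# Minimal sketch of a syntax + functional check for one LLM-generated
# Verilog completion, assuming Icarus Verilog (iverilog, vvp) is installed.
# The candidate module and testbench below are toy examples, not the
# paper's benchmark problems.
import subprocess
import tempfile
from pathlib import Path

CANDIDATE = """
module and_gate(input a, input b, output y);
  assign y = a & b;
endmodule
"""

TESTBENCH = """
module tb;
  reg a, b; wire y;
  and_gate dut(.a(a), .b(b), .y(y));
  initial begin
    a = 1; b = 1; #1;
    if (y !== 1'b1) $display("FAIL"); else $display("PASS");
    $finish;
  end
endmodule
"""

def check(candidate: str, testbench: str) -> str:
    """Return 'syntax_error', 'func_fail', or 'pass' for one completion."""
    with tempfile.TemporaryDirectory() as d:
        dut = Path(d) / "dut.v"
        tb = Path(d) / "tb.v"
        sim = Path(d) / "sim.vvp"
        dut.write_text(candidate)
        tb.write_text(testbench)
        # Syntax/elaboration check: does iverilog accept the generated code?
        compiled = subprocess.run(
            ["iverilog", "-o", str(sim), str(dut), str(tb)],
            capture_output=True, text=True)
        if compiled.returncode != 0:
            return "syntax_error"
        # Functional check: simulate the design against the testbench.
        ran = subprocess.run(["vvp", str(sim)], capture_output=True, text=True)
        if ran.returncode != 0 or "PASS" not in ran.stdout or "FAIL" in ran.stdout:
            return "func_fail"
        return "pass"

if __name__ == "__main__":
    print(check(CANDIDATE, TESTBENCH))
```

In a benchmark setting, a harness of this shape would be run over many sampled completions per problem, with the syntax-pass and testbench-pass rates aggregated into the kind of overall percentages the abstract reports.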
