Verilog · 2026-06-16 · FPGA · RTL · BRAM · memory
Low-resource dual-port BRAM (Verilog)
A simple dual-port RAM (one write port, one read port, separate addresses) is the cheapest way to get a real block RAM out of the synthesizer. The trick is to write the inference template the tools recognise: one clock per port (a single shared clock here), no reset on the array, and a registered read. Get those right and an array that fits in one primitive maps to a single BRAM with no LUTs spent on storage; get them wrong (reset the array, or read combinationally) and it falls back to flip-flops or distributed RAM, and the resource count explodes.
module bram_sdp #(
    parameter integer DATA_WIDTH = 8,
    parameter integer ADDR_WIDTH = 10   // 2**ADDR_WIDTH words
)(
    input  wire                  clk,
    // write port
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] wr_addr,
    input  wire [DATA_WIDTH-1:0] din,
    // read port (registered output, one-cycle latency)
    input  wire [ADDR_WIDTH-1:0] rd_addr,
    output reg  [DATA_WIDTH-1:0] dout
);

    reg [DATA_WIDTH-1:0] mem [0:(1 << ADDR_WIDTH) - 1];

    always @(posedge clk) begin
        if (we)
            mem[wr_addr] <= din;
        dout <= mem[rd_addr];   // read-before-write on an address clash
    end

endmodule
Why it infers a BRAM (and stays cheap)
- No reset on `mem`. Block RAM contents can’t be reset by logic, only initialised. The moment you clear the array under reset (`if (!rst_n)` followed by a loop that zeroes every word of `mem`), the tool gives up on BRAM and builds a flip-flop array. Leave the array alone; reset `dout` downstream if you must.
- Registered read. `dout <= mem[rd_addr]` matches the BRAM’s synchronous read port, so the one-cycle latency is free and the read path stays fast (high Fmax). A combinational `assign dout = mem[rd_addr];` forces distributed LUT-RAM instead.
- One write port. A single write port is what keeps it a simple dual-port RAM that fits one BRAM. Two write ports means true dual-port: still one BRAM on most families, but watch the same-address write-write rules.
- Read-during-write. When `wr_addr == rd_addr` and `we` is high, both assignments are non-blocking, so `dout` captures the old contents (read-before-write). If you need the new data on the same cycle, forward `din` to `dout` with a small bypass register and an output mux.
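The forwarding mentioned in the read-during-write point can be sketched as below. This is a minimal sketch, not part of the original module: the name `bram_sdp_wf` and the helper registers `ram_q`, `din_q` and `hit_q` are illustrative. It leaves the memory template untouched and places a one-word bypass beside it, so the array should still infer as BRAM, while the bypass itself costs roughly `DATA_WIDTH + 1` flip-flops, an address comparator and a mux.

```verilog
// Sketch: simple dual-port RAM with write-first forwarding on an address clash.
// bram_sdp_wf, ram_q, din_q and hit_q are illustrative names (assumptions).
module bram_sdp_wf #(
    parameter integer DATA_WIDTH = 8,
    parameter integer ADDR_WIDTH = 10
)(
    input  wire                  clk,
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] wr_addr,
    input  wire [DATA_WIDTH-1:0] din,
    input  wire [ADDR_WIDTH-1:0] rd_addr,
    output wire [DATA_WIDTH-1:0] dout
);

    reg [DATA_WIDTH-1:0] mem [0:(1 << ADDR_WIDTH) - 1];
    reg [DATA_WIDTH-1:0] ram_q;   // raw RAM read data (old contents on a clash)
    reg [DATA_WIDTH-1:0] din_q;   // write data, delayed by one cycle
    reg                  hit_q;   // 1 if the previous edge wrote the address it read

    // Same inference template as bram_sdp: no reset, registered read.
    always @(posedge clk) begin
        if (we)
            mem[wr_addr] <= din;
        ram_q <= mem[rd_addr];
    end

    // Bypass path: remember the write data and whether the addresses clashed.
    always @(posedge clk) begin
        din_q <= din;
        hit_q <= we && (wr_addr == rd_addr);
    end

    // On a clash return the freshly written word, otherwise the RAM output.
    assign dout = hit_q ? din_q : ram_q;

endmodule
```

Read latency stays at one cycle; only a same-address read and write in the same cycle takes the bypass path. The mux sits after the RAM output, so it adds a little delay to whatever consumes `dout`.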
Testbench (self-checking)
Write a few addresses, read them back through the one-cycle pipeline, compare. Build and run with `iverilog -g2012 -o sim design.v tb.v && vvp sim`.
`timescale 1ns/1ps

module tb;

    localparam integer DW = 8;
    localparam integer AW = 4;   // 16 words

    reg           clk = 0;
    reg           we  = 0;
    reg  [AW-1:0] wr_addr = 0, rd_addr = 0;
    reg  [DW-1:0] din = 0;
    wire [DW-1:0] dout;

    bram_sdp #(
        .DATA_WIDTH (DW),
        .ADDR_WIDTH (AW)
    ) dut (
        .clk     (clk),
        .we      (we),
        .wr_addr (wr_addr),
        .din     (din),
        .rd_addr (rd_addr),
        .dout    (dout)
    );

    always #5 clk = ~clk;

    integer pass = 0, fail = 0;
    reg [DW-1:0] expmem [0:15];

    task write (input [AW-1:0] a, input [DW-1:0] d);
        begin
            @(posedge clk);
            we <= 1'b1; wr_addr <= a; din <= d;
            @(posedge clk);
            we <= 1'b0;
            expmem[a] = d;
        end
    endtask

    task check (input [AW-1:0] a);
        begin
            @(posedge clk);
            rd_addr <= a;
            @(posedge clk);       // dut latches mem[a] into dout
            @(posedge clk); #1;   // dout valid, sample it
            if (dout === expmem[a]) begin
                pass = pass + 1;
                $display(" PASS addr=%0d data=0x%02h", a, dout);
            end
            else begin
                fail = fail + 1;
                $display(" FAIL addr=%0d exp=0x%02h got=0x%02h", a, expmem[a], dout);
            end
        end
    endtask

    initial begin
        write(4'h0, 8'hDE); write(4'h1, 8'hAD);
        write(4'h5, 8'hBE); write(4'hF, 8'hEF);
        check(4'h0); check(4'h1); check(4'h5); check(4'hF);
        $display(" ==== %0d passed, %0d failed ====", pass, fail);
        $finish;
    end

endmodule
PASS addr=0 data=0xde
PASS addr=1 data=0xad
PASS addr=5 data=0xbe
PASS addr=15 data=0xef
==== 4 passed, 0 failed ====
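The testbench above never reads and writes the same address on the same edge, so it does not exercise the read-before-write behaviour. A minimal extra check might look like the sketch below. The module name `tb_rdw` and the file name `tb_rdw.v` are assumptions; it expects to be compiled together with the `bram_sdp` module, for example with `iverilog -g2012 -o sim_rdw design.v tb_rdw.v && vvp sim_rdw`.

```verilog
`timescale 1ns/1ps

// Sketch: checks that a same-address read and write returns the OLD word.
// tb_rdw is an illustrative name; bram_sdp comes from design.v.
module tb_rdw;

    reg        clk = 0;
    reg        we  = 0;
    reg  [3:0] wr_addr = 0, rd_addr = 0;
    reg  [7:0] din = 0;
    wire [7:0] dout;

    bram_sdp #(
        .DATA_WIDTH (8),
        .ADDR_WIDTH (4)
    ) dut (
        .clk     (clk),
        .we      (we),
        .wr_addr (wr_addr),
        .din     (din),
        .rd_addr (rd_addr),
        .dout    (dout)
    );

    always #5 clk = ~clk;

    initial begin
        // Set up a write of the "old" value 0x11 to address 3.
        @(posedge clk);
        we <= 1'b1; wr_addr <= 4'h3; din <= 8'h11;

        // This edge stores 0x11. Then set up the clash: write 0x22 to
        // address 3 while reading address 3.
        @(posedge clk);
        din <= 8'h22; rd_addr <= 4'h3;

        // Clash edge: the write of 0x22 and the read happen together.
        @(posedge clk);
        we <= 1'b0;
        #1;
        if (dout === 8'h11) $display("PASS clash returned old data 0x%02h", dout);
        else                $display("FAIL clash exp=0x11 got=0x%02h", dout);

        // One edge later the same address returns the new word.
        @(posedge clk); #1;
        if (dout === 8'h22) $display("PASS next read returned new data 0x%02h", dout);
        else                $display("FAIL next read exp=0x22 got=0x%02h", dout);

        $finish;
    end

endmodule
```

This only confirms the simulation semantics of the RTL. Whether the synthesized BRAM behaves the same way on a clash depends on the device family and the mode the tool selects, so it is worth checking the synthesis report as well.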
Usage
- Size it with `DATA_WIDTH` / `ADDR_WIDTH`; the depth is `2**ADDR_WIDTH`. Keep the total a sensible fraction of a BRAM primitive (e.g. 18 Kb / 36 Kb on Xilinx) so it packs cleanly; the defaults, 8 bits x 1024 words = 8 Kb, fit inside one 18 Kb block.
- Two clock domains? Split the body into two `always` blocks (write on `wr_clk`, read on `rd_clk`) and the same array infers a dual-clock BRAM, the core of an async FIFO (mind the CDC on the pointers).
- Preload contents with `$readmemh("init.hex", mem);` in an `initial` block; the tools fold it into the BRAM init values, still no logic.
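The last two points can be combined into one variant, sketched below under some assumptions: the module name `bram_sdp_2clk` and the `INIT_FILE` parameter are illustrative, and the empty-string test is a common idiom for an optional preload rather than something the original module provides.

```verilog
// Sketch: dual-clock simple dual-port RAM with an optional preload.
// bram_sdp_2clk and INIT_FILE are illustrative names (assumptions).
module bram_sdp_2clk #(
    parameter integer DATA_WIDTH = 8,
    parameter integer ADDR_WIDTH = 10,
    parameter         INIT_FILE  = ""   // hex file to preload; empty = none
)(
    // write port
    input  wire                  wr_clk,
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] wr_addr,
    input  wire [DATA_WIDTH-1:0] din,
    // read port (registered output, one rd_clk cycle of latency)
    input  wire                  rd_clk,
    input  wire [ADDR_WIDTH-1:0] rd_addr,
    output reg  [DATA_WIDTH-1:0] dout
);

    reg [DATA_WIDTH-1:0] mem [0:(1 << ADDR_WIDTH) - 1];

    // Initialisation, not a reset: runs once at time zero in simulation and
    // is expected to become BRAM init values in synthesis.
    initial begin
        if (INIT_FILE != "")
            $readmemh(INIT_FILE, mem);
    end

    // Write side, in the wr_clk domain.
    always @(posedge wr_clk) begin
        if (we)
            mem[wr_addr] <= din;
    end

    // Read side, in the rd_clk domain.
    always @(posedge rd_clk) begin
        dout <= mem[rd_addr];
    end

endmodule
```

With unrelated clocks, a read that lands on an address while it is being written has no guaranteed result, so the surrounding logic (for example the FIFO pointer comparison) has to keep the two ports apart. The hex file holds one word per line by default, starting at address 0.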