Verilog · 2026-06-16 · FPGA · RTL · BRAM · memory

Low-resource dual-port BRAM (Verilog)

A simple dual-port RAM — one write port, one read port, separate addresses — is the cheapest way to get a real block RAM out of the synthesizer. The trick is to write the inference template the tools recognise: a single clock, no reset on the array, and a registered read. Get those right and the whole thing maps to one BRAM primitive with zero LUTs; get them wrong (reset the array, or read combinationally) and it falls back to flip-flops or distributed RAM and the resource count explodes.

module bram_sdp #(
  parameter integer DATA_WIDTH = 8,
  parameter integer ADDR_WIDTH = 10        // 2**ADDR_WIDTH words
)(
  input  wire                  clk,

  // write port
  input  wire                  we,
  input  wire [ADDR_WIDTH-1:0] wr_addr,
  input  wire [DATA_WIDTH-1:0] din,

  // read port (registered output, one-cycle latency)
  input  wire [ADDR_WIDTH-1:0] rd_addr,
  output reg  [DATA_WIDTH-1:0] dout
);
  reg [DATA_WIDTH-1:0] mem [0:(1 << ADDR_WIDTH) - 1];

  always @(posedge clk) begin
    if (we)
      mem[wr_addr] <= din;
    dout <= mem[rd_addr];          // read-before-write on an address clash
  end
endmodule

Why it infers a BRAM (and stays cheap)

  • No reset on mem. Block RAM contents can’t be reset by logic, only initialised. The moment you add a reset branch that clears the array (if (!rst_n) followed by a loop that zeroes every word), the tool gives up on BRAM and builds a flip-flop array. Leave the array alone; reset dout downstream if you must.
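  • Registered read. Block RAM read ports are synchronous, so dout <= mem[rd_addr] maps directly onto the primitive and the one-cycle latency costs no fabric registers. A combinational assign dout = mem[rd_addr]; has no BRAM equivalent and forces distributed LUT-RAM (or flip-flops) instead. If timing is tight, a second register stage after dout can usually be absorbed into the BRAM’s optional output register on families that have one, at the cost of one more cycle of latency.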
  • One write port. A single write port is what keeps it a simple dual-port RAM that fits one BRAM. Two write ports means true dual-port: still one BRAM on most families, but watch the same-address write-write rules.
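  • Read-during-write. When wr_addr == rd_addr and we is high, both assignments are non-blocking, so in simulation dout captures the old contents (read-before-write). Synthesis normally preserves this by picking a read-first mode where the primitive offers one; check the synthesis log if the collision behaviour matters to you. If you need the new data instead, register din together with an address-match flag and mux the forwarded word onto dout.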
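
The forwarding described in the read-during-write bullet can be sketched as a thin wrapper around bram_sdp. This is a minimal sketch under assumptions, not the only way to do it: the module name bram_sdp_wf and the internal names ram_dout, hit_q and din_q are hypothetical. It registers din and an address-match flag on the same edge as the read, then muxes the forwarded word onto dout, so a clash returns the new data at the usual one-cycle latency.

```verilog
// Hypothetical wrapper (not part of the original module): same ports as
// bram_sdp, but a same-address read-during-write returns the NEW data.
module bram_sdp_wf #(
  parameter integer DATA_WIDTH = 8,
  parameter integer ADDR_WIDTH = 10
)(
  input  wire                  clk,
  input  wire                  we,
  input  wire [ADDR_WIDTH-1:0] wr_addr,
  input  wire [DATA_WIDTH-1:0] din,
  input  wire [ADDR_WIDTH-1:0] rd_addr,
  output wire [DATA_WIDTH-1:0] dout
);
  wire [DATA_WIDTH-1:0] ram_dout;   // old-data read from the BRAM
  reg  [DATA_WIDTH-1:0] din_q;      // write data delayed by one cycle
  reg                   hit_q;      // 1 when the previous edge was a clash

  bram_sdp #(
    .DATA_WIDTH (DATA_WIDTH),
    .ADDR_WIDTH (ADDR_WIDTH)
  ) ram (
    .clk     (clk),
    .we      (we),
    .wr_addr (wr_addr),
    .din     (din),
    .rd_addr (rd_addr),
    .dout    (ram_dout)
  );

  // Capture the clash condition and the written word on the same edge
  // on which the BRAM captures its (old) read data.
  always @(posedge clk) begin
    hit_q <= we && (wr_addr == rd_addr);
    din_q <= din;
  end

  // On a clash, forward the freshly written word instead of the old one.
  assign dout = hit_q ? din_q : ram_dout;
endmodule
```

The bypass costs an address comparator, DATA_WIDTH + 1 flip-flops and an output mux, so the wrapper is no longer zero-LUT; it is only worth adding where write-first behaviour is actually required.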

Testbench (self-checking)

Write a few addresses, read them back through the one-cycle pipeline, compare. Build and run with iverilog -g2012 -o sim design.v tb.v && vvp sim.

`timescale 1ns/1ps
module tb;
  localparam integer DW = 8;
  localparam integer AW = 4;          // 16 words

  reg           clk = 0;
  reg           we = 0;
  reg  [AW-1:0] wr_addr = 0, rd_addr = 0;
  reg  [DW-1:0] din = 0;
  wire [DW-1:0] dout;

  bram_sdp #(
    .DATA_WIDTH (DW),
    .ADDR_WIDTH (AW)
  ) dut (
    .clk     (clk),
    .we      (we),
    .wr_addr (wr_addr),
    .din     (din),
    .rd_addr (rd_addr),
    .dout    (dout)
  );

  always #5 clk = ~clk;

  integer pass = 0, fail = 0;
  reg [DW-1:0] expmem [0:(1 << AW) - 1];   // expected contents, sized from AW

  task write (input [AW-1:0] a, input [DW-1:0] d);
    begin
      @(posedge clk);
      we <= 1'b1; wr_addr <= a; din <= d;
      @(posedge clk);
      we <= 1'b0;
      expmem[a] = d;
    end
  endtask

  task check (input [AW-1:0] a);
    begin
      @(posedge clk);
      rd_addr <= a;
      @(posedge clk); #1;      // dut latches mem[a] into dout on this edge; sample just after
      if (dout === expmem[a]) begin
        pass = pass + 1;
        $display("  PASS  addr=%0d  data=0x%02h", a, dout);
      end
      else begin
        fail = fail + 1;
        $display("  FAIL  addr=%0d  exp=0x%02h  got=0x%02h", a, expmem[a], dout);
      end
    end
  endtask

  initial begin
    write(4'h0, 8'hDE); write(4'h1, 8'hAD);
    write(4'h5, 8'hBE); write(4'hF, 8'hEF);
    check(4'h0); check(4'h1); check(4'h5); check(4'hF);
    $display("  ==== %0d passed, %0d failed ====", pass, fail);
    $finish;
  end
endmodule

Expected output:

  PASS  addr=0  data=0xde
  PASS  addr=1  data=0xad
  PASS  addr=5  data=0xbe
  PASS  addr=15  data=0xef
  ==== 4 passed, 0 failed ====
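
The read-before-write behaviour on an address clash can be checked in simulation with a second, smaller testbench. This is a sketch that assumes the bram_sdp module above; the module name tb_clash, the address and the data values are made up for illustration. It writes 0x11 to address 3, then overwrites it with 0x22 while reading the same address, and expects the old word right after the clash edge and the new word one edge later.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench (not part of the original): exercises a
// same-address read-during-write on bram_sdp.
module tb_clash;
  reg        clk = 0;
  reg        we = 0;
  reg  [3:0] wr_addr = 0, rd_addr = 0;
  reg  [7:0] din = 0;
  wire [7:0] dout;

  bram_sdp #(
    .DATA_WIDTH (8),
    .ADDR_WIDTH (4)
  ) dut (
    .clk     (clk),
    .we      (we),
    .wr_addr (wr_addr),
    .din     (din),
    .rd_addr (rd_addr),
    .dout    (dout)
  );

  always #5 clk = ~clk;

  initial begin
    // Drive a plain write of the old word 0x11 to address 3.
    @(posedge clk);
    we <= 1'b1; wr_addr <= 4'h3; din <= 8'h11;

    // Next cycle: drive a write of 0x22 and a read, both at address 3.
    @(posedge clk);
    din <= 8'h22; rd_addr <= 4'h3;

    // Clash edge: the dut writes 0x22 and reads address 3 together.
    @(posedge clk); #1;
    if (dout === 8'h11) $display("  PASS  clash returned old word 0x%02h", dout);
    else                $display("  FAIL  clash returned 0x%02h, expected 0x11", dout);
    we <= 1'b0;

    // One edge later the read sees the word written during the clash.
    @(posedge clk); #1;
    if (dout === 8'h22) $display("  PASS  next read returned new word 0x%02h", dout);
    else                $display("  FAIL  next read returned 0x%02h, expected 0x22", dout);
    $finish;
  end
endmodule
```

This only confirms what the RTL does in simulation. Whether the mapped primitive behaves the same way on a given family is something the synthesis report, not the testbench, has to confirm.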

Usage

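  • Size it with DATA_WIDTH / ADDR_WIDTH; the depth is 2**ADDR_WIDTH words, so the array holds DATA_WIDTH × 2**ADDR_WIDTH bits. Keep that total within one BRAM primitive (e.g. 18 Kb / 36 Kb on Xilinx), or close to a whole multiple of one, so it packs cleanly. The default 8 × 1024 = 8 Kb fits in a single 18 Kb block.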
  • Two clock domains? Replace clk with separate wr_clk and rd_clk ports and split the body into two always blocks, write on wr_clk and read on rd_clk. The same array then infers a dual-clock BRAM, the core of an async FIFO (mind the CDC on the pointers).
  • Preload contents with $readmemh("init.hex", mem); in an initial block — the tools fold it into the BRAM init values, still no logic.
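
The last two bullets combine naturally into one module. The following is a sketch under assumptions rather than a drop-in replacement: the module name bram_sdp_2clk and the INIT_FILE parameter are hypothetical, and the generate-if guard around $readmemh is one common way to make the preload optional. An empty INIT_FILE skips the preload entirely.

```verilog
// Hypothetical dual-clock variant of bram_sdp with an optional preload.
module bram_sdp_2clk #(
  parameter integer DATA_WIDTH = 8,
  parameter integer ADDR_WIDTH = 10,       // 2**ADDR_WIDTH words
  parameter         INIT_FILE  = ""        // hex file for $readmemh; "" = none
)(
  // write port, clocked by wr_clk
  input  wire                  wr_clk,
  input  wire                  we,
  input  wire [ADDR_WIDTH-1:0] wr_addr,
  input  wire [DATA_WIDTH-1:0] din,

  // read port, clocked by rd_clk (registered, one rd_clk cycle of latency)
  input  wire                  rd_clk,
  input  wire [ADDR_WIDTH-1:0] rd_addr,
  output reg  [DATA_WIDTH-1:0] dout
);
  reg [DATA_WIDTH-1:0] mem [0:(1 << ADDR_WIDTH) - 1];

  // Preload is initialisation, not a reset, so it does not block BRAM
  // inference on flows that support $readmemh for memories.
  generate
    if (INIT_FILE != "") begin : g_preload
      initial $readmemh(INIT_FILE, mem);
    end
  endgenerate

  // Write side: its own always block on wr_clk.
  always @(posedge wr_clk)
    if (we)
      mem[wr_addr] <= din;

  // Read side: its own always block on rd_clk.
  always @(posedge rd_clk)
    dout <= mem[rd_addr];
endmodule
```

With two unrelated clocks, a read and a write that reach the same address at nearly the same time generally have no guaranteed result in hardware, which is one reason an async FIFO keeps its read and write pointers apart and synchronises them across the two domains.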