# LogitsProcessor

## Overview
LogitsProcessor is the preprocessing chain that constrains/masks logits before sampling, enabling hard-constrained decoding or special modes (e.g., think mode). In RTP-LLM, LogitsProcessors work together with the Sampler: they first mask the vocabulary distribution, then regular sampling strategies (top-k, top-p, temperature, etc.) are applied.
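As a rough mental model, the masking-then-sampling pipeline can be sketched as follows. This is a conceptual illustration only; the `LogitsProcessor` interface and `runChain` shown here are hypothetical names, not RTP-LLM's actual C++ API.

```cpp
#include <vector>

// Conceptual sketch only: names are illustrative, not RTP-LLM's real interface.
struct LogitsProcessor {
    virtual ~LogitsProcessor() = default;
    // Mask or rescale the logits of one sequence in place.
    virtual void process(std::vector<float>& logits) = 0;
};

// The chain runs every registered processor; the Sampler then applies the
// regular strategies (top-k / top-p / temperature) to the masked distribution.
void runChain(const std::vector<LogitsProcessor*>& chain,
              std::vector<float>& logits) {
    for (auto* p : chain) {
        p->process(logits);
    }
    // ... hand the masked logits to the Sampler
}
```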
## Built-in processors
- **MultiSeqLogitsProcessor**: For finished sequences, keeps only the `eos_token` valid to prevent further generation.
- **ThinkModeLogitsProcessor**: Think-mode control. Combined with templates/end markers, it constrains tokens within the "thinking" segment by setting `max_thinking_tokens` (internally uses a string-contain DFA).
- **TreeLogitsProcessor** (highlight): Enforces hard decoding constraints based on a prefix → candidate-token-set mapping (tree-structured DFA). Only tokens within the candidate set are allowed.
## Tree Decode: How it works
Tree Decode maintains a tree-shaped DFA (`TreeDFA<std::string, int>`) at runtime, driven by a mapping of "prefix → candidate tokens":
- **State representation**: a string key concatenating the start token id and the generated token ids with a separator, e.g., `"225_64000_64001"`.
- **Candidate set**: for the current status, look up the allowed candidate tokens from the mapping; the logits processor builds a vocabulary mask from this set and sets non-candidate positions to `-inf`.
- **Fallback**: if the mapping does not contain the current key, only `end_token_id` is allowed (forcing early termination), which keeps the constraints safe and valid (see the lookup sketch below).
- **Beam/multi-seq**: supports beam search and `num_return_sequences`. Each beam/sequence maintains its own DFA state and updates it after each generated token.
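The lookup-with-fallback behavior can be sketched as follows. This is a minimal illustration under the conventions above; `PrefixDict` and `candidatesFor` are hypothetical names, not actual RTP-LLM symbols.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using PrefixDict = std::unordered_map<std::string, std::vector<int>>;

// Return the allowed candidate tokens for the current DFA status.
// Unknown prefixes fall back to {end_token_id}, forcing termination.
std::vector<int> candidatesFor(const PrefixDict& prefix_dict,
                               const std::string& status,
                               int end_token_id) {
    auto it = prefix_dict.find(status);
    if (it == prefix_dict.end()) {
        return {end_token_id};  // fallback: force early termination
    }
    return it->second;
}
```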
Simplified steps:
1. Initialize the DFA: `status = str(start_token_id)`;
2. At each step, get the candidates for `status` → build a mask → set non-candidates to `-inf` (see the masking sketch below);
3. After sampling a new token `t`, update the DFA: `status = status + sep + str(t)`;
4. If no candidates exist for the new status, fall back to `[end_token_id]`.
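A minimal sketch of the masking in step 2, illustrative only; `maskNonCandidates` is a hypothetical helper, not the actual implementation:

```cpp
#include <limits>
#include <string>
#include <vector>

// Keep only the candidate tokens sampleable: every non-candidate
// position is set to -inf so top-k/top-p can never select it.
void maskNonCandidates(std::vector<float>& logits,
                       const std::vector<int>& candidates) {
    std::vector<float> masked(logits.size(),
                              -std::numeric_limits<float>::infinity());
    for (int t : candidates) {
        masked[t] = logits[t];  // preserve the original logit
    }
    logits.swap(masked);
}

// After sampling token t, the DFA state is advanced as in step 3:
//   status += sep + std::to_string(t);
```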
Note: Ensure that the token id of the last input token is the root of the tree. If using a prompt string, make sure the prompt's last token is the tree's root; if using messages, this must be guaranteed via the chat template.
## JSON configuration format
The prefix mapping for Tree Decode is provided via a JSON file:
```json
{
    "start_token_id": 225,
    "end_token_id": 2,
    "sep": "_",
    "prefix_dict": {
        "225_64000": [64001, 64002],
        "225_64000_64001": [2]
    }
}
```
- `start_token_id`: the start token id.
- `end_token_id`: the end token id.
- `sep`: separator used in keys; defaults to `_`.
- `prefix_dict`: each key is the concatenation of `start_token_id` and the already generated token ids; the value is the list of allowed candidate tokens under that prefix.
Note: Keys should start from `start_token_id`. For example, the key for the first-step candidates is `"{start}_{t0}"`, for the second step `"{start}_{t0}_{t1}"`, and so on.
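To make the key convention concrete, here is a tiny standalone program that builds keys the same way the DFA does, following the example config above (purely illustrative):

```cpp
#include <iostream>
#include <string>

int main() {
    const std::string sep = "_";
    std::string status = "225";             // start_token_id
    for (int t : {64000, 64001}) {          // one decode path from the example
        status += sep + std::to_string(t);  // append each generated token id
        std::cout << status << "\n";
    }
    // Prints:
    //   225_64000        -> candidates [64001, 64002]
    //   225_64000_64001  -> candidates [2] (end_token_id)
    return 0;
}
```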
## Enabling Tree Decode
Tree Decode is enabled when the model loads a valid config file (global effect; disabled if not configured):
- Service arg / environment variable (recommended):
  - Arg: `--tree_decode_config <file_name>`
  - Env: `TREE_DECODE_CONFIG=<file_name>`
  - Resolution rule: the file path is resolved as `<ckpt_path>/<file_name>`.
- C++ direct call (dev/test):
  `PrefixToCandidateTokens::instance()->reloadPrefixDict("/abs/path/to/tree.json");`
If loading succeeds, the log contains `PrefixToCandidateTokens load [path] successfully`. If no config is provided, or loading fails, Tree Decode is disabled.
## Design and best practices
- **Vocab alignment**: `start_token_id` / `end_token_id` must be valid ids for the current model.
- **Key convention**: keys start with `start_token_id`; the separator is controlled by `sep` (default `_`).
- **Performance**: keep candidate sets as small as possible to reduce masking overhead; build large mappings offline and load them per scenario when necessary.
- **Composition**: Tree Decode masks first, then regular sampling (top-k/top-p/temperature) applies; the two are compatible.