Shader Authoring and Dispatch

vkdispatch lets you write compute logic in Python syntax and compile it to GLSL at runtime. This page covers shader launch patterns and the key semantics of vkdispatch’s runtime shader generation model.

Examples below omit vd.initialize() and vd.make_context() because vkdispatch performs both automatically on first runtime use. That default path is intentional: generated shaders are specialized against the current machine/runtime unless you explicitly choose dummy-mode codegen.

Runtime Generation Model

@vd.shader executes your Python function with tracing objects and emits shader code as each operation runs. In practice:

  1. vkdispatch inspects type-annotated arguments and creates shader variables.

  2. arithmetic, indexing, swizzles, and assignment append GLSL statements.

  3. the generated source is compiled into a compute plan and then dispatched.

This is different from AST/IR compilers: generation is a forward streaming model, so explicit register materialization and explicit shader control-flow helpers matter for both performance and correctness.
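The three steps above can be sketched with a toy tracer. This is an illustration of the forward streaming model only; none of these class or variable names are part of vkdispatch. The point is that each Python operation appends a statement to the program the moment it executes.

```python
# Illustrative only: a toy tracer mimicking forward-streaming codegen.
# Each arithmetic operation immediately emits a statement; nothing is
# held back for a later optimization pass.

class TracedVar:
    def __init__(self, name, program):
        self.name = name
        self.program = program

    def __add__(self, other):
        # Emit one statement per operation as it runs.
        tmp = f"t{len(self.program)}"
        self.program.append(f"float {tmp} = {self.name} + {other};")
        return TracedVar(tmp, self.program)

program = []
x = TracedVar("x", program)
y = (x + 1.0) + 2.0   # running the Python code builds the source
print("\n".join(program))
```

Because statements are emitted in execution order, there is no whole-program view to deduplicate repeated subexpressions, which is exactly why the register helpers described later exist.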

Default Runtime-Coupled Generation

By default, vkdispatch generates shaders for the active runtime backend and uses that runtime’s limits when choosing implicit launch defaults such as local_size.

This is the normal mode for end-to-end execution:

  1. define the kernel with @vd.shader

  2. let vkdispatch auto-initialize or call vd.initialize(...) yourself

  3. execute the shader or inspect get_src() for the current machine

If you want controlled source generation without relying on the active runtime, use the dummy backend explicitly.

Imports and Type Annotations

Most shader examples use these imports:

import vkdispatch as vd
import vkdispatch.codegen as vc
from vkdispatch.codegen.abbreviations import *
  • Buff[...] is a shader buffer argument type.

  • Const[...] is a uniform/constant argument type.

  • Dtype aliases such as f32, i32, and v2 come from abbreviations.

Basic In-Place Kernel

import numpy as np
import vkdispatch as vd
import vkdispatch.codegen as vc
from vkdispatch.codegen.abbreviations import *

# @vd.shader(exec_size=lambda args: args.buff.size)
@vd.shader("buff.size")
def add_scalar(buff: Buff[f32], bias: Const[f32]):
    tid = vc.global_invocation_id().x
    buff[tid] = buff[tid] + bias

arr = np.arange(32, dtype=np.float32)
buff = vd.asbuffer(arr)
add_scalar(buff, 1.5)

result = buff.read(0)
print(result[:4])  # [1.5 2.5 3.5 4.5]
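The dispatch above is purely element-wise, so its result can be checked against a one-line NumPy reference on the CPU (this is a verification sketch, not part of the vkdispatch API):

```python
import numpy as np

# CPU reference for the add_scalar kernel: each thread adds the same
# scalar to one element, so the whole dispatch is equivalent to a
# single vectorized NumPy add.
arr = np.arange(32, dtype=np.float32)
expected = arr + np.float32(1.5)
print(expected[:4])  # [1.5 2.5 3.5 4.5]
```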

Launch Configuration

Use one of these launch patterns:

  • String expression (evaluated from function argument names):

    # @vd.shader(exec_size=lambda args: args.in_buf.size)
    @vd.shader("in_buf.size")
    def kernel(in_buf: Buff[f32], out_buf: Buff[f32]):
        ...
    
  • Fixed total dispatch size:

    @vd.shader(exec_size=(1024, 1, 1))
    def kernel(...):
        ...
    
  • Dynamic size from call arguments:

    @vd.shader(exec_size=lambda args: args.in_buf.size)
    def kernel(in_buf: Buff[f32], out_buf: Buff[f32]):
        ...
    
  • Explicit workgroups instead of exec_size:

    @vd.shader(workgroups=(64, 1, 1), local_size=(128, 1, 1))
    def kernel(...):
        ...
    

exec_size and workgroups are mutually exclusive. The string form is often the most concise option for argument-dependent dispatch sizes, but because it is evaluated dynamically from argument names, it is slightly more brittle than the lambda form. When you want the declaration itself to be explicit and deterministic, prefer exec_size=lambda args: ....
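The relationship between the two launch styles follows the usual compute-dispatch arithmetic: a total thread count (exec_size) is covered by ceiling-dividing each axis by the workgroup size. The helper below is an illustration of that arithmetic, not vkdispatch's actual default-selection code, whose policy may differ.

```python
import math

def workgroups_from_exec_size(exec_size, local_size):
    """Ceiling-divide total threads per axis by the workgroup size.

    Illustrative sketch of standard dispatch math; vkdispatch's own
    defaults (e.g. its choice of local_size) are runtime-dependent.
    """
    return tuple(math.ceil(e / l) for e, l in zip(exec_size, local_size))

print(workgroups_from_exec_size((1024, 1, 1), (128, 1, 1)))  # (8, 1, 1)
print(workgroups_from_exec_size((1000, 1, 1), (128, 1, 1)))  # (8, 1, 1)
```

Note the second case: a non-multiple exec_size still rounds up, so kernels indexed by global invocation id may need a bounds check against the true element count.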

You can also override launch parameters per call:

# Reuse the same compiled shader with different dispatch sizes.
add_scalar(buff, 1.5, exec_size=buff.size)

Symbolic Expressions vs Mutable Registers

vkdispatch variables are symbolic by default. Reusing an expression in multiple places inlines that expression each time in generated code.

To materialize a value once and mutate it, convert it to a register with to_register():

# @vd.shader(exec_size=lambda args: args.buff.size)
@vd.shader("buff.size")
def register_example(buff: Buff[f32]):
    tid = vc.global_invocation_id().x

    # Expression variable: may be inlined at each use.
    expr = vc.sin(tid * 0.1)

    # Register variable: emitted once, then reused.
    cached = expr.to_register("cached")

    buff[tid] = cached * 2.0 + cached / 3.0
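The cost difference can be illustrated with plain Python string emission (a toy model, not vkdispatch internals): a symbolic expression re-emits its text at every use site, while a register is emitted once and referenced by name afterwards.

```python
# Toy illustration of inlining vs. register materialization.
emitted = []

def use(expr_text):
    emitted.append(expr_text)

expr = "sin(tid * 0.1)"            # symbolic: inlined at each use
use(f"{expr} * 2.0")
use(f"{expr} / 3.0")
assert sum(e.count("sin") for e in emitted) == 2  # sin evaluated twice

emitted.clear()
emitted.append(f"float cached = {expr};")          # register: emitted once
use("cached * 2.0")
use("cached / 3.0")
assert sum(e.count("sin") for e in emitted) == 1  # sin evaluated once
```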

Register Store Syntax ([:])

Python assignment rebinding (x = ...) changes the Python name, not the generated shader register. To emit a GLSL assignment into an existing register, use full-slice store syntax x[:] = ....

# @vd.shader(exec_size=lambda args: args.buff.size)
@vd.shader("buff.size")
def register_store(buff: Buff[f32]):
    tid = vc.global_invocation_id().x
    value = buff[tid].to_register("value")
    value[:] = value * 0.5 + 1.0
    buff[tid] = value

Shader Control Flow vs Python Control Flow

Native Python control flow with vkdispatch variables is intentionally blocked:

# @vd.shader(exec_size=lambda args: args.buff.size)
@vd.shader("buff.size")
def bad_branch(buff: Buff[f32]):
    tid = vc.global_invocation_id().x
    if tid < 10:  # Raises ValueError: vkdispatch variables are not Python booleans.
        buff[tid] = 1.0

Use shader control-flow helpers so both branches are emitted into generated code:

# @vd.shader(exec_size=lambda args: args.buff.size)
@vd.shader("buff.size")
def threshold(buff: Buff[f32], cutoff: Const[f32]):
    tid = vc.global_invocation_id().x

    vc.if_statement(buff[tid] > cutoff)
    buff[tid] = 1.0
    vc.else_statement()
    buff[tid] = 0.0
    vc.end()
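Since both branches are emitted and each thread selects one per element, the threshold kernel behaves like an element-wise np.where. This CPU reference (not vkdispatch code) makes the expected output concrete:

```python
import numpy as np

# CPU reference for the threshold kernel: 1.0 where the value exceeds
# the cutoff, 0.0 otherwise.
buff = np.array([0.2, 0.9, 0.5, 1.0], dtype=np.float32)
cutoff = np.float32(0.5)
result = np.where(buff > cutoff, np.float32(1.0), np.float32(0.0))
print(result)  # [0. 1. 0. 1.]
```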

Generation-Time Specialization (Meta-Programming)

Because kernel bodies execute as normal Python during generation, Python loops and conditionals are useful for specialization and unrolling.

def make_unrolled_sum(unroll: int):
    # @vd.shader(exec_size=lambda args: args.dst.size)
    @vd.shader("dst.size")
    def unrolled_sum(src: Buff[f32], dst: Buff[f32]):
        tid = vc.global_invocation_id().x
        base = (tid * unroll).to_register("base")
        acc = vc.new_float_register(0.0)

        # Unrolled at generation time.
        for i in range(unroll):
            acc += src[base + i]

        dst[tid] = acc

    return unrolled_sum

sum4 = make_unrolled_sum(4)
sum8 = make_unrolled_sum(8)

# sum4 and sum8 compile to different shaders with different unrolled bodies.
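Why the two shaders differ can be seen with a stripped-down emitter (illustrative only, not vkdispatch's codegen): the Python for loop runs while the source is being built, so each unroll factor produces a different body.

```python
# Toy illustration of generation-time unrolling: the loop disappears
# from the generated source, leaving one statement per iteration.
def emit_unrolled_sum(unroll):
    lines = []
    for i in range(unroll):
        lines.append(f"acc += src[base + {i}];")
    return lines

print(len(emit_unrolled_sum(4)))  # 4 statements
print(len(emit_unrolled_sum(8)))  # 8 statements
```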

Mapping Functions

Mapping functions are reusable typed snippets (often used with reductions and FFT I/O).

@vd.map
def square_value(x: Buff[f32]) -> f32:
    idx = vd.reduce.mapped_io_index()
    return x[idx] * x[idx]

You can pass mapping functions into APIs that accept mapping_function, input_map, or output_map arguments.
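As a concrete check of what square_value computes when it feeds a sum reduction, the CPU equivalent is mapping each element to x*x before reducing (a NumPy sketch, not the vkdispatch reduction API):

```python
import numpy as np

# CPU reference: applying the square map before a sum reduction is
# equivalent to summing the squared array.
x = np.arange(4, dtype=np.float32)   # [0, 1, 2, 3]
print(np.sum(x * x))  # 14.0
```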

Inspecting Generated Shader Source

get_src() returns the generated source for the currently selected runtime/codegen configuration. In the default mode, that means the generated shader is tied to the current machine/runtime by design.

For explicit codegen-only workflows, initialize the dummy backend first and select the output backend you want:

import vkdispatch as vd
import vkdispatch.codegen as vc
from vkdispatch.codegen.abbreviations import Buff, Const, f32

vd.initialize(backend="dummy")
vd.set_dummy_context_params(
    subgroup_size=32,
    max_workgroup_size=(128, 1, 1),
    max_workgroup_count=(65535, 65535, 65535),
)
vc.set_codegen_backend("cuda")

# @vd.shader(exec_size=lambda args: args.buff.size)
@vd.shader("buff.size")
def add_scalar(buff: Buff[f32], bias: Const[f32]):
    tid = vc.global_invocation_id().x
    buff[tid] = buff[tid] + bias

print(add_scalar.get_src(line_numbers=True))

A built shader can be printed for debugging:

print(add_scalar)

This prints GLSL-like generated source with line numbers, which is useful when debugging type issues or unsupported expressions.

Common Notes

  • All shader parameters must be type annotated.

  • Buffer/image arguments must use codegen types (for example, Buff[f32], Img2[f32]).

  • If you need batched submissions, prefer Command Graph Recording.

Shader API Reference

See the Full Python API Reference for complete API details on:

  • vkdispatch.shader

  • vkdispatch.map

  • vkdispatch.ShaderFunction

  • vkdispatch.MappingFunction