• brucethemoose@lemmy.world
    link
    fedilink
    English
    arrow-up
    2
    ·
    edit-2
    2 days ago

    Depends on the quantization.

    7B is small enough to run it in FP8 or a Marlin quant with SGLang/VLLM/TensorRT, so you can probably get very close to the H20 on a 3090 or 4090 (or even a 3060) and you know a little Docker.