The problem is simple: consumer motherboards don’t have that many PCIe slots, and consumer CPUs don’t have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.
My idea was to buy 3-4 cheap computers, slot a GPU into each, and use them in tandem. I imagine this would require some sort of agent running on each node, with the nodes connected over a 10GbE network; I can get 10GbE running for this project.
Does Ollama or any other local AI project support this? Getting a server motherboard and CPU is going to get expensive very quickly, so this would be a great alternative.
Thanks
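For a rough sense of the lane math behind this, here's a minimal sketch; the lane counts are assumptions for a typical desktop platform, not figures from any specific CPU or board:

```python
# Back-of-envelope: why 3+ GPUs won't all get full-width links on a consumer
# platform. Lane counts below are assumptions for a typical desktop CPU
# (roughly 20-28 usable PCIe lanes), not measurements of any specific part.

LANES_PER_GPU_FULL_WIDTH = 16   # an x16 link per GPU at full width
CONSUMER_CPU_LANES = 24         # assumed usable CPU lanes on a desktop platform

for gpu_count in (2, 3, 4):
    needed = gpu_count * LANES_PER_GPU_FULL_WIDTH
    verdict = "fits" if needed <= CONSUMER_CPU_LANES else "must run at reduced width"
    print(f"{gpu_count} GPUs at x16 need {needed} lanes "
          f"vs ~{CONSUMER_CPU_LANES} available -> {verdict}")
```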
It's because modern consumer GPUs don't have enough VRAM to load 24B models. I want to run Mistral Small locally.
Maybe take a look at systems with the newer AMD SoCs first. They use system RAM and come with a proper NPU; once Ollama or mistral.rs support them, they might give you sufficient performance for your needs at much lower cost (including power consumption). Depending on how NPU support gets implemented, it might even become possible to use the NPU and GPU in tandem, which would probably allow pretty powerful models to run on consumer-grade hardware at reasonable speed.
Thanks, but will NPUs integrated into the CPU ever match the performance of a discrete GPU?
Depends on which GPU you compare it with, what model you use, what kind of RAM it has to work with, et cetera. NPUs are purpose-built chips, after all. Unfortunately the whole tech is still very young, so we'll have to wait for something like Ollama to introduce native support before there's an apples-to-apples comparison. The raw numbers do look promising, though.
It wouldn’t even matter. OP doesn’t understand how any of this works, and is instead just running rampant calling everything bullshit 😂
I’d prefer that you reply with examples/an explanation of what I’m doing wrong instead of cursing
Look at all your replies, chum. Everyone is already telling you.
This is false: Mistral Small 24B at Q4_K_M quantization is 15GB; Q8 is 26GB. A 3090/4090/5090 with 24GB, or two cards with 16GB each (I recommend the 4060 Ti 16GB), will run this model fine, and will work in a single computer. Like others have said, 10GbE will be a huge bottleneck, and it's simply not necessary to distribute a 24B model across multiple machines.
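To put rough numbers on that, here's a small sketch; the bits-per-weight values are approximations I'm assuming for those GGUF quant formats, not exact file sizes:

```python
# Approximate weight footprint of a 24B-parameter model at common GGUF
# quantizations, plus the raw throughput of a 10GbE link for comparison.
# Bits-per-weight values are assumed approximations, not exact.

PARAMS = 24e9  # ~24B parameters (e.g. Mistral Small)

bits_per_weight = {
    "Q4_K_M": 4.8,  # assumed effective bits/weight
    "Q8_0":   8.5,  # assumed effective bits/weight
}

for name, bpw in bits_per_weight.items():
    gigabytes = PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{gigabytes:.0f} GB of weights")
# Q4_K_M lands around 14-15 GB -> fits on a single 24 GB card with room for KV cache.
# Q8_0 lands around 26 GB -> needs two 16 GB cards or one 32 GB card.

# For comparison, a 10GbE link tops out at ~1.25 GB/s raw, several times
# slower than even a narrow PCIe Gen 4 x4 link (~7.9 GB/s).
print(f"10GbE: ~{10 / 8:.2f} GB/s")
```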
Thank you, but which consumer motherboard + CPU combo gives me 32 lanes of PCIe Gen 4 neatly divided into two x16 slots for two GPUs? I only asked this question because I was going to buy used computers and stuff a GPU into each.
Your point about networking is valid, and I'm hesitant to invest in 25GbE right now.
You don't need the cards to have full bandwidth; the only time it matters is when you're loading the model onto the card. You want a motherboard with physical x16 slots, but even x4 electrical connections would be good enough. Running the model doesn't need a lot of bandwidth. Remember, you only load the model once and then reuse it.
A PCIe Gen 4 x4 slot has a theoretical transfer rate of ~7.9 GB/s (after encoding overhead), and an x16 slot ~31.5 GB/s, so disk I/O is likely your limit even with an x4 slot.
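As a quick sanity check on those numbers, here's a sketch of the one-time load cost; the throughput figures are theoretical or assumed typical values, and your drive will vary:

```python
# Estimate how long a one-time model load takes over different links.
# Rates are theoretical maxima for PCIe and an assumed typical Gen 4 NVMe read.

MODEL_GB = 15.0  # Mistral Small 24B at Q4_K_M, per the size quoted above

links_gb_per_s = {
    "PCIe Gen 4 x16": 31.5,  # theoretical max
    "PCIe Gen 4 x4":   7.9,  # theoretical max
    "NVMe SSD read":   5.0,  # assumed typical Gen 4 drive
}

for name, rate in links_gb_per_s.items():
    print(f"{name}: ~{MODEL_GB / rate:.1f} s to move {MODEL_GB:.0f} GB")
# Even through an x4 slot the transfer is roughly two seconds; the SSD read is
# the slower step, and it only happens once per model load.
```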
I see. That solves a lot of the headaches I imagined I would have. Thank you so much for clearing that up
I assume you're talking about a CUDA implementation here. There are ways to do this with that system, and even sub-projects that expand on it. I'm mostly pointing out how pointless it is for you to do this. What a waste of time and money.
Edit: others are also pointing this out, but I’m still being downvoted. Mkay.
Used 3090s go for $800. I was planning to wait for the Arc B580s to go down in price and buy a few. The reason for the networked setup is that I couldn't find enough PCIe lanes in any of the used computers I was looking at. If there's either an affordable card with good performance and 48GB of VRAM, or an affordable motherboard + CPU combo with a lot of PCIe lanes under $200, then I'll gladly drop the distributed idea. I just need lots of VRAM, and this is the only way I could think of to get it.
Thanks
PLEASE look back at the crypto mining rush of a decade ago. I implore you.
You’re buying into something that doesn’t exist.