
I just hosted both models here: https://chat.tune.app/

Playground: https://studio.tune.app/



Thanks for the link. I just tested them and they also work in Europe without needing to start a VPN. What specs are needed to run these models, i.e. the Llama 70B and the WizardLM 8x22B model? On your site they run very nicely and the answers they provide are really good; they both passed my small test, and I would love to run one of them locally. So far I have only run 8B models on my 16GB RAM PC using LM Studio, but having such good models run locally would be awesome. I would upgrade my RAM for that. My PC has a 3080 laptop GPU and I can increase the RAM to 64GB. As I understand it, a 70B model needs around 64GB, but maybe only if it's quantized. Can you confirm that? Can I run Llama 3 as well as you do if I simply upgrade my RAM sticks? Or are you running it on a cloud, so you can't say much about the requirements for Windows PC users? Or do you have hardware usage data for the models on your site and can tell us what they need to run?
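
The rough math I'm going by (weights only, ignoring KV cache and runtime overhead, so please correct me if this is off):

    # back-of-the-envelope weight memory for a 70B-parameter model
    params = 70e9
    for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
        gb = params * bits / 8 / 1e9
        print(f"{name}: ~{gb:.0f} GB just for the weights")

By that estimate fp16 (~140GB) is out of the question, 8-bit (~70GB) would already not fit in 64GB, and a 4-bit quant (~35GB) should fit with room left for context.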


Hey Christoph, thanks for trying it out - we're running this on the cloud, specifically GCP, on A100s (80GB).

On your question about running these models locally: I'm not sure that just upgrading your RAM would give you the throughput you see on the website. The model would fit, but whatever doesn't fit in VRAM has to run on the CPU, so you might get pretty bad tokens/sec.
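
If you do want to try it locally, the usual route is llama.cpp with partial GPU offload (that's also the backend LM Studio wraps). A minimal llama-cpp-python sketch - the model path and layer count are just placeholders you'd tune to your hardware:

    from llama_cpp import Llama

    llm = Llama(
        model_path="Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # a 4-bit GGUF quant (placeholder name)
        n_gpu_layers=20,  # layers offloaded to VRAM; the rest stay in system RAM on the CPU
        n_ctx=4096,
    )
    out = llm("Explain quantization in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

The layers left on the CPU are usually what caps your tokens/sec, which is why more RAM alone doesn't buy much speed.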


Thanks for the reply.

I am currently testing the limits and got Llama 3 70B in a 2-bit-quantized form to run on my laptop with fairly low specs: an RTX 3080 laptop GPU with 8GB VRAM and 16GB system RAM. It runs at 1.2 tokens/s, which is a bit slow. The biggest issue, however, is the time it takes for the first token to be printed, which fluctuates between 1.8s and 45s.

I tested the same model on a 4070 (desktop version) with 16GB VRAM and 32GB system RAM and it runs at about 3-4 tokens per second. The 4070 also has the issue of a fairly long time until the first token is displayed; I think it was around 12s in my limited testing.

I am still trying to figure out how to speed up the time to first token. 4 tokens per second is usable for many cases because that's about reading speed.
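
My understanding is that the first-token delay is mostly prompt evaluation (plus model load if it isn't cached yet), so I've started timing it separately from generation. A rough llama-cpp-python sketch, with the same hypothetical GGUF and offload settings as in the earlier example:

    import time
    from llama_cpp import Llama

    llm = Llama(model_path="Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",
                n_gpu_layers=20, n_ctx=4096)

    start = time.time()
    first = None
    n_tokens = 0
    for chunk in llm("Summarize the plot of Hamlet.", max_tokens=128, stream=True):
        if first is None:
            first = time.time()  # first streamed token arrives after prompt evaluation
        n_tokens += 1
    total = time.time() - start
    print(f"time to first token: {first - start:.1f}s")
    print(f"generation: {n_tokens / total:.1f} tokens/s")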

There are also 1-bit-quantized 70B models appearing, so there might be ways to make it even a bit faster on consumer GPUs.

I think we are at the bare edge of usability here, and I'll keep testing.

I cannot tell exactly how this strong quantization affects output quality; information about that is mixed and seems to depend on the form of quantization as well.



