🦸‍♂️ Today I learned to fine-tune LLaMA 7B on my home RTX 3090 GPU with 24 GB of VRAM, using QLoRA (Quantized Low-Rank Adaptation)!
🧠 LLMs are huge: storing and updating their billions of parameters takes a ton of GPU memory, and CUDA out-of-memory errors haunt us all.
🛠️ DeepLearning.AI hosted a workshop by Piero Molino and Travis Addair from the open-source Ludwig project (9.3k⭐), a low-code framework for fine-tuning LLMs from a simple .yaml specification. Go give it a star and check it out here
⚖️ Travis did a great job explaining the LLM memory footprint problem and the methods to tackle it: quantization (projecting the original floating-point parameter representation, e.g. 32-bit, into a lower-precision space, e.g. 16-, 8-, or 4-bit), gradient accumulation (accumulating gradients over several batches before updating the weights, so a small per-step batch behaves like a larger effective one), and other approaches.
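If you want to see what those two tricks look like by hand, here is a minimal sketch using the Hugging Face transformers, peft, and bitsandbytes libraries (not the workshop's code, just an illustration of 4-bit quantization + LoRA adapters + gradient accumulation); the checkpoint name, LoRA hyperparameters, and batch settings are assumptions you would tune for your own setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # assumption: any LLaMA-style 7B causal LM checkpoint

# 4-bit NF4 quantization: the frozen base weights fit in ~4 GB instead of ~28 GB in fp32
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA: only small low-rank adapter matrices are trained, the quantized base stays frozen
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the parameters

# Gradient accumulation: batch_size 1 per step, gradients summed over 16 steps
# before each optimizer update, giving an effective batch size of 16
training_args = TrainingArguments(
    output_dir="llama7b-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,
)
# From here, pass `model`, `tokenizer`, and a tokenized dataset to a Trainer to fine-tune.
```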
🦉 Ludwig handles these optimisations for you. Smart.
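For comparison, here is roughly what the same setup could look like through Ludwig's Python API; the config dict mirrors the .yaml spec, but the exact key names are my best reading of the Ludwig LLM fine-tuning docs, so treat this as a sketch and check the current schema for your version:

```python
from ludwig.api import LudwigModel

# Assumed config keys (model_type, base_model, adapter, quantization, trainer)
# following Ludwig's LLM fine-tuning documentation; verify against your Ludwig version.
config = {
    "model_type": "llm",
    "base_model": "huggyllama/llama-7b",
    "input_features": [{"name": "prompt", "type": "text"}],
    "output_features": [{"name": "response", "type": "text"}],
    "adapter": {"type": "lora"},          # LoRA adapters
    "quantization": {"bits": 4},          # 4-bit base weights (QLoRA)
    "trainer": {
        "type": "finetune",
        "batch_size": 1,
        "gradient_accumulation_steps": 16,
        "learning_rate": 1e-4,
        "epochs": 3,
    },
}

model = LudwigModel(config=config)
results = model.train(dataset="train.csv")  # hypothetical CSV with `prompt` and `response` columns
```

The same dict can live in a .yaml file and be run from the CLI instead, which is the low-code workflow the workshop showcased.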
🎯 Opportunities for smaller, cheaper, tightly controlled, domain-specific fine-tuned models with lower latency than chaining retrieval-augmented generation (RAG) with a general-purpose model.
Here are the workshop recording and the associated training notebook, enjoy.