Is Fine-Tuning Still Valuable?
Here is my personal opinion about the questions I posed in this tweet:
There are a growing number of voices expressing disillusionment with fine-tuning.
I'm curious about the sentiment more generally. (I am withholding sharing my opinion rn).
Tweets below are from @mlpowered @abacaj @emollick pic.twitter.com/cU0hCdubBU
I think that fine-tuning is still very valuable in many situations. I’ve done some more digging, and I find that people who say fine-tuning isn’t useful are often working on products where fine-tuning is unlikely to help.
Another common pattern is that people often say this in earlier stages of their product development. One sign that folks are in really early stages is that they don’t have a domain-specific eval harness.
It’s impossible to fine-tune effectively without an eval system, which is why skipping this prerequisite can lead people to write off fine-tuning prematurely. In the long term, it’s also impossible to improve your product without a good eval system, fine-tuning or not.
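To make the prerequisite concrete, here is a minimal sketch of what a domain-specific eval harness can look like. Everything in it is a hypothetical placeholder: the test cases, the pass/fail checks, and the stub standing in for a real LLM call are illustrative, not any particular company's setup.

```python
# Minimal sketch of a domain-specific eval harness.
# All names and cases here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                    # input sent to the model
    check: Callable[[str], bool]   # domain-specific pass/fail rule

def run_evals(model_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = sum(case.check(model_fn(case.prompt)) for case in cases)
    return passed / len(cases)

# Toy cases for a hypothetical query-language assistant: each check
# encodes a syntax rule the model's output must satisfy.
cases = [
    EvalCase("count events in the last hour", lambda out: "COUNT" in out),
    EvalCase("filter by service name", lambda out: "WHERE" in out),
]

# A stub "model" standing in for a real LLM call.
def stub_model(prompt: str) -> str:
    return "SELECT COUNT(*) WHERE service = 'api'"

print(run_evals(stub_model, cases))  # fraction of cases passed
```

The point is not the scoring rule (real checks are usually richer, e.g. parsing the output against the DSL grammar); it’s that once this harness exists, every prompt tweak or fine-tune produces a number you can compare.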
You should do as much prompt engineering as possible before you fine-tune, but not for the reasons you might think! The real value of doing lots of prompt engineering is that it’s a great way to stress-test your eval system.
If you find that prompt engineering works fine (and you are systematically evaluating your product), then it’s fine to stop there. I’m a big believer in using the simplest approach that solves the problem. I just don’t think you should write off fine-tuning yet.
Examples where I’ve seen fine-tuning work well
Generally speaking, fine-tuning works best for learning syntax, style, and rules, whereas techniques like RAG work best for supplying the model with context or up-to-date facts.
These are some examples from companies I’ve worked with. Hopefully, we will be able to share more details soon.
Honeycomb’s Natural Language Query Assistant - previously, the “programming manual” for the Honeycomb query language was being dumped into the prompt along with many examples. While this was OK, fine-tuning worked much better to allow the model to learn the syntax and rules of this niche domain-specific language.
ReChat’s Lucy - this is an AI real estate assistant integrated into an existing real estate CRM system. ReChat needs LLM responses in a very idiosyncratic format that weaves together structured and unstructured data so the front end can dynamically render widgets, cards, and other interactive elements in the chat interface. Fine-tuning was the key to making this work correctly. This talk has more details.
P.S. Fine-tuning is not limited to open or “small” models. Lots of folks have been fine-tuning GPT-3.5, such as Perplexity.AI and CaseText, to name a few.
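For readers who haven’t tried it, here is a sketch of preparing training data in the JSONL chat format that OpenAI’s fine-tuning API accepts for GPT-3.5: one JSON object per line, each containing a `messages` list of system/user/assistant turns. The query and answer below are made-up placeholders, not real Honeycomb or ReChat data.

```python
# Sketch: writing fine-tuning examples in OpenAI's chat JSONL format.
# The content of the examples is a made-up placeholder.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "Translate the user's question into the query language."},
            {"role": "user", "content": "count events in the last hour"},
            {"role": "assistant", "content": "COUNT | WHERE time > now() - 1h"},
        ]
    },
]

# One JSON object per line; this file is what you upload for fine-tuning.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Note that the assistant turns are exactly what the model is trained to reproduce, which is why this setup shines for idiosyncratic syntax and output formats.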