Like many of us, I've been really swept up by the whole ChatGPT craze. One of the most difficult parts of training machine learning models is that you need verified data before you can even start: content that can teach the model correct answers, or responses that align with what a human would expect for a given request. But getting human-provided data for thousands of records, which may be highly specialized, is a substantial investment. Consider source code as an example: it seems logical that working source code could provide examples of code snippets that work, but what about answering plain-language questions from humans about that code? And what about questions beyond basic code completion, where you may want to plan, explore, and reason together, as you can with ChatGPT, but in the context of the style and paradigms of your own code project?
So let's say you have a source of static data, for example source code or documentation. Could ChatGPT itself provide the simulated human element, adding the rich context needed to create a dataset that could yield a bot highly effective in a specialized area? Source code and documentation alone would not be anywhere near enough data to train a model from scratch, but since we can build on existing models like ChatGPT, I feel very optimistic that a relatively simple, inexpensive, and fully automatable process could get us there. I have been wanting to experiment with making fine-tuning datasets for an OpenAI model, to see how well it can answer questions when the datasets are built entirely from existing data plus ChatGPT providing the "human" element, so I was excited to spend this past weekend getting started on figuring this out.
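To make the idea concrete, here is a minimal sketch of what one record in such a dataset might look like. It assumes OpenAI's classic prompt/completion JSONL fine-tuning format; the question and answer text are made-up placeholders standing in for what ChatGPT would generate from a real code snippet, and the `###` separator and `END` stop token follow common fine-tuning conventions rather than anything specific to this project.

```python
import json

def make_record(question: str, answer: str) -> str:
    """Format one simulated Q&A pair as a JSONL line in the
    prompt/completion fine-tuning format."""
    return json.dumps({
        # A fixed separator marks where the prompt ends and the
        # completion should begin.
        "prompt": question + "\n\n###\n\n",
        # A leading space and an explicit stop token are common
        # conventions for completion-style fine-tuning data.
        "completion": " " + answer + " END",
    })

# Placeholder pair: in practice, ChatGPT would generate the question
# and answer from a real snippet of your source code or docs.
pairs = [
    ("What does the parse_config() function return?",
     "It returns a dictionary of validated configuration values."),
]

jsonl = "\n".join(make_record(q, a) for q, a in pairs)
print(jsonl)
```

The whole pipeline then amounts to looping this over your codebase: feed each snippet to ChatGPT, ask it to play the human, and append the resulting pairs to the JSONL file you hand to the fine-tuning API.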
Continue reading this post at ArtFewell.com. There is also a companion video; please check it out here!