February 14, 2023
Can I train an AI effectively with just source code? Accelerating bot training with ChatGPT - Part 1
Much of the data needed to train a bot has required significant human input. Can ChatGPT provide this data dynamically, and dramatically simplify and accelerate bot training? Let's find out!
Like many of us, I've been really swept up by the whole ChatGPT craze. It's inspiring me to dive deeper into machine learning. Since ChatGPT launched I have already taken a few courses in machine learning and AI, and it's been a fun and humbling experience. Machine learning is an extremely deep subject filled with really complex math. Fortunately I had some background from grad school doing similar math, so I am not completely lost. Perhaps the more important realization for me, though, was how ChatGPT itself dramatically changes the process of technology transfer for the dissemination of machine learning as it grows.
Every time there is a complex new technology, at first you need very specialized knowledge to understand the language of the scientists who made the key discoveries and advancements. It usually takes years for companies to build increasingly common language and abstractions, to the point where there are simplified software interfaces with full sets of documentation and training in language that is easier to understand. This process of technology transfer has to happen before any complex new technology can grow towards becoming mainstream, because there simply aren't enough people out there who can read doctoral-level research and distill it into practical day-to-day applications. The big impact of this is that any time there is a new technology, you can only grow it as fast as the market can absorb it.
This to me is one of the most fascinating and powerful impacts of ChatGPT. It understands the deep scientific and mathematical language of the latest machine learning research and can explain concepts better than most tutors. It can break concepts down to simple levels and then add detail gradually as you learn more. It can use its knowledge of advanced machine learning constructs to help you figure out a variety of potential practical applications, and then support you step by step as you seek to apply the technologies. It's imperfect, but it would have taken me years more study to arrive at even the limited insights I gained just over this past weekend. And while this is a cool exercise, perhaps its biggest impact is the realization that having a trained bot can dramatically, profoundly accelerate tech transfer.
You could start training a bot with your source code as you are starting a new development project. You could teach it the paradigms, practices and requirements for your project, and it could onboard and pair program with your devs to help them start faster and produce better and more consistent code, even with automated code completion, kind of like Copilot but specific to your project's requirements. As you prepare support teams, it could both train them and be a consistently available tutor, and it can also help troubleshoot errors, a lot. It can work with you to write PRDs that are highly nuanced, help prepare marketing plans, and by launch be ready to help train your sales force and customer base. You could train a model that could effectively convert product knowledge to training courses on the fly.
It is always interesting that new technology disruptions accelerate future disruptions, because each disruptive technology brings capabilities that help accelerate the development of the next one. Typically, though, an innovation contributes a technical discovery that makes developing the next one technically possible. It's not often that a new technology arrives that can fundamentally change market adoption of new technologies, by offering the potential to nearly eliminate the time humans need to understand and apply them. The internet itself was similar in this regard, but I think ChatGPT has the potential to have a much larger impact on this problem than even the introduction of the internet did.
Now, one of the most difficult parts of training machine learning models is that you need data that's already verified to start training: content that can teach the model correct answers, or responses that align with what a human would associate with a request. But getting human-provided data for thousands of records, which may be highly specialized, is a substantial investment. For example, consider getting data from source code itself. It seems logical that working source code could provide examples of code snippets that work, but what about answering questions from humans about code in plain language? What about questions beyond basic code completion, where you may want to do planning or exploration and reason together about ideas, as you can with ChatGPT, but in the context of the style and paradigms of your code project?
So let's say you have a source of static data, such as source code or documentation. Could ChatGPT itself provide the simulated human element, adding the rich context needed to create a dataset that could result in a bot that is highly effective in a specialized area? Source code and documentation alone would not be anywhere near enough data to train a model from scratch, but since we can build upon existing models like ChatGPT, I feel very optimistic that a relatively simple, inexpensive, and fully automatable process could have this result. I have been wanting to experiment with making some fine-tuning datasets to train an OpenAI model, to see how well it can answer questions when the datasets are made entirely from existing data plus ChatGPT providing the "human" element, so I was excited to spend this past weekend getting started on figuring this out.
Over this past weekend, I decided to dive into the deep end and get started on a fun project to try to train an AI bot. I was planning on starting off by making a Tanzu bot, and I still want to, but I recently got assigned to a project that I think would be an interesting and fun use case to try out. If you follow VMware, there's a good chance you have seen the recent product announcements for Aria Graph and Aria Hub, which are really cool technologies that can model entire cloud infrastructures. They're able to layer a beautiful UI on top of it, which signifies the efficacy of the data modeling they're doing underneath.
One of the key underlying technologies that Aria Graph and Hub are built on is [the Idem project](https://www.idemproject.io/), which is a data flow-oriented programming language. The project has pretty good documentation, but I am also interested in using it for new and advanced use cases and plugin development. Getting to this level of knowledge can take significant time and effort, and if this bot training yields effective results, it could dramatically accelerate my own path to gaining competence with this new language.
On top of that, Idem itself is based on another interesting new open source project called [Plugin Oriented Programming (POP)](https://pop-book.readthedocs.io/en/latest/). In the case of POP, we have source code, but since it's a paradigm, the source code is really for tools that support programming in this paradigm. I think this is a fascinating angle from which to approach the hypothesis of bot efficacy from source code and documents, as a paradigm requires a different and higher-order sort of reasoning to be effective. I also really want to become proficient with POP quickly, so I think this makes POP a really interesting subject for bot training.
I'm starting my project with the source code for POP and using it to generate training data. I will also use the documentation for POP and source code from other related projects built on the POP paradigm to create more data. After cleaning and tokenizing the data, I'll use some Python code and some freely available models to create embeddings, which will help the bot to calculate relationships and probabilities between words.
In the world of machine learning, natural language processing (NLP) is a critical area of study. It involves training models to understand human language and generate responses that sound like they were written by a person. One key challenge in NLP is figuring out how to represent words and sentences as numerical values that can be processed by a machine. Splitting text into units is known as tokenization, and mapping those units to numeric vectors is known as embedding.
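To make the idea of turning text into numbers a bit more concrete, here is a minimal sketch in plain Python. It is a toy, not what I am actually using for the bot: it tokenizes with a simple regular expression and builds bag-of-words count vectors, whereas real embeddings come from pretrained models and capture far more meaning. The function names and the three sample strings are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-like tokens (identifiers and numbers)."""
    return re.findall(r"[a-z_][a-z0-9_]*|\d+", text.lower())


def build_vocab(token_lists):
    """Assign every distinct token a fixed position in the vector."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}


def count_vector(tokens, vocab):
    """Toy 'embedding': one count per vocabulary token (bag of words)."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical sample inputs: two similar code snippets and one unrelated sentence.
docs = [
    "def load_plugin(name): return registry[name]",
    "def load_plugin(path): return registry[path]",
    "The weather was sunny all weekend",
]

token_lists = [tokenize(d) for d in docs]
vocab = build_vocab(token_lists)
vectors = [count_vector(toks, vocab) for toks in token_lists]

sim_code = cosine(vectors[0], vectors[1])   # the two code snippets
sim_other = cosine(vectors[0], vectors[2])  # code vs. unrelated prose
print(f"code vs code:  {sim_code:.2f}")
print(f"code vs prose: {sim_other:.2f}")
```

The two snippets share most of their tokens, so their vectors point in a similar direction, while the unrelated sentence shares none. A learned embedding does the same kind of comparison, just with vectors that encode meaning rather than raw counts.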
I'm currently working on two approaches to NLP: creating embeddings from the entire base of source code, and using snippets. Snippets are a more common way to start, as they are smaller and easier to work with than full source code. They provide a lot of insights, and you can still be very effective at using them without understanding every deep aspect of the mathematics behind NLP.
To start, I worked with ChatGPT to create some Python code that searches a directory/repo of source code files and extracts snippets, which are just the functions, classes and main parts of the code. Next, I will add rich contextual data, such as labels, annotations, and questions. This data is critical for yielding an effective model, but it is normally where we would need a lot of specialized human effort.
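As a rough sketch of what the enrichment step could produce, each snippet can be paired with a question and an answer and written out as one JSON object per line. A few assumptions to flag: the `prompt` and `completion` field names reflect the JSONL layout I am assuming for fine-tuning, the prompt template is arbitrary, and `ask_model_placeholder` is a hypothetical stand-in for a real ChatGPT call that just fabricates a trivial question so the example runs offline.

```python
import json


def make_record(snippet, question, answer):
    """Build one training example that pairs a snippet with a Q&A."""
    prompt = f"{question}\n\nCode:\n{snippet}\n\nAnswer:"
    # Leading space on the completion is a convention assumed here, not a requirement.
    return {"prompt": prompt, "completion": " " + answer.strip()}


def ask_model_placeholder(snippet):
    """Hypothetical stand-in for asking ChatGPT to write questions and answers.

    A real version would send the snippet to a model; this one only
    echoes the first line so the sketch is self-contained.
    """
    first_line = snippet.strip().splitlines()[0]
    return [("What does this code define?", f"It begins with: {first_line}")]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Hypothetical snippets standing in for the output of the extraction step.
snippets = [
    "def add(a, b):\n    return a + b",
    "class Hub:\n    pass",
]

records = [
    make_record(snippet, question, answer)
    for snippet in snippets
    for question, answer in ask_model_placeholder(snippet)
]

jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Because `json.dumps` escapes the newlines inside each snippet, every record stays on a single line, which is what makes the file easy to stream and validate.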
Here is the script I used to extract the snippets, which also places a delimiter between each snippet: