April 04, 2023

Automated Data Ingestion and AI-Assisted Extraction with GPT-4 and example extraction from VMware Tanzu documentation - Part 2

In this two-part series, I walk through code that can scrape a website or a subsection of one, clean the data, and automate calls to GPT-4 with context from the ingested documents, using a prompt that extracts the desired content in a specified format.

Picking up where we left off in part 1 of this series, we finished covering the scraping and cleaning functions, and now we will get into calling the OpenAI APIs using the openai Python library, which makes it really simple.

The HtmlScraperSpider class that we covered in part 1 downloaded each of the URLs that we discovered with the UrlScraperSpider class. For this example, I set the scraper to download the files into the /scrapy/html directory on my local filesystem. Next, I want to present each page that I downloaded to the AI model along with instructions about what we want it to do with the data we provide. The query we send to the OpenAI API is referred to as a prompt. Until gpt-3.5-turbo, every example I am aware of used the OpenAI Completions endpoint to query the model, and with that endpoint, the field you would pass the query to was called the prompt field. But with gpt-3.5-turbo and gpt-4, there is a new ChatCompletions endpoint that accepts a "messages" format, which is a more flexible and descriptive way to provide the query. I don't know whether the ChatCompletions endpoint was available before gpt-3.5-turbo, but as far as I can tell you cannot query the old Completions endpoint with gpt-3.5-turbo or gpt-4; you have to use the ChatCompletions endpoint as shown in the code I will share below.
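To make the Completions-versus-ChatCompletions distinction concrete, here is a minimal sketch of the "messages" format. The system/user split, the extraction instructions, and the helper function names are illustrative assumptions, not the exact code from this series, and the API call shape matches the openai Python library's v0.x ChatCompletion interface current as of this writing.

```python
import os

def build_messages(page_text: str) -> list:
    """Build a ChatCompletions-style messages list for one scraped page.

    The "messages" format replaces the single prompt field of the older
    Completions endpoint: a system message carries the standing instructions,
    and a user message carries the page content we want processed.
    """
    return [
        {"role": "system",
         "content": "Extract the requested fields from the documentation "
                    "page below and return them as JSON."},  # hypothetical instructions
        {"role": "user", "content": page_text},
    ]

def extract_from_page(page_text: str) -> str:
    """Send one page to the ChatCompletions endpoint (needs OPENAI_API_KEY set)."""
    import openai  # openai Python library, v0.x-style API
    openai.api_key = os.environ["OPENAI_API_KEY"]
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=build_messages(page_text),
    )
    # The assistant's reply is nested under choices[0].message.content
    return response["choices"][0]["message"]["content"]
```

Each downloaded page from /scrapy/html could then be read and passed through extract_from_page in a loop; keeping the instructions in the system message means only the user message changes per page.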

To continue reading this blog, please visit https://artfewell.com/blog/scrape-to-document-extraction-p2/
