For a project I am working on now, I want to create a list of Tanzu CLI commands in a specific format. The Tanzu CLI has hundreds of commands, each with different options. To get the level of detail I want to extract, I need to look at the detailed documentation for each command group, which means processing hundreds of pages of data.
To automate this process, I am using the Python Scrapy web-scraping library, configured to crawl a website or a subsection of one. For this example, I wanted to identify all pages within the Tanzu Application Platform (TAP) 1.4 documentation that document the tanzu CLI plugins specific to TAP. docs.vmware.com is a massive website: it hosts the complete documentation not just for every VMware product, but for every version of every current VMware product.
I don't have any data on exactly how big it is, but it's massive. Crawling the entire site would be a very inefficient way to identify the specific documents I need, so I originally tried a simple filter to restrict the crawling behavior. That turned out to be overly restrictive and missed some of the documents I needed. I ended up implementing a multi-tiered filtering logic that strikes the right balance between finding all the relevant URLs while still being fast and efficient. I will explain this in more detail in the scraping section below.
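The idea of multi-tiered filtering can be sketched as a small standalone classifier. The URL patterns below are hypothetical stand-ins, not the actual patterns used in the project (those are covered in the scraping section of the full article): a broad allow tier keeps the crawl inside the TAP 1.4 documentation tree, a narrow priority tier flags pages likely to document CLI plugins, and a deny tier drops excluded areas.

```python
import re
from urllib.parse import urlparse

# Hypothetical tier patterns for illustration only.
ALLOW_TIER = re.compile(r"^/en/VMware-Tanzu-Application-Platform/1\.4/")
PRIORITY_TIER = re.compile(r"tanzu.*cli|cli.*plugin", re.IGNORECASE)
DENY_TIER = re.compile(r"/(rn|api)/")  # e.g. release notes, API stubs

def classify_url(url: str) -> str:
    """Return 'extract', 'crawl', or 'skip' for a candidate URL."""
    path = urlparse(url).path
    if not ALLOW_TIER.search(path) or DENY_TIER.search(path):
        return "skip"      # outside TAP 1.4 docs, or an excluded area
    if PRIORITY_TIER.search(path):
        return "extract"   # likely a CLI plugin page: scrape its text
    return "crawl"         # in scope: follow its links without extracting
```

In a Scrapy spider, a classifier like this would drive both decisions the crawler makes for each page: whether to yield an item with the page text, and whether to yield new `scrapy.Request` objects for the links it contains.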
Once the data is gathered and cleaned with Scrapy, another function inserts the ingested text into a prompt and uses the openai Python library to query GPT-4. The prompt instructs GPT-4 to look through the provided text, extract the information I want based on a reference pattern, and return it in the exact format I specify.
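A minimal sketch of that extraction step is shown below. The prompt wording and output format here are hypothetical stand-ins for the actual prompt, and the commented-out API call uses the openai library's chat completions interface, which requires the package to be installed and an API key configured.

```python
def build_prompt(page_text: str) -> str:
    """Insert scraped page text into an extraction prompt for GPT-4.

    The instructions and format pattern below are illustrative only,
    not the actual prompt used in the project.
    """
    return (
        "Extract every tanzu CLI command documented in the text below.\n"
        "Return one command per line in exactly this format:\n"
        "tanzu <plugin> <command> -- <one-line description>\n\n"
        f"Text:\n{page_text}"
    )

# The query itself would look something like this (not run here):
#
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_prompt(page_text)}],
# )
# commands = resp.choices[0].message.content
```

Keeping the prompt construction in its own function makes it easy to iterate on the instructions and rerun the same scraped pages until the model reliably returns the specified format.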
To continue reading this article, please proceed to the author's blog site at https://artfewell.com/blog/scrape-to-document-extraction/