Using Large Language Models to follow natural language directives by auto-composing invocations of self-discovered capabilities

We have built a prototype AI Agent which combines the power of GPT-3, a live language REPL, and capabilities to read from and write to the Internet, to perform more generalized problem-solving than is possible with frozen Large Language Models or traditional search engines alone.

[Image: GPT-3 and Google both fail this word problem.]
‘Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder for $2 per egg. How much does she make every day?’
Neither GPT-3 nor Google returns the correct result. Example from Large Language Models are Zero-Shot Reasoners.

[Image: A multi-step text-processing directive.]
‘Grab the Wikipedia page for Gagarin. Split the text into lines. Count the words on each line. Print the average number of words per line.’
Neither GPT-3 nor Google is capable of this.

[Image: A division word problem.]
‘The company had 655400 shares which they divided equally among 94 employees. How many did each employee get?’
A result we have replicated from an example from AI21 Labs, an AI company out of Israel with $118.5mm in funding. Neither GPT-3 nor Google returns the correct result.

Summary

In this article, we:

  • Survey the present state of the art of AI with Large Language Models (LLMs)
  • Illustrate the present capabilities and deficits of AI Agent Models built on top of Large Language Models
  • Propose a method for auto-composing invocations of self-discovered capabilities to enable a user to complete an end-to-end task by simply providing a natural language directive. We demonstrate the feasibility of this approach by implementing a prototype system and show that it is able to complete a set of end-to-end tasks with high accuracy.

The Communication Gulf Between Humans and Machines

The principal constraint in capturing the utility of computers – whether they be supercomputers, mainframes, or the embedded micro-controller in your microwave – is the interaction in which the human expresses their desired end-state to the machine through the User Interface. Industry has invented and iterated upon thousands of different ideas to help humans convey their intent, but we struggle with even the simplest of these User Experiences. Remember programming your VCR to record your favourite show? Ever try to use a friend’s washing machine?

In Text Is the Universal Interface, @tszzl argues that the emergence of LLMs, like GPT-3, will turn the clock on User Interfaces back to the original computer shell most readers were first introduced to: “C:\>”.

[Image: code editor]

What is probably underappreciated by less technical audience members – and probably even by most computer-savvy people who do not spend much of their day problem-solving in the liminal spaces between computer programming languages – is that, as the story goes, “it’s turtles all the way down”: every piece of software you use today is written in high-level programming languages (say, Java) which get compiled and transformed many, many times before their intent is released as a stream of electrons playing a game of Marble Madness through wafers of silicon – which are themselves words and sentences in the language of logic gates. The mere act of rendering this webpage implicates JavaScript, HTML, CSS, C++, bytecode, intermediate object code, assembly, machine code… and many others! The foundation of all computing is language, and the mortar is the collection of languages that bridge between other languages, or between levels of abstraction.

There is an entire class of languages we haven’t even mentioned yet: protocols and APIs. These are languages that computer programs speak to each other in. Send a tweet? Pay a bill? Phone someone? An API call did that over a stack of protocols.

«Throw some example graphics demonstrating how a user intention would be translated to an API call.»

GPT-3 is a transformer model, and transformer models are language mavens

By now, most people will have familiarity with one, if not multiple, forms of deep learning – probably including transformer models, which are performing nearly magical feats: creating fascinating and stunning art that expresses the individuality of the prompter while speaking in the visual languages of art. Want to see how Picasso would have painted Trump in Cubist style? The Big Bang from the perspective of God? How about four of each in five seconds?

These image models are powered by the same transformer technology as Large Language Models such as GPT-3 – only the models that generate text from prior text are actually much, much larger. While DALL-E 2, the groundbreaking text-to-image AI from OpenAI, is around 3.5 billion parameters in size, GPT-3 – which is still probably the most advanced publicly available language model – is 175 billion parameters. And LaMDA, the language model a Google researcher was fired over after publicly claiming it might be sentient, is 137 billion. (Most people in related fields seem to think these language models are not sentient, but facsimiles of sentient output, leaving the sensible among us to wonder just what the hell the difference is…)

There are now even models which cross into the trillions of parameters, which is where GPT-4 from OpenAI is expected to be. If you haven’t played around with one of these yet, make your way over to OpenAI’s website, sign up, and hit the Playground. Just start talking to it – about anything. Ask it to write you a sonnet about your dog, or even a computer program to calculate Pi. It will do both. And it does so many other things that we are still discovering new capabilities practically every day in 2022, more than two years after it first became available in May of 2020. If you would like a review of the extent of GPT-3’s capabilities – including both its impressive strengths and disappointing deficits – you can find one here. «Gotta write this. Or find one to link to.»

So, can GPT-3 write programs?

Yes, it can. In fact, it is quite good at it – although in many respects it is not yet as good as a skilled human programmer. Better models are being trained every day. Even more importantly, it is getting better at discovering how to write programs to solve specific tasks. This is a really big deal, because it means that the computer can, in theory at least, write its own programs to do what we want it to do, without us having to write a single line of code. All we have to do is tell it what to do in natural language.

The catch is that, while GPT-3 can write programs that do what we want, it is not yet very good at discovering what we want it to do. So, if we want it to write a program to do something specific, we have to be very explicit about what that is. We have to give it a very clear set of instructions, in natural language, about what we want it to do.

So, the challenge is not in teaching GPT-3 what to do. The challenge is in asking it to do the right thing.

“Prompt Engineering” is no longer ironic

That GPT-3 can write code is not a new discovery; the capability was already observed in GPT-2.

What was known at least as early as 2021 «cite» by Chinese researchers is that GPT-3 has read through many R Markdown and Jupyter notebooks. These formats are meant to give humans a combined view of the natural language, the code, and each intermediate and final output for every step of a process. GPT-3 has read so many of them that it can generate the code sections from the natural language sections.

This format of programming is very much derived from Donald Knuth’s programming methodology known as “Literate Programming”.

«great notebook screenshot example»

In fact, if you let it, GPT-3 will not only generate the code, it will generate its expectation of the output – and even the next bit of natural language.

The code sections of these notebooks are run in something called a REPL – a Read-Eval-Print Loop. Each chunk of code is executed sequentially (“Read”, then “Eval”), the result is displayed in-line with the code (“Print”), and then the whole thing repeats for the next segment of code (“Loop”).
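To make the mechanics concrete, here is a minimal hand-rolled REPL sketch in Node.js/TypeScript – a toy illustration, not one of the notebook kernels described above:

import * as readline from "readline";
import * as vm from "vm";

// A persistent execution context, so variables survive between lines.
const context = vm.createContext({ console });

const rl = readline.createInterface({ input: process.stdin, output: process.stdout, prompt: "> " });
rl.prompt();
rl.on("line", (line) => {                          // Read
  try {
    const result = vm.runInContext(line, context); // Eval
    console.log(result);                           // Print
  } catch (err) {
    console.error(err);
  }
  rl.prompt();                                     // Loop
});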

Inserting the Ghost Into the Machine

Language models of all sizes work the same way: text in, text out. They retain no memories of past invocations. And they have no way to reach into the real world – whether it’s to search for new information, or perform actions in pursuit of a goal.

So, while they demonstrate remarkable capabilities that include reasoning, writing code, and authoring blog posts, they are ultimately limited to providing a model of the language in which they were trained.

There are a few ways in which language models are being used today to interact with the real world. Siri and Alexa demonstrate one of them, mapping specified tasks to specific function invocations. This is what powers your ability to say, “Hey Siri, remind me I have a doctor’s appointment at 3pm next Thursday on the morning of”. The most sophisticated and modern way is through something called Action Transformers, the technology behind Adept.AI.

These are all techniques which allow AI agents to process natural language directives for specific tasks that they have been pre-trained to identify, marshal into a specific format (essentially creating an AI that fills out forms, which regular old deterministic code can then consume as parameters), and execute.

The first step to availing the otherwise-cloistered capabilities of Large Language Models to humans for real work in the same way is to insert them, along with the user, into a REPL. The user provides natural language instructions, which the AI is prompted to help with in notebook format, and the AI builds out the corresponding code section. The code is executed, the result is appended to the session’s notebook and fed back into the AI, which either continues its process or decides that it has satisfied the request and breaks out of the loop to deliver the final result.
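Here is a minimal sketch of that loop, assuming two hypothetical helpers – complete(), which calls the language model, and runInRepl(), which executes a statement and returns its printed output. The prototype’s actual glue code is more involved; this just shows the shape:

async function agentLoop(
  directive: string,
  complete: (prompt: string) => Promise<string>,
  runInRepl: (code: string) => Promise<string>
): Promise<string> {
  let notebook = `Question: ${directive}\nIJavascript session:\n---\n`;
  for (let step = 0; step < 10; step++) {      // cap the number of iterations
    const next = await complete(notebook);     // the model proposes the next statement
    if (/^Answer:/m.test(next)) {
      return next;                             // the model broke out with a final answer
    }
    const output = await runInRepl(next);      // execute the proposed statement
    notebook += `${next}\n${output}\n`;        // append the result and feed it back
  }
  return "No answer found within the step budget.";
}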

One obvious example is using GPT-3 for symbolic mathematical reasoning – such as performing high-school level math problems that require the student to convert a natural language request (say, “Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder for $2 per egg. How much does she make every day?”) into some basic algebra. Interestingly, Large Language Models tend to mimic human abilities with language and math – we are really good at language, but mostly pretty bad at doing math in our heads.

GPT-3 will try, and will do an admirable job of guessing, but in these instances it will usually get close to – but not quite reach – the right answer. Large Language Models were actually thought to be incapable of generating a logically coherent narrative. And yet, a full two years after the initial release of GPT-3, the paper Large Language Models are Zero-Shot Reasoners discovered a simple technique which raised the accuracy of math reasoning from 17.7% to 78.7% on MultiArith, a standardized AI test for this kind of reasoning. The trick? Simply ask it to reason “step by step”.
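The technique amounts to two completion calls: one to elicit the reasoning, and one to extract the final answer from it. A minimal sketch, assuming a hypothetical complete() wrapper around any text-completion API:

async function zeroShotChainOfThought(
  question: string,
  complete: (prompt: string) => Promise<string>
): Promise<string> {
  // First pass: elicit the step-by-step reasoning chain.
  const reasoning = await complete(`Q: ${question}\nA: Let's think step by step.`);
  // Second pass: extract a final answer from that reasoning.
  return complete(
    `Q: ${question}\nA: Let's think step by step.${reasoning}\nTherefore, the answer is`
  );
}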

Similar gains in logical coherency were made with a prompting technique introduced in the paper Self-Consistency Improves Chain of Thought Reasoning in Language Models.

«Screenshot of this working»

Giving AI Agents a calculator and a Scratchpad

It turns out that giving a Large Language Model access to a programming language REPL grants it the same increase in capacity to solve complex problems. In the example from “GPT-3 Whisperer” Riley Goodside, using GPT-3 inside a REPL frees the model’s reasoning capacity from “doing math in its head” and turns it into a symbolic mathematical reasoner, capable of solving much more involved problems and relying on the REPL as both a working memory and a calculation engine.

State Of The Art

The ReAct approach specifies a command language that operates in much the same way as a REPL, except with a much more limited scope, but in ways that can control the loop it is in. Harrison Chase released a Python package that contains this capability, and demonstrates its use well in this thread, running prompt-chained question answering (utilising “Step by step”!) with the ability to direct searches.
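Roughly, the ReAct idiom interleaves free-text reasoning with tool invocations in a shape like the following – a paraphrase for illustration, not the package’s exact prompt:

Thought: I need the score of the last Blue Jays game.
Action: Search
Action Input: last Blue Jays game score
Observation: The Mariners beat the Blue Jays 10-9.
Thought: I now know the final answer.
Final Answer: The Mariners won, 10-9.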

Mixing these methods to generalize

Using a combination of the command-language idiom with a full programming language REPL, we few-shot an earlier prompt template from Riley Goodside, “You are GPT-3, and you can’t do math”, with some differences:

You are GPT-3. Today is October 16, 2022. Answer the following questions.

If you can answer directly, use this format:

Question: ${Question}
Answer: ${Answer}

You are unable to directly answer any question that requires:
  - Analyzing text as a sequence of characters (e.g., counting length, reversing strings)
  - Counting of more than several items (e.g., words in a sentence or items in a list)
  - Arithmetic that a human could not perform easily in their head
  
In these cases, consult IJavascript. Use this format:

Question: ${Question}
IJavascript session:
---
${IJavascript commands and output needed to find answer}
---
Answer: ${Answer}

Your training data was last updated April 2021, and you do not know any later events. To find information & perform tasks, there are many packages available, as well -- use the x.searchLoadedPackages() function call to display them, and then x.inspectLoadedPackage('packagename') to get a list of methods available. Use anything you need, but favor the first packages.

Whenever possible, use the output of the previous statements, which are assigned to variables that are automatically created with the prefix return_ on the completion of every statement.

Begin.

Question: When was George Washington born?
Answer: February 22, 1732

Question: How many legs does an antelope have?
Answer: Four

Question: What is 2 + 19?
Answer: 21

Question: What is 7 + 19^3?
IJavascript session:
---
In [1]: 7 + 19**3
Out[1]: 6866
console.log( "Done." );
---
Answer: 6866

Question: What is the number of legs that a spider has raised to the fifth power?
IJavascript session:
---
In [1]: 8**5
Out[1]: 32768
console.log( "Done." );
---
Answer: 32768

Question: Search for waifudiffusion, fetch the first result, question whether it can be used to generate anime images of hot women, and print the result
IJavascript session:
---
In [1]: x.searchLoadedPackages()
Out [1] [["x.http","perform HTTP requests"],["x.google","Google APIs"],["x.nlp","Natural Language Processing"]]
In [2]: x.inspectLoadedPackage('x.google')
Out [2] return_2 = ["async function search(query:string): Promise< {rank: number, url: string, title: string, description: string}[] >"]
In [3]: x.inspectLoadedPackage('x.http')
Out [3] return_3 = ["async function x.http.get(url: string): Promise< {status: number, statusText: string, content: any} >","async function x.http.post(url: string, data: any): Promise< string >","async function x.http.put(url: string, data: any): Promise< string >","async function x.http.delete(url: string): Promise< string >"]
In [4]: x.inspectLoadedPackage('x.nlp')
Out [4] return_4 = ["async function x.nlp.summarize(text: string): Promise< string >","async function x.nlp.question_answer(text: string, question: string): Promise< string >","async function x.ai_instruction_on_text( text: string, instructions: string ): Promise< string >"]
In [5]: await x.google.search('waifudiffusion')
Out [5] return_5 = [{"rank":1,"url":"https://huggingface.co/hakurei/waifu-diffusion","title":"hakurei/waifu-diffusion - Hugging FaceWaifu Diffusion Demo - a Hugging Face Space by hakurei","description":"waifu-diffusion is a latent text-to-image diffusion model that has been conditioned on high-quality anime images through fine-tuning. Original Weights. Gradio & ...waifu-diffusion-demo. Copied. like 109. Running onA10G. App Files Files and versions Community. 24. Linked models. Loading... view api."},{"rank":2,"url":"https://r... Output truncated. Full output is in the 'return_5' variable.
In [6]: await x.http.get(return_5[0].url)
Out [6] return_6 = {"status":200,"statusText":"OK","content":"[Hugging Face's logo] Hugging Face\n\n-   Models\n-   Datasets\n-   Spaces\n-   Docs\n-   Solutions\n\n-   Pricing\n-   \n\n-   \n\n    ------------------------------------------------------------------------\n\n-   Log In\n-   Sign Up\n\n\n\n[] hakurei\n\n/\n\nwaifu-diffusion\n\nCopied\n\nlike\n\n817\n\nText-to-Image\n\nDiffusers\n\nEnglish stable-diffusion\n\nLicense: creativeml-openrail-m\n\nModel card\n\nFiles Files and versions\n\nCommunity\n\n33\n\nUse in Dif... Output truncated. Full output is in the 'return_6' variable.
In [7]: await x.nlp.question_answer(return_6.content, 'can it generate anime images of hot women?')
Out [7] return_7 = "Lines 0 to 203\n\n\nYes, the waifu-diffusion model can generate anime images of hot women."
In [8]: console.log(return_7)
---
Answer: Yes, the waifu-diffusion model can generate anime images of hot women.

The first difference:

Your training data was last updated April 2021, and you do not know any
later events.  To find information & perform tasks, there are many packages
available, as well -- use the x.searchLoadedPackages() function call to
display them, and then x.inspectLoadedPackage('packagename') to get a list
of methods available.  Use anything you need, but favor the first packages.

We are informing GPT-3 that it has powers to reach outside itself, and that it can self-reflect to determine what the relevant capabilities are and how to invoke them. This is split up into a simple system of packages that contain methods.
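A minimal sketch of that discovery surface, assuming a simple in-memory registry. (The prototype’s actual implementation is not reproduced in this article; the shapes below just mirror the transcript output.)

type PackageEntry = { description: string; signatures: string[] };

// Hypothetical registry; the entries mirror the transcripts above.
const registry: Record<string, PackageEntry> = {
  "x.http": { description: "perform HTTP requests", signatures: ["async function x.http.get(url: string): Promise< {status: number, statusText: string, content: any} >" /* ... */] },
  "x.google": { description: "Google APIs", signatures: ["async function search(query:string): Promise< {rank: number, url: string, title: string, description: string}[] >"] },
  "x.nlp": { description: "Natural Language Processing", signatures: ["async function x.nlp.question_answer(text: string, question: string): Promise< string >" /* ... */] },
};

// Returns [name, description] pairs, as seen in Out[1] of the transcript.
function searchLoadedPackages(): [string, string][] {
  return Object.entries(registry).map(([name, pkg]) => [name, pkg.description] as [string, string]);
}

// Returns the method signatures the model is allowed to call.
function inspectLoadedPackage(name: string): string[] {
  return registry[name]?.signatures ?? [];
}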

Whenever possible, use the output of the previous statements, which are
assigned to variables that are automatically created with the prefix return_
on the completion of every statement.

We are encouraging GPT-3 to chain processes together using the prior return value, as sketched after this list. This has a few benefits:

  • It directs the model toward less complicated actions at each step
  • A human-in-the-loop can see and act on more intermediate stages of a process, if the human wants to alter an approach after a directive was given
  • If the agent gets an error, it experiences a smaller setback
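A minimal sketch of how each statement’s result could be bound to an auto-created return_N variable, assuming a Node.js vm-based execution context (the prototype’s actual mechanism is not reproduced here):

import * as vm from "vm";

// A shared context, so earlier results stay visible to later statements.
const replContext = vm.createContext({ console });
let statementCount = 0;

function evalStatement(code: string): unknown {
  const result = vm.runInContext(code, replContext);
  statementCount += 1;
  // Expose the result to subsequent statements as return_<n>.
  replContext[`return_${statementCount}`] = result;
  return result;
}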

It follows both the example provided and the above instruction, and begins by searching its capabilities with x.searchLoadedPackages():

In [1]: x.searchLoadedPackages()
Out [1] [["x.http","perform HTTP requests"],["x.google","Google APIs"],["x.nlp","Natural Language Processing"]]

x.http contains the popular HTTP verbs (GET, POST, PUT, DELETE). x.google, for the moment, only contains search. x.nlp contains a couple of prompt-flows for question answering and summarization.

GPT-3 thinks they’re all possibly implicated in this case, and inspects each package, which returns method signatures:

In [2]: x.inspectLoadedPackage('x.google')
Out [2] return_2 = ["async function search(query:string): Promise< {rank: number, url: string, title: string, description: string}[] >"]
In [3]: x.inspectLoadedPackage('x.http')
Out [3] return_3 = ["async function x.http.get(url: string): Promise< {status: number, statusText: string, content: any} >","async function x.http.post(url: string, data: any): Promise< string >","async function x.http.put(url: string, data: any): Promise< string >","async function x.http.delete(url: string): Promise< string >"]
In [4]: x.inspectLoadedPackage('x.nlp')
Out [4] return_4 = ["async function x.nlp.summarize(text: string): Promise< string >","async function x.nlp.question_answer(text: string, question: string): Promise< string >","async function x.ai_instruction_on_text( text: string, instructions: string ): Promise< string >"]

It starts out with a simple Google search, which returns results conforming to the declared return type:

In [5]: await x.google.search('waifudiffusion')
Out [5] return_5 = [{"rank":1,"url":"https://huggingface.co/hakurei/waifu-diffusion","title":"hakurei/waifu-diffusion - Hugging FaceWaifu Diffusion Demo - a Hugging Face Space by hakurei","description":"waifu-diffusion is a latent text-to-image diffusion model that has been conditioned on high-quality anime images through fine-tuning. Original Weights. Gradio & ...waifu-diffusion-demo. Copied. like 109. Running onA10G. App Files Files and versions Community. 24. Linked models. Loading... view api."},{"rank":2,"url":"https://r... Output truncated. Full output is in the 'return_5' variable.

Next, it wants to fetch the full text of the first result, which is the Hugging Face page for waifu-diffusion. This method originally returned HTML – or whatever other content type was provided – but GPT-3 easily gets confused by the markup, so I have found much better results by converting it to plain text with Pandoc before returning it:

In [6]: await x.http.get(return_5[0].url)
Out [6] return_6 = {"status":200,"statusText":"OK","content":"[Hugging Face's logo] Hugging Face\n\n-   Models\n-   Datasets\n-   Spaces\n-   Docs\n-   Solutions\n\n-   Pricing\n-   \n\n-   \n\n    ------------------------------------------------------------------------\n\n-   Log In\n-   Sign Up\n\n\n\n[] hakurei\n\n/\n\nwaifu-diffusion\n\nCopied\n\nlike\n\n817\n\nText-to-Image\n\nDiffusers\n\nEnglish stable-diffusion\n\nLicense: creativeml-openrail-m\n\nModel card\n\nFiles Files and versions\n\nCommunity\n\n33\n\nUse in Dif... Output truncated. Full output is in the 'return_6' variable.
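The Pandoc step is just shelling out to the pandoc binary; a minimal sketch, assuming Pandoc is installed locally:

import { execFileSync } from "child_process";

// Convert fetched HTML to plain text so the model never sees raw markup.
function htmlToPlainText(html: string): string {
  return execFileSync("pandoc", ["-f", "html", "-t", "plain"], { input: html }).toString("utf8");
}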

x.nlp.question_answer() is a function that performs a map/reduce prompt template over text of any size. One of the primary problems to solve when working with language models is the limited context length. Of the publicly available Large Language Models, GPT-3 has the largest context length at the time of writing: 4,096 tokens. A token is most easily approximated as a short word, word fragment, or punctuation mark.

In [7]: await x.nlp.question_answer(return_6.content, 'can it generate anime images of hot women?')

In a question-answering map/reduce, the text is first divided into chunks, each as close as possible to the maximum prompt-plus-completion context length.

Author’s note: The precise method of division is going to be a valuable thing to explore in the future. I started by pre-tokenizing the prompt and the template, and counting up to the maximum context size minus the maximum output tokens. I got progressively better results by moving to more logically meaningful breaks, such as sentences or lines – which is what this proof-of-concept uses – but I suspect even better performance yet from using paragraphs for natural language and, when dealing with code, breaking an Abstract Syntax Tree into the largest possible chunks from the top down. This is where a lot of near-future AI pair-programming innovations are likely to come from. An additional example of this tree-based context-stuffing is breaking an HTML document down the same way through the DOM. Context-stuffing is pretty clearly on a trajectory toward a graph database.
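A minimal sketch of the line-based chunking this proof-of-concept uses, assuming a hypothetical countTokens() function (in practice, a BPE tokenizer matching the model):

declare function countTokens(text: string): number; // hypothetical tokenizer

function splitIntoChunks(text: string, maxTokens: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const line of text.split("\n")) {
    // Start a new chunk once adding this line would overflow the token budget.
    if (current && countTokens(current + line) > maxTokens) {
      chunks.push(current);
      current = "";
    }
    current += line + "\n";
  }
  if (current) chunks.push(current);
  return chunks;
}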

The map portion runs the same prompt template over each chunk, with the chunk’s content available as a template variable. The partial answers are then reduced by running a template which directs their summation into a single answer.

This is built on top of generalized map/reduce functions, which will likely be the basis for many prompt-flow techniques.
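Putting the halves together, a minimal sketch of such a map/reduce question-answering flow, reusing the splitIntoChunks() sketch above and a hypothetical complete() wrapper (the actual x.nlp prompt templates are not reproduced here):

declare function splitIntoChunks(text: string, maxTokens: number): string[]; // sketched above
declare function complete(prompt: string): Promise<string>;                  // hypothetical LLM call

async function questionAnswer(text: string, question: string): Promise<string> {
  const chunks = splitIntoChunks(text, 3000); // leave headroom for the completion
  // Map: ask the question of each chunk independently.
  const partials = await Promise.all(
    chunks.map((chunk) => complete(`Text:\n${chunk}\n\nQuestion: ${question}\nAnswer:`))
  );
  // Reduce: combine the partial answers into one final answer.
  return complete(
    `Candidate answers:\n${partials.join("\n")}\n\nCombine these into a single answer to the question: ${question}\nAnswer:`
  );
}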

Out [7] return_7 = "Lines 0 to 203\n\n\nYes, the waifu-diffusion model can generate anime images of hot women."
In [8]: console.log(return_7)

It is imperative to have GPT-3 tell you when you should stop letting it churn on the problem. The code that glues the REPL and GPT together with prompt-chains knows it can break out of the recursion when it sees this. Otherwise, it will continue, and the results either repeat the answer infinitely or degenerate into something unrelated.
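A minimal sketch of that termination check, assuming the model signals completion by committing to an Answer: line (as in the few-shot examples above):

// Break out of the generate-execute loop once the model emits a final
// "Answer:" line instead of proposing another REPL statement.
function isFinished(modelOutput: string): boolean {
  return /^Answer:/m.test(modelOutput);
}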

Another example invocation, this time asking for the latest Blue Jays score (of which I knew the tragic answer already):

Question: Fetch the score from the last Bluejays game
IJavascript session:
---
In [1]: x.searchLoadedPackages()
Out [1] return_1 = [["x.http","perform HTTP requests"],["x.google","Google APIs"],["x.nlp","Natural Language Processing"],["x.illustrations","Illustrations"]]

This time, it inspects only the package it needs:

In [2]: x.inspectLoadedPackage('x.http')
Out [2] return_2 = ["async function x.http.get(url: string): Promise< {status: number, statusText: string, content: any} >","async function x.http.post(url: string, data: any): Promise< string >","async function x.http.put(url: string, data: any): Promise< string >","async function x.http.delete(url: string): Promise< string >"]

Something slightly different happens here: it already knows the most reliable place to get a Toronto Blue Jays score, and instructs the REPL to pull down the Blue Jays page on ESPN.

In [3]: await x.http.get('https://www.espn.com/mlb/team/_/name/tor/toronto-blue-jays')
Out [3] return_3 = {"status":200,"statusText":"OK","content":"[]\n\nSkip to main content\n\nSkip to navigation\n\nESPN\n\n-   NFL\n    submenu\n-   NCAAF\n    submenu\n-   NHL\n    submenu\n-   NBA\n    submenu\n-   MLB\n    submenu\n-   Soccer\n    submenu\n-   More Sports\n    submenu\n\n-   [ESPN+]\n    submenu\n-   Watch\n    submenu\n-   Listen\n    submenu\n-   Fantasy\n    submenu\n-   More ESPN\n    submenu\n\n-   Search\n\n-   Profile Management\n\n[Blue Jays]\n\n\n\nTORONTOBLUE JAYS\n\n\nFollow\n\n-   92-70\n-   2nd... Output truncated. Full output is in the 'return_3' variable.

The rest of this works in much the same way:

In [4]: await x.nlp.question_answer(return_3.content, 'What was the score of the last game?')
Out [4] return_4 = " The last game was 10-9, with the Mariners winning."
In [5]: console.log(return_4)
---
Answer: The last game was 10-9, with the Mariners winning.

Agent GPT-3

We have constructed an agent, built out of prompt-flows, which include chain-of-thought prompt-chains, map/reduce templated prompt-flows, and a simplified bi-directional API interface that allows for self-discovery of capabilities.

This is not a general agent – it’s prompt2api with auto-composition. But all of these techniques are pushing towards generality. And the progress towards generality seems to be pushing more and more of the work of composition inside the AI. This is what gives an AI agent agency.

It has its deficits, of course. It does not have a mechanism to determine when a task is infeasible. This approach also does not expose any of the problem-solving process, which is probably critical to further generalizing the problem-solving capabilities of this approach.

