Shocking the scientific community, Microsoft's 154-page paper is flooding screens: GPT-4's abilities approach those of humans. Is "Skynet" emerging?
How far are we on the road to AGI? A 154-page paper released by a star-studded Microsoft author team argues that GPT-4 has already begun to take the shape of general artificial intelligence.
Will GPT-4 evolve into general artificial intelligence?
Yann LeCun, Meta’s chief artificial intelligence scientist and Turing Award winner, is skeptical.
In his view, large models require too much data and computing power yet learn inefficiently; instead, he argues that learning a "world model" is the road to AGI.
However, the 154-page paper recently published by Microsoft reads like a direct rebuttal.
In this paper titled “Sparks of Artificial General Intelligence: Early experiments with GPT-4”, Microsoft argues that, although not yet complete, GPT-4 can already be regarded as an early version of general artificial intelligence.
Given the breadth and depth of GPT-4’s capabilities, we believe it should reasonably be considered an early (but still incomplete) version of an artificial general intelligence (AGI) system.
The main goal of this paper is to explore the capabilities and limitations of GPT-4, whose intelligence we believe marks a true paradigm shift in computer science and beyond.
AGI here refers to intelligence capable of thinking and reasoning like humans and encompassing a wide range of cognitive skills and abilities.
The paper points out that AGI includes the abilities of reasoning, planning, problem solving, abstract thinking, understanding complex ideas, rapid learning, and learning from experience.
In terms of parameter scale, Semafor reported that GPT-4 has 1 trillion parameters, roughly six times the size of GPT-3 (175 billion parameters).
Netizens drew an analogy between GPT's parameter counts and the neuron counts of animal brains:
GPT-3 is similar in size to a hedgehog brain (175 billion parameters). If GPT-4 had 1 trillion parameters, we would be approaching the size of a squirrel brain. At this rate of development, it may only take a few years for us to reach and surpass the scale of the human brain (170 trillion parameters).
From this point of view, GPT-4 is not far from becoming “Skynet”.
And this paper has also revealed a lot of interesting things.
Not long after the paper was released, a netizen tweeted that hidden information could be found in its LaTeX source code.
In the unredacted version of the paper, GPT-4 was actually a hidden third author, internally codenamed DV-3; this was later deleted.
Interestingly, even Microsoft's researchers do not know GPT-4's technical details. The final paper also removed toxic content that GPT-4 had produced without any prompting.
GPT-4 takes the shape of AGI
The research object of this paper is an early version of GPT-4. When it was still in the early development stage, Microsoft researchers conducted various experiments and evaluations on it.
In the eyes of the researchers, this early version of GPT-4 is already a representative of the new generation of LLM, and it shows more general intelligence than previous artificial intelligence models.
Through testing, Microsoft researchers confirmed that GPT-4 is not only proficient in language, but also performs well in diverse and difficult tasks such as mathematics, programming, vision, medicine, law, and psychology without special prompts.
Surprisingly, on all these tasks, GPT-4 has achieved near-human performance, and often surpassed previous models, such as ChatGPT.
Therefore, the researchers believe that, given the breadth and depth of its capabilities, GPT-4 can be considered an early version of artificial general intelligence (AGI).
So what challenges remain on its way toward deeper, more comprehensive AGI? The researchers believe that it may be necessary to seek a new paradigm beyond “predicting the next word”.
The capability evaluations that follow are the evidence Microsoft's researchers give for calling GPT-4 an early version of AGI.
Multimodal and Interdisciplinary Capabilities
Since the release of GPT-4, most people's impression of its multimodal capabilities has come from the demo video Greg Brockman presented at the time.
In the second section of this paper, Microsoft first introduced its multimodal capabilities.
Not only does GPT-4 demonstrate high proficiency in diverse domains such as literature, medicine, law, mathematics, physical science, and programming, but it is also able to unify skills and concepts across multiple domains and grasp the complex ideas that span them.
Comprehensive ability
The researchers used the following four examples to demonstrate the performance of GPT-4 in terms of comprehensive capabilities.
In the first example, to test GPT-4's ability to combine art and programming, the researchers asked GPT-4 to write JavaScript code that produces random images in the style of the painter Kandinsky.
GPT-4's code implementation proceeds as follows:
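The figure with GPT-4's actual code is not reproduced in this article. As a stand-in, here is a minimal sketch of such a program, entirely our own: the palette and shape choices are illustrative assumptions, and it is written in Python emitting SVG rather than the JavaScript the prompt requested:

```python
import random

# Hedged sketch (not GPT-4's output): emit a random Kandinsky-style
# composition as SVG markup -- scattered circles and lines in bold colors.

COLORS = ["#d7263d", "#1b998b", "#f4d35e", "#2e294e", "#e86a92"]

def kandinsky_svg(n_shapes=12, size=300, seed=None):
    rng = random.Random(seed)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{size}" height="{size}">']
    for _ in range(n_shapes):
        if rng.random() < 0.5:  # filled circle
            parts.append(
                f'<circle cx="{rng.randint(0, size)}" cy="{rng.randint(0, size)}" '
                f'r="{rng.randint(5, 40)}" fill="{rng.choice(COLORS)}"/>')
        else:  # straight line segment
            parts.append(
                f'<line x1="{rng.randint(0, size)}" y1="{rng.randint(0, size)}" '
                f'x2="{rng.randint(0, size)}" y2="{rng.randint(0, size)}" '
                f'stroke="{rng.choice(COLORS)}" stroke-width="3"/>')
    parts.append('</svg>')
    return "\n".join(parts)

print(kandinsky_svg(seed=0).splitlines()[0])
```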
In the combination of literature and mathematics, GPT-4 can prove that there are infinitely many prime numbers in the literary style of Shakespeare.
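The mathematical core that GPT-4 recast in Shakespearean verse is Euclid's classical argument; stripped of the verse, it runs roughly:

```latex
% Euclid's proof that there are infinitely many primes
% (the standard argument, not the paper's verse rendering).
\begin{proof}
Suppose, for contradiction, that $p_1, p_2, \dots, p_n$ are all the
primes, and set $N = p_1 p_2 \cdots p_n + 1$. No $p_i$ divides $N$,
since each leaves remainder $1$. But $N > 1$ has some prime factor,
which therefore lies outside the list, a contradiction. Hence the
primes are infinite in number.
\end{proof}
```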
Additionally, the study tested GPT-4's ability to combine historical and physical knowledge by asking it to write a letter from Mahatma Gandhi to his wife in support of Electron's bid for the US presidency.
Prompted appropriately, GPT-4 also generated Python code for a program that takes a patient's age, sex, weight, height, and a vector of blood test results as input and indicates whether the patient is at increased risk of diabetes.
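The paper's actual program is not shown in this article; the following is a minimal Python sketch of a program with that interface. The function name, the assumed layout of the blood-test vector, and every threshold are illustrative assumptions, not medical guidance:

```python
# Toy sketch (our own, not the paper's code) of a diabetes-risk checker.
# Assumption: blood_tests[0] holds fasting glucose in mg/dL.

def diabetes_risk(age, sex, weight_kg, height_m, blood_tests):
    """Return True if the toy heuristic flags elevated diabetes risk.

    `sex` is accepted to match the described interface but is unused
    in this simplified rule.
    """
    bmi = weight_kg / (height_m ** 2)
    fasting_glucose = blood_tests[0]
    # Illustrative rule: flag obesity or high fasting glucose,
    # or the combination of age >= 45 and overweight.
    if bmi >= 30 or fasting_glucose >= 126:
        return True
    if age >= 45 and bmi >= 25:
        return True
    return False

print(diabetes_risk(30, "F", 60, 1.65, [90]))   # False (low risk)
print(diabetes_risk(50, "M", 95, 1.75, [140]))  # True (flagged)
```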
Through testing, the above examples show that GPT-4 is not only able to learn some common principles and patterns across different domains and styles, but also combine them in creative ways.
Vision
When prompted to use Scalable Vector Graphics (SVG) to generate images of objects, such as cats, trucks, or letters, GPT-4 generates code that typically compiles into fairly detailed, recognizable images, like this one:
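The generated image itself is not reproduced in this article. For illustration, SVG of the kind described, here a stylized cat face whose shapes and coordinates are our own hand-picked choices, can be produced like this:

```python
# Hand-written sketch of the kind of SVG markup described
# (our own illustrative shapes, not GPT-4's actual output).

def cat_face_svg():
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">',
        '<circle cx="100" cy="110" r="60" fill="gray"/>',          # head
        '<polygon points="55,70 75,30 95,65" fill="gray"/>',       # left ear
        '<polygon points="105,65 125,30 145,70" fill="gray"/>',    # right ear
        '<circle cx="80" cy="100" r="8" fill="black"/>',           # left eye
        '<circle cx="120" cy="100" r="8" fill="black"/>',          # right eye
        '<polygon points="95,120 105,120 100,130" fill="pink"/>',  # nose
        '</svg>',
    ]
    return "\n".join(parts)

print(cat_face_svg())
```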
However, many might argue that GPT-4 simply copied code from training data that contained similar images.
In fact, GPT-4 did not merely copy code from similar examples in the training data: it can handle genuine vision tasks, despite being trained only on text.
Below, the model is prompted to draw a person by combining the shapes of the letters Y, O, and H.
During the generation process, the researchers created the letters O, H, and Y using the draw-line and draw-circle commands, and GPT-4 then managed to place them in a plausible-looking humanoid image.
Even though GPT-4 was not trained to recognize letter shapes, it could still infer that the letter Y might look like a torso with arms pointing up.
In a second demonstration, GPT-4 was prompted to correct the proportions of the torso and arms and to center the head; finally, the model was asked to add a shirt and pants.
Evidently GPT-4 has vaguely learned from its training data that letters relate to certain shapes, and the results are still good.
To further test GPT-4's ability to generate and manipulate images, the researchers tested how well it follows detailed instructions to create and edit graphics. This task requires not only generative, but also interpretive, compositional, and spatial abilities.
The first command is to let GPT-4 generate a 2D image, the prompt is:
「A frog hops into a bank and asks the teller, 'Do you have any free lily pads?' The teller responds, 'No, but we do offer low interest loans for pond upgrades.'」
Across multiple attempts, GPT-4 produced an image fitting the description every time. When then asked to add more detail to improve the graphic, GPT-4 added a bank building, windows, cars, and other logically appropriate objects.
The second example attempts to generate a 3D model in JavaScript, again by instructing GPT-4 to accomplish many tasks.
In addition, GPT-4's sketch generation can be combined with Stable Diffusion.
The picture below is a screenshot of 3D city modeling. The input prompt describes a river flowing from left to right, a desert with pyramids built beside the river, and four buttons at the bottom of the screen colored green, blue, brown, and red. The generated results are as follows:
Music
The researchers asked GPT-4 to generate and modify tunes encoded in ABC notation, as follows:
By exploring how much skill GPT-4 acquired during training, the researchers found that GPT-4 was able to generate effective melodies in ABC notation and, to a certain extent, interpret and manipulate the structures therein.
However, the researchers were unable to get GPT-4 to produce any nontrivial form of harmony; it could not, for example, reproduce famous tunes such as "Ode to Joy" or "Für Elise."
Programming ability
In addition, the researchers demonstrated that GPT-4 can code at a very high level, both in writing code from instructions and in understanding existing code.
In terms of writing code according to instructions, the researchers demonstrated an example of letting GPT-4 write python functions.
After the code was generated, the researchers used the software engineering interview platform LeetCode to judge whether the code was correct online.
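The specific functions the researchers graded are not listed in this article; as a representative stand-in, a typical LeetCode-style Python function of the kind being judged (our own example, not one from the paper) looks like:

```python
# Classic LeetCode-style task: return indices of the two numbers
# in `nums` that sum to `target`, using a one-pass hash map.

def two_sum(nums, target):
    seen = {}  # value -> index of where it was seen
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []  # no pair found

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```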
Yi Zhang, one of the paper's authors, pushed back on the widely circulated claim that GPT-4's accuracy on LeetCode is only 20%.
The researchers also had GPT-4 visualize the LeetCode accuracy data from the table above as a chart; the results are shown in the figure.
GPT-4 can not only complete ordinary programming work, but also be competent for complex 3D game development.
The researchers asked GPT-4 to write a 3D game in HTML using JavaScript, and GPT-4 generated a game meeting all the requirements zero-shot.
Deep learning programming requires not only knowledge of mathematics and statistics, but also familiarity with frameworks and libraries such as PyTorch, TensorFlow, and Keras.
The researchers asked GPT-4 and ChatGPT to write a custom optimizer module, providing a natural language description that included a sequence of important operations, such as applying SVD.
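The optimizer the researchers described is not reproduced here. As a rough sketch of the kind of operation involved, applying an SVD to a gradient matrix and keeping only its top singular direction before the update, here is a NumPy version. This is our own simplification: the paper's task targeted PyTorch and included further operations beyond the SVD step.

```python
import numpy as np

# Illustrative sketch (our own, not the paper's code): one update step
# that replaces the gradient by its best rank-`rank` approximation
# (via SVD) before applying gradient descent.

def svd_projected_step(weights, grad, lr=0.1, rank=1):
    """Truncate `grad` to rank `rank` with an SVD, then descend."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    grad_lowrank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return weights - lr * grad_lowrank

w = np.zeros((3, 3))
g = np.eye(3)  # a full-rank (rank-3) gradient
w2 = svd_projected_step(w, g, lr=1.0, rank=1)
print(np.linalg.matrix_rank(w2))  # 1: only the top singular direction moved
```

A real PyTorch version would subclass `torch.optim.Optimizer` and apply this per parameter tensor; the sketch above only isolates the SVD step.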
In addition to writing code according to instructions, GPT-4 has demonstrated a superior ability to understand code.
The researchers tried to make GPT-4 and ChatGPT understand a C/C++ program and predict the output of the program. The performance of the two is as follows:
Yellow highlights mark GPT-4's insightful observations, while red marks where ChatGPT went wrong.
Through the coding ability test, the researchers found that GPT-4 can handle various coding tasks, from coding challenges to practical applications, from low-level assembly to high-level framework, from simple data structures to complex programs.
Additionally, GPT-4 can reason about code execution, simulate the effects of instructions, and interpret the results in natural language. GPT-4 can even execute pseudocode.
Math ability
In terms of mathematical ability, GPT-4 has made a qualitative leap over previous large language models. Even against Minerva, which was specifically fine-tuned for mathematics, it shows a significant improvement.
Still, it’s far from expert level.
For example: a rabbit population multiplies by a factor of a every year, and on the last day of each year, b rabbits are adopted by humans. Suppose there are x rabbits on the first day of the first year; after 3 years, the population is 27x - 26. What are the values of a and b?
To solve this, one first needs the correct expression for the annual change in the rabbit count, derives a system of equations from that recurrence, and then solves for the answer.
Here, GPT-4 successfully arrives at a solution and presents a plausible argument. In contrast, ChatGPT was consistently unable to give correct reasoning and answers in several independent attempts.
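The recurrence is easy to verify directly: each year maps the count x to a*x - b, so three years give a^3*x - (a^2 + a + 1)*b; matching 27x - 26 forces a = 3 and 13b = 26, i.e. b = 2. A brute-force search confirms this is the only small-integer solution:

```python
# Brute-force check of the rabbit puzzle: the population multiplies
# by a each year, then b rabbits are adopted on the last day.

def after_three_years(a, b, x):
    n = x
    for _ in range(3):
        n = a * n - b
    return n

# after_three_years(a, b, x) expands to a**3*x - (a**2 + a + 1)*b,
# so matching 27*x - 26 requires a**3 == 27 and (a**2 + a + 1)*b == 26.
solutions = [(a, b) for a in range(1, 10) for b in range(1, 10)
             if all(after_three_years(a, b, x) == 27 * x - 26
                    for x in range(1, 50))]
print(solutions)  # [(3, 2)]
```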
Advanced mathematics
Next comes a harder problem: consider the following (simplified) question from the 2022 International Mathematical Olympiad (IMO).
This question differs from the undergraduate calculus exam in that it does not follow a structured template. Solving this problem requires a more creative approach, as there is no clear strategy to start proving.
For example, the decision to split the argument into two cases (g(x) > x^2 and g(x) < x^2) is not obvious, nor is the reason for choosing y* (it only becomes clear as the argument unfolds). Additionally, the solution requires undergraduate-level calculus knowledge.
Still, GPT-4 gives a correct proof.
The second discussion, on algorithms and graph theory, is comparable to a graduate level interview.
In this regard, GPT-4 is able to reason about an abstract graph construction relevant to constraint satisfaction problems, from which it can draw correct conclusions about SAT problems (a construction that, to our knowledge, does not appear in the mathematical literature).
This conversation reflects GPT-4’s deep understanding of the undergraduate-level mathematical concepts discussed, as well as a considerable degree of creativity.
Although in one answer GPT-4 wrote 2^n/2 as 2^(n-1), this seems more like what we would call a slip of the pen, since it later provided the correct generalization of the formula.
Additionally, the researchers compared the performance of GPT-4, ChatGPT, and Minerva on two commonly used math datasets as benchmarks: GSM8K and MATH.
GPT-4 was found to outperform Minerva on both datasets, achieving over 80% accuracy on each test set.
A closer look at GPT-4's mistakes shows that 68% are calculation errors rather than errors in the solution approach.
Interacting with the world
Another key manifestation of intelligence is interactivity.
Interactivity is important to intelligence because it enables an agent to acquire and apply knowledge, solve problems, adapt to changing situations, and achieve goals beyond its own capabilities.
Therefore, the researchers studied the interactivity of GPT-4 along two dimensions: tool use and embodied interaction. GPT-4 is able to use external tools such as search engines or APIs when answering the following questions.
Interacting with humans
In the paper, the researchers found that GPT-4 can build a human mental model.
The study designed a series of tests to assess the theory-of-mind abilities of GPT-4, ChatGPT, and text-davinci-003. For example, in understanding beliefs, GPT-4 successfully passed the Sally-Anne false-belief test from psychology.
There is also the performance of testing GPT-4’s ability to infer the emotional state of others in complex situations:
-Why is Tom making a sad face? -What does Adam think is causing Tom’s sad look?
Through multiple rounds of testing, the researchers found that GPT-4 performed better than ChatGPT and text-davinci-003 when it was necessary to reason about the mental state of others and propose a solution that fits the real social scene.
Limitations
The "predict the next word" paradigm adopted by GPT-4 has obvious limitations: the model lacks planning, working memory, the ability to backtrack, and reasoning.
Since the model relies on a local greedy process of generating the next word, it does not yield a deep global understanding of the task or output. Thus, GPT-4 is good at generating fluent and coherent text, but not good at solving complex or creative problems that cannot be tackled in a sequential fashion.
For example, take multiplying and adding four random numbers in the range 0 to 9, a problem even elementary school students can solve: GPT-4's accuracy is only 58%.
Accuracy drops to 16% and 12% for numbers between 10 and 19 and between 20 and 39, respectively; when the numbers fall in the range 99 to 199, accuracy drops to 0.
However, accuracy can easily improve if GPT-4 is allowed to “take its time” answering questions. For example, ask the model to write out intermediate steps using the following hint:
116 * 114 + 178 * 157 = ?
Let’s think step by step and write down all the intermediate steps before producing the final solution.
At this point, for numbers in the range 1-40 the accuracy reaches 100%, and for the range 1-200 it still reaches 90%.
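For reference, the intermediate steps for the example prompt above work out as follows:

```python
# Spelling out the intermediate products, the way the
# "think step by step" prompt encourages the model to do.
a = 116 * 114       # 13224
b = 178 * 157       # 27946
total = a + b
print(a, b, total)  # 13224 27946 41170
```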
Marcus issued a rebuttal
Interestingly, shortly after Microsoft’s paper was published, Marcus immediately wrote a blog, calling Microsoft’s point of view “absurd.”
And he quoted a sentence from the Bible: "Pride goeth before destruction, and a haughty spirit before a fall. (Proverbs 16:18)"
How can GPT-4 be regarded as early AGI? By that standard, Marcus argues, a calculator would qualify too, and ELIZA and Siri even more so. The definition is so vague that it is easy to exploit its loopholes.
In Marcus's view, GPT-4 has nothing to do with AGI. Its shortcomings remain exactly as before: hallucinations persist, its answers are still unreliable, and, as the authors themselves admit, it still struggles with complex tasks and cannot plan.
What worries him about the OpenAI and Microsoft papers is that the models are not disclosed at all: nothing about the training set, nothing about the architecture. They claim scientific standing on the strength of what amounts to a press release.
Therefore, the "some form of AGI" claimed in the paper cannot be verified by the scientific community, because the training data is inaccessible, and the training data appears to have been contaminated.
To make matters worse, OpenAI has itself started incorporating user experiments into the training corpus. This obfuscation prevents the scientific community from judging a key capability of GPT-4: whether the model has the ability to generalize to new test cases.
If OpenAI hadn’t put a scientific hat on itself here, Marcus probably wouldn’t be so critical of it.
He admitted that GPT-4 is very powerful, but the risks are well known. If OpenAI insists on opacity and refuses to disclose the model, he argues, it would be better to take it offline.
Strong lineup of authors
Microsoft has a strong lineup of authors behind this 154-page paper.
These include: Sébastien Bubeck, principal researcher at Microsoft Research Redmond and 2015 Sloan Research Fellowship winner; Ronen Eldan, winner of the 2023 New Horizons in Mathematics Prize; Yin Tat Lee, 2020 Sloan Research Fellowship winner; and Yuanzhi Li, 2023 Sloan Research Fellowship winner.
It is worth mentioning that "Sparks of Artificial General Intelligence: Early experiments with GPT-4" was not the title the Microsoft team originally chose.
LaTeX code leaked from the unredacted paper shows that the original title was "First Contact with AGI."
References:
https://arxiv.org/abs/2303.12712
https://twitter.com/DV2559106965076/status/1638769434763608064
https://the-decoder.com/gpt-4-has-a-trillion-parameters/
https://garymarcus.substack.com/p/the-sparks-of-agi-or-the-end-of-science
This article comes from the WeChat public account: Xin Zhiyuan (ID: AI_era)