The entrepreneurial story that inspired me most last year was that of Dify founder Zhang Luyu.
I first met him at the “Xixi Taoism” event in 2023. Among the star-studded names there, Zhang Luyu was inconspicuous. When we met again in 2024, Dify was already a different story: an entrepreneur with no glamorous background had built one of the world’s most successful open-source AI products amid widespread doubts about the business model.
What happened to the company within a year, such as its unexpected popularity in the Japanese market, which is known as “conventional and easy to defend but difficult to attack”, helped me understand entrepreneurship better. It is mostly accident, and it also takes luck. Ultimately, you need the ability to find a way through constant change and things going against expectations.
Now a similar story is unfolding for another high-profile entrepreneur: Xiao Hong of Manus.im, and his team.
Four months ago, Xiao Hong described something that puzzled him: “The team is good at going from 0 to 1 and has a strong ability to seize opportunities. Once it starts going from 1 to N, it is not in such good shape.”
Most of his past ventures achieved relatively stable and considerable revenue, and his previous company was successfully acquired. In 2023, his new company “Butterfly Effect” even broke into an AI narrative dominated by the battle of hundreds of models with a browser plug-in, Monica.im, which became one of the fastest-growing AI applications thanks to its excellent product experience. He looks like an entrepreneur whose journey has been smooth, and he has done all this at only 32.
But in fact, he was not that happy. In Xiao Hong’s view, the so-called “serial exits” of an entrepreneur and the thrill of constantly going from 0 to 1 are like a besieged fortress: the ability to seize 0-to-1 opportunities is a real strength and very satisfying, but you also worry about whether you will have to do it all over again.
In 2024, industry insiders believed that AI assistants with memory functions such as Monica.im would face pressure from strong rivals such as Doubao, and that things would not be as easy as in 2023. Monica.im had a good 0 to 1, but was not necessarily a 1-to-N hit.
He was puzzled precisely because “the team is really going to do more difficult things and things with higher ceilings next”, exploring something that can go from 1 to N.
Earlier, many people following Monica.im assumed that this “something more difficult and with a higher ceiling” was the AI browser that had long been rumored but never released by the team. Looking at it now, that guess was wrong.
The more difficult exploration turned out to be: abandoning an AI browser that was ready for release, looking for the AI product of the next “ChatGPT moment”, settling on the goal of a universal agent, and building the newly released Manus.im.
How innovative Manus is and how far it can go are now hot topics. But what is most worth watching is still the direction found when “things go against expectations”, and the process of finding it. Manus.im may not take this team from 1 to N, or even replicate the momentum of Monica.im. But as the company’s name, “Butterfly Effect”, suggests, many small actions and decisions end up having a profound impact on the future. As in “Connect the Dots”, the road to tomorrow is hidden in today’s experience.
Since the second half of last year, the “Butterfly Effect” team’s AI browser had been a “semi-public” secret in the industry. Yet the product officially unveiled to the public was Manus, and it drew runaway attention.
If you have tried Manus yourself or watched the demonstration video, you will notice one significant difference from chatbots and some agent-like applications: Manus can execute tasks asynchronously and in parallel.
When you open an app such as Doubao or Kimi, or something like Computer Use, and send it a question, you have to wait for it to reply. If you talk to it while it is replying or performing a task, the previous reply or task is interrupted, so you can only hold an A-B-A-B turn-taking conversation with it.
In Manus.im, although it still looks like a chatbot product, you can hand it 20 requests and have it work on them simultaneously. Meanwhile you can do anything else on your computer, such as watching videos, writing documents, or playing games, without delaying its work. Manus notifies you when tasks are completed or when it runs into problems during execution. If you see its reasoning drift during a task, you can add a prompt in the dialog box at any time, and it will continue to think and execute with the new context.
The experience is asynchronous and parallel, and it really feels like having a team of interns working for you.
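The asynchronous, parallel experience described above can be illustrated with a minimal Python sketch. It is not Manus’s implementation: `run_task` and `submit_all` are hypothetical names, and the sleep stands in for real work done elsewhere. It only shows the pattern of starting many tasks at once and being notified as each one finishes.

```python
import asyncio
from typing import List


async def run_task(task_id: int, seconds: float) -> str:
    """Hypothetical stand-in for one long-running agent task."""
    await asyncio.sleep(seconds)  # simulates work happening elsewhere
    return f"task {task_id} done"


async def submit_all(durations: List[float]) -> List[str]:
    """Start every task immediately, then collect a notification per task."""
    tasks = [asyncio.create_task(run_task(i, d)) for i, d in enumerate(durations)]
    notifications = []
    # as_completed yields results in finishing order, not submission order,
    # so no task has to wait for an earlier one to reply first.
    for finished in asyncio.as_completed(tasks):
        notifications.append(await finished)
    return notifications


def demo() -> List[str]:
    """Submit three tasks of different lengths and return their notifications."""
    return asyncio.run(submit_all([0.03, 0.01, 0.02]))
```

Calling `demo()` returns one notification per submitted task. In a chat-style A-B-A-B exchange, by contrast, the second request could not even start until the first had been answered.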
In fact, Manus’s asynchronous product architecture grew out of a lesson the team learned from its previous, unreleased product, the AI browser. That lesson is also why the team, after investing a great deal of effort, decided to stop working on the browser in October last year.
The Browser Company announced on October 25, 2024 that it would stop developing new features for the Arc browser and decided to transfer resources to a new browser Dia, aiming to create a simpler and easier-to-use AI browser. |Source: Arc official website
“In the AI browser, the AI is constantly interrupting the user.” Because a browser is designed for a single user, once the AI is using it, you cannot. When the AI starts working, you can only watch, and it is hard to step in. As the AI takes over your mouse and computer, you not only dare not take them back, you are also afraid that accidentally touching the keyboard or mouse will make the whole process collapse and force you to start over.
This led the team to make two judgments:
In an interview with Zhang Xiaojun of Tencent Technology, Xiao Hong mentioned that when the team reviewed the product forms from Jasper to ChatGPT to Monica to Cursor to Devin, they found that the “human programmer” Devin suited this asynchronous architecture very well.
It is unlike Windsurf, which sometimes asks you to confirm whether a library should be installed on your computer, or runs a command-line operation and asks you to answer yes or no because the operation might really damage your computer or conflict with something. It makes you enter “yes” before proceeding, which in effect passes the responsibility to you.
Therefore, in the view of the Manus team, “A chatbot should have a computer in the cloud, and the code it writes and the things it needs to check through the browser are executed on that computer. Because it is a virtual server, it doesn’t matter if it breaks; you can just get another one. It can even release the server after the current task is completed.”
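The “disposable computer” idea can be approximated locally with a short Python sketch. This is an illustration under loud assumptions: a temporary directory plus a child process stands in for a cloud virtual server, it provides no real isolation, and `run_in_disposable_workspace` is a hypothetical helper rather than anything from Manus.

```python
import subprocess
import sys
import tempfile


def run_in_disposable_workspace(code: str, timeout_seconds: float = 10.0) -> str:
    """Run generated code in a fresh directory that is deleted afterwards.

    The temporary directory plays the role of the throwaway server: if the
    code breaks something inside it, nothing of the user's is affected, and
    the workspace is released as soon as the task ends.
    """
    with tempfile.TemporaryDirectory() as workspace:
        result = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workspace,          # files the code writes land here
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    # The workspace no longer exists at this point.
    if result.returncode != 0:
        return f"failed: {result.stderr.strip()}"
    return result.stdout.strip()
```

Because every task gets a fresh workspace, the agent never needs to ask the user for a yes/no confirmation before a risky step: the worst case is that one disposable environment is thrown away.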
It is worth noting that while Devin chose a vertical field and targets hard-core engineers, the Manus team chose a general-purpose, consumer-level AI assistant available on both Web and App. It can call tools and complete all kinds of tasks in work and life according to instructions. In the future, it is also meant to deliver task results at a price consumers can afford.
With a clear idea and goal, the next step is to realize the idea. How did Manus do it?
According to its product partner Zhang Tao, this requires equipping the large model with a computer, giving it system permissions (access to private APIs such as code repositories and professional data-query websites), and providing it with some training.
In this way, the AI can use this computer to open a browser and invoke tools, observe the effect of its actions on the real world from the feedback the tools return, think about the next step, act again, observe again, and so on. This is how the AI completes tasks through exploration and research. Along the way, Manus also comes to understand your requirements better and better under your “training”. In the future, even if you do not define your requirements clearly, it should still be able to guess what you want from the knowledge accumulated in each task.
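The act, observe, think cycle described above can be sketched as a small loop. Everything here is hypothetical and for illustration only: the `Agent` class, the rule-based `think` method (a toy stand-in for the large model), and the two fake tools are assumptions, not Manus’s design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]


@dataclass
class Agent:
    tools: Dict[str, Tool]
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def think(self, goal: str, observation: str) -> Tuple[str, str]:
        """Toy policy standing in for the model: choose (tool, argument)."""
        if observation == "":
            return "search", goal
        if observation.startswith("results:"):
            return "read", observation[len("results:"):]
        return "finish", observation

    def run(self, goal: str, max_steps: int = 5) -> str:
        observation = ""
        for _ in range(max_steps):
            tool, arg = self.think(goal, observation)   # think
            if tool == "finish":
                return arg
            observation = self.tools[tool](arg)         # act, then observe
            self.history.append((tool, arg, observation))
        return observation


def fake_search(query: str) -> str:
    """Hypothetical tool: pretends to search and returns a page id."""
    return "results:page-1"


def fake_read(page: str) -> str:
    """Hypothetical tool: pretends to open a page and return its text."""
    return f"content of {page}"


agent = Agent(tools={"search": fake_search, "read": fake_read})
answer = agent.run("penguin species")
```

The point of the sketch is the shape of the loop: each tool result becomes the next observation, and the next decision is made from that observation rather than from a fixed script.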
Li Bojie, Huawei’s young genius and founder of Logenic AI, believes that one thing sets Manus apart from other products: it solves problems the way a geek programmer would. |Image source: WeChat screenshot
The Manus product philosophy gradually became clear through the team’s product practice: Less Structure, More Intelligence.
This was also what gave the Manus team its “A-Ha, wait!” moments. For example, this happened to the team in January this year:
Manus was asked to try a question from the GAIA test set: in a YouTube video in a National Geographic-like style, various penguins move in and out of the frame, and the task is to find the maximum number of penguin species that appear on screen at the same time.
Then, something magical happened.
Manus first opened the video link, and its first action was “Press K”. Then it took screenshots one by one to record which types of penguin appeared in which frame. Finally, it concluded that the fullest frame showed 3 types of penguins. Manus then went back to check, and its next action was “Press 3”… After the final check, the answer was 3.
As the people who built Manus, the team should know the boundaries of its capabilities, but the reality for them is that “there are always surprises.” The surprise was not only that Manus got the question right: even people who have used computers and YouTube for many years may not know what the “K” and “3” keys do.
Somewhat dazed by what they had just seen, the team retraced Manus’s steps. “K” is the pause key, which let Manus pause the video and take screenshots one by one to record which penguins appear in which frame. “3” is also a shortcut: the keys 0 to 9 jump to 0% to 90% of the progress bar, so 3 jumps to 30%. That let Manus go straight to the right moment in the video and report how many kinds of penguins are in that picture.
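The arithmetic behind the digit shortcut, and the bookkeeping for the penguin count, can be written down in a few lines of Python. Both function names are hypothetical, and the frame data in the usage note is made up for illustration; only the 0 to 9 mapping to 0% to 90% comes from the description above.

```python
from typing import List, Sequence


def seek_target_seconds(key: str, duration_seconds: float) -> float:
    """Map a digit key to a playback position: key n jumps to n * 10%."""
    if len(key) != 1 or key not in "0123456789":
        raise ValueError("expected a single digit key from 0 to 9")
    return duration_seconds * int(key) / 10


def max_species_in_one_frame(frames: Sequence[List[str]]) -> int:
    """Given the species seen in each screenshot, return the largest
    number of distinct species visible in any single frame."""
    return max((len(set(frame)) for frame in frames), default=0)
```

For a 200-second video, `seek_target_seconds("3", 200.0)` gives 60.0 seconds. With illustrative screenshots such as `[["emperor"], ["emperor", "adelie", "rockhopper"], ["adelie", "adelie"]]`, `max_species_in_one_frame` returns 3, which mirrors the record-then-verify procedure in the anecdote.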
“This process is different from a traditional chatbot. First, it can watch the YouTube picture instead of the subtitles. Second, we even found that it was using YouTube shortcut keys. We were very shocked that it answered this question.” Xiao Hong also mentioned this scene in an earlier interview with Tencent Technology.
It suddenly became clear that Manus was not only better at programming than humans: its knowledge of the Web and the apps people use every day far exceeded what anyone had imagined. Like an all-knowing AI, it seems to understand every way of using a tool and then pick the best one.
This once again impressed “Less Structure, More Intelligence” on the team: minimize human-imposed restrictions on the AI, and let it work through its own evolution rather than teaching it what to do.
At the very bottom of the Manus official website, the most important discovery behind Manus is quietly presented: “Less Structure, More intelligence”. |Screenshot source: Manus
On the day Manus launched, Peak, co-founder and chief scientist of “Butterfly Effect”, explained and expanded on the most important first principle behind the product, “Less Structure, More Intelligence”:
When your data is of high quality, your model is smart enough, your architecture is flexible enough, and your engineering is solid enough, concepts such as Computer Use, Deep Research, and Coding Agent will change from product features to naturally emerging capabilities.
Returning to first principles also gives us a new way of thinking about product form:
· An AI browser does not add AI to the browser, but makes a browser for AI;
· AI search does not recall and summarize from an index, but lets AI obtain information with the user’s permissions;
· Operating the GUI does not mean seizing control of the user’s device, but giving the AI its own virtual machine;
· Writing code is not the end goal, but a general medium for solving all kinds of problems;
· The difficulty in generating a website is not building the framework, but making the content meaningful;
· Attention is not all you need. Only by freeing users’ attention can DAU be redefined.
By discovering and practicing “Less Structure, More Intelligence” time after time, Manus has produced results beyond expectations: its pass@1 score on the GAIA benchmark exceeds the score of OpenAI Deep Research under cons@64, and in internal tests Manus was able to directly cover 76% of the scenarios of dedicated agent products in Y Combinator W25.
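For readers unfamiliar with the two metrics, here is a minimal Python sketch of the usual reading of the terms: pass@1 scores a single attempt per question, while cons@k takes the majority answer among k sampled attempts (k = 64 for cons@64). The function names and toy data are illustrative assumptions, and real evaluations often estimate pass@1 by averaging over several samples rather than using one attempt.

```python
from collections import Counter
from typing import Sequence


def pass_at_1(first_answers: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of questions whose single attempt matches the truth."""
    correct = sum(a == t for a, t in zip(first_answers, truths))
    return correct / len(truths)


def consensus_answer(samples: Sequence[str]) -> str:
    """Majority vote over the sampled answers for one question."""
    return Counter(samples).most_common(1)[0][0]


def cons_at_k(sampled_answers: Sequence[Sequence[str]], truths: Sequence[str]) -> float:
    """Fraction of questions whose majority-vote answer matches the truth."""
    voted = [consensus_answer(samples) for samples in sampled_answers]
    return pass_at_1(voted, truths)


truths = ["3", "7"]
single_attempts = ["3", "5"]                          # one try per question
many_attempts = [["3", "3", "4"], ["5", "7", "7"]]    # k = 3 tries per question
```

In this toy data, `pass_at_1(single_attempts, truths)` is 0.5 while `cons_at_k(many_attempts, truths)` is 1.0. Voting over many samples usually helps a system, which is why a pass@1 score beating another system’s cons@64 score is a notable comparison.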
Now, the value of these insights is being discussed on a larger scale:
Clement Delangue, founder and CEO of Hugging Face, shared Peak’s findings on X: some open-source base models are simply trained to “answer all questions in one round regardless of the complexity of the questions.” That is a requirement of the chatbot scenario, however, and just a little post-training on agent trajectories can make a huge difference immediately. |Screenshot source: X
Manus does not introduce MCP (Model Context Protocol), but allows AI to write its own code to call APIs to handle various long-tail tasks. |Screenshot source: X
In discussions about Manus over the past few days, one of the most common questions I have heard is: is a “universal AI Agent” feasible? Where are its boundaries?
In Peak’s view, the way people interact with the world is actually quite standardized: eyes, hands, and ears. If the action space is well defined, it should be possible to place an agent into any step that was originally performed by a human.
Since people can use various tools to do deep work in vertical fields, an agent that has good enough knowledge, has been properly trained, and has a good interface for interacting with the world should be able to work like a person, and can even be made to use a particular SaaS product. For example, the house-hunting case shown on the Manus.im official website actually has the AI working with a SaaS product dedicated to real estate.
He believes that what should be clearly defined is the boundary of the tools the agent can use, not which group of people it serves. Manus is not simulating a person who does one specific job, nor is it a role-based agent split into R&D, product manager, and so on. It is simulating a person who can get things done, the way an intern works.
Manus’s multi-agent system refers to the separation of planning and execution.
For the executor, Manus adopted Claude, which currently leads in programming, long-horizon planning, and step-by-step problem solving, and it also uses a series of post-trained Qwen models.
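The separation of planning and execution can be pictured with a deliberately tiny Python sketch. It is only an illustration of the division of labor: `toy_planner`, `toy_executor`, and `run_plan` are hypothetical, and in a real system each role would be played by a model rather than a string template.

```python
from typing import Callable, List

Planner = Callable[[str], List[str]]
Executor = Callable[[str], str]


def toy_planner(goal: str) -> List[str]:
    """Hypothetical planner: breaks a goal into ordered steps."""
    return [f"research: {goal}", f"draft: {goal}", f"review: {goal}"]


def toy_executor(step: str) -> str:
    """Hypothetical executor: carries out one step and reports back."""
    return f"done({step})"


def run_plan(goal: str, planner: Planner, executor: Executor) -> List[str]:
    """The planner decides what to do; the executor does each step in turn."""
    return [executor(step) for step in planner(goal)]


results = run_plan("market report", toy_planner, toy_executor)
```

Keeping the two roles behind separate interfaces means either side can be swapped out, for example using a different model as executor, without touching the other.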
Yesterday, Manus also reached a strategic cooperation with Alibaba Tongyi Qianwen, committed to realizing all the functions of Manus on domestic models and computing power platforms. |Image source: Manus
In the planner part, Manus has done a lot of work.
The off-the-shelf APIs and models currently on the market are essentially aligned for chatbot scenarios. During training, no matter how complex the user’s question is, the optimization goal is to answer it clearly in a single reply, which is the exact opposite of the planning an agent needs.
So if an existing model is used directly in an agent scenario without this “alignment”, it will always rush for a quick result and produce a glossed-over answer within a single round of dialogue, much like the bullet-point summaries we often see.
“The alignment methods should be different. Our team believes that different data are needed to perform special alignment,” Xiao Hong said.
In October last year, Peak also documented on Zhihu the progress and setbacks of a hobby project that tried to reproduce OpenAI o1: the open-source Steiner model. That project was in fact pre-research for the step-by-step planning part of the Manus planner.
Broadly speaking, Manus is simulating a person who gets things done. That is the team’s product definition of Manus as a general-purpose AI assistant. As for its boundaries, the team is probably still exploring them and needs more user cases.
In an interview with Tencent Technology published before the release of Manus, Xiao Hong shared his early thinking on the generality of Manus. “A very core issue, or a very important responsibility of product managers, is to manage user expectations. Suppose people assume it can do everything in the world, for example: how do I make $1 million? That is not something an Agent should be doing. But if we give more specific examples so that everyone’s expectations are more reasonable, everyone will use it more smoothly.”
In the early morning of February 27, Manus product partner Zhang Tao and chief scientist Ji Yichao (Peak) shed tears when they saw the Manus.im leaderboard results. Manus’s performance on the GAIA Benchmark exceeded that of OpenAI’s Deep Research, and it achieved this unexpected result at about 1/10 of OpenAI’s cost ($2/task).
Image source: Manus.im
At a time when agents had become the consensus battleground across the industry, a team of a few dozen people became one of the first to ship a universal agent product. Their product engineering and front-end interaction experience also stand out.
Positive feedback from shipping something is better than anything else, and there is no better incentive for a startup team. But before that, how did Manus come about? Why was it this team that built it?
“Today’s model capabilities are enough to complete some complex, multi-step tasks. But there are no such products, so people can’t feel it.” This insight, which Xiao Hong shared in an earlier interview with Tencent Technology, helps answer the question.
At the same time, “not many teams have the opportunity to try building Agent products, because it requires many combined capabilities. A team needs to have worked on chatbots, on something related to AI programming, and on something browser-related, because the agent has to call a browser, and it needs a good sense of the boundaries of LLMs: what level they have reached today and what level they will reach next. First of all, there are not many companies that have all of these capabilities at once, and those that do may be busy with a very specific business. Some of our colleagues just happened to have the time to work on this together.”
“Just happened to.”
The “Butterfly Effect” team had gathered all the ingredients needed to build such a universal agent, which is why there is now a universal agent with a relatively high degree of completion by industry standards.
When asked what the decisive moment was that made him want to start Manus, Peak filled in more details. He said, “There is actually no ‘clean’ pivot in entrepreneurship.” Everything is continuous, with no clear boundaries.
“While building products, I also pay close attention to what is happening outside. There were a few things at that time. First, when I was working on the browser, I built an on-device model. Later I found that a browser has to cover a very wide range of scenarios with different characteristics. Along the way, I discovered that base models were getting stronger at an accelerating rate, and that the gap between them and an agent might be an alignment problem, even though the outside world may feel that large language models have gradually converged and hit a wall.”
Meanwhile, the outside world was also changing. Cursor took off early last year, followed by Windsurf and Devin. They belong to the same thread: agents became popular in programming first, and they did so progressively. Cursor is a copilot for programmers that improves programming efficiency. Windsurf started to introduce some automated workflows, giving you stronger automation on your local machine. Devin reached a new level of automation.
VC trends point the same way. For example, last year and the year before, YC invested in two types of companies: one is cloud browsers, such as Browserbase; the other is lightweight AI sandbox virtual machines similar to e2b.
This shows that “the foundations of the models are maturing rapidly, and the infrastructure is maturing rapidly too. In addition, seeing that outside products were gradually gaining acceptance, we felt this was a direction worth going all-in on. It was a very gradual and smooth process. Moreover, the infrastructure we had accumulated while developing the browser, such as Chromium, could be migrated over seamlessly, which is why we dared to build browsers in the cloud.”
In summary, Manus grew out of the keen sense of user needs and model capabilities, and the experience, that the team built up while making so-called “shells”. Many of Monica’s scenarios required post-training of models. At the same time, the most important lesson, “less structure, more intelligence”, was reinforced in the team’s work on the AI browser. The team found that model capability had reached the level needed for agents, and that the problem lay in alignment. What followed was three months of rapid evolution for Manus.
Earlier, the “Butterfly Effect” team was questioned over the value of “shelling”. It built Monica by integrating existing large models rather than developing its own. Monica combined functions such as chat, search, reading, writing, and translation, and also wired in many task-execution scenarios one API at a time. By the end of last year, it had reached tens of millions of users.
Now that Doubao, Quark, and Yuanbao are all vigorously promoting their own Monica-like products, and a small team has used existing technology to create the first general-purpose consumer-level agent, it is time to re-understand the “shell”.
What exactly are a “shell” and “shelling”?
In Xiao Hong’s view, all breakthroughs are brought about by models; progress is basically model-driven and model-first. A shell presents the model’s technical innovations in a form users can perceive, packaging the model’s new capabilities in the way users can best experience them.
By this definition, the DeepSeek App (including its display of the chain of thought) is a shell of DeepSeek-R1, Cursor is a shell of Anthropic’s Claude 3.5 Sonnet, Perplexity is a shell of GPT-4, and ChatGPT is a shell of InstructGPT.
As model capabilities evolve rapidly, “that shell” also needs to evolve. After each generation of models, the one that turns the new capability into user-perceivable value is not necessarily the original vendor. It may be a third party, just as Cursor delivered user-perceivable value for Claude 3.5 Sonnet.
March 5 marked the second anniversary of the release of Monica.im. The answer to why these few dozen people have achieved a product experience beyond that of the various Deep Research products and OpenAI’s Operator lies in their understanding and practice of the shell.
How to make the best shell for a new model that can be used as an agent?
As one of Manus’s builders, Zhang Tao says, “Looking at the whole architecture from the backend, there is still a lot of unfinished work in every part, and each of those parts is key to success; they are all what makes the product different on the surface.”
From the team’s perspective, the most important advantage is the pace of innovation. Both applications and models have now reached a state of relative saturation. While the “data flywheel” and “network effects” have yet to be verified, the only real core capability in the end is to run fast.
“In a brand new field, everything is uncertain and unknown. The most important thing is the speed of innovation. What we strive for is exploration, trial and error in various directions, and quickly finding the right path.” The Manus team is flexible in its management philosophy, organizational structure, and working processes. When a new opportunity arises, it can pull together the whole company’s limited resources, make decisions at very high speed, and adjust quickly to feedback from mistakes.
From left to right are “Butterfly Effect” chief scientist Peak, CEO Xiao Hong, and product partner Zhang Tao | Image source: Internet
As for expectations for Manus, Xiao Hong believes that “even if there is only a window period, it is worth a try.” His thinking has also changed drastically over the past year. For example, he now believes that “when you realize you are ahead, you should be more aggressive, super aggressive. Looking back today, I feel that Monica in 2023 was not aggressive enough.” “If you know that you are innovating and you are leading, you should be aggressive.”
I don’t know whether Manus can bring Xiao Hong and his team the experience of leaping from 1 to N, but this team, which understands the “shell” better than anyone, believes in creating with mind and hand as one, and in the butterfly effect that creation brings. The name Manus comes from MIT’s motto, Mens et manus, which emphasizes the unity of mind and hand. Thinking alone is not enough; you have to act, and only knowledge that can have an impact on the real world is real knowledge.
In the future, as more of the work accumulated behind Manus is open-sourced, an even wider butterfly effect will be set off.
This article is reproduced from [GEEEKPARK], and the copyright belongs to the original author [Wan Chen]. If you have any objection to the reprint, please contact the Gate Learn team, and the team will handle it as soon as possible according to the relevant procedures.
Disclaimer: The views and opinions expressed in this article represent only the author’s personal views and do not constitute any investment advice.
Other language versions of this article are translated by the Gate Learn team. Unless Gate.io is credited, the translated article may not be reproduced, distributed, or plagiarized.
Пригласить больше голосов
Содержание
The entrepreneurial story that received the most spiritual nourishment last year came from Dify founder Zhang Luyu.
The first time I met him was at the “Xixi Taoism” event in 2023. Among the star-studded names at the scene, Zhang Luyu was inconspicuous. When we meet again in 2024, Dify is already another story - an entrepreneur with no glamorous background, who made one of the world’s most successful AI open source products amidst everyone’s doubts about the business model.
What happened to this company in one year, such as its unexpected popularity in the Japanese market, which is “conventional and easy to defend but difficult to attack”, helped me further understand “entrepreneurship”. It’s mostly accidents, and it also requires luck. Ultimately, you need to have the ability to find a way out of constant changes and backfires.
Now, a similar story happened to another high-profile entrepreneur—Manus.im Xiao Hong and his team.
Four months ago, Xiao Hong mentioned a confusion, “The team is good at going from 0 to 1 and has a strong ability to seize opportunities. Once it starts from 1 to N, the state is not that good.”
In his past experience, most entrepreneurial projects have achieved relatively stable and considerable revenue, and his last company was also successfully acquired. In 2023, his new company “Butterfly Effect” even used a browser plug-in, Monica.im, to compete in the AI narrative of hundreds of models and become one of the fastest-growing AI applications with excellent product experience. It seems that he is an entrepreneur who has had a smooth journey. He is only 32 years old when he can do these things.
But in fact, he didn’t feel too happy. In Xiao Hong’s view, the so-called “continuous exit of entrepreneurs” and the so-called refreshing feeling of constantly going from 0 to 1 are like a siege - the ability to seize opportunities from 0 to 1 is very strong and very satisfying, but on the other hand, you are also worried about whether you will need to do it again.
In 2024, industry insiders believe that AI assistants with memory functions like Monica.im will face pressure from strong opponents such as Doubao, and it will not be as easy as in 2023. Monica.im has a good 0 to 1, but not necessarily a 1 to N hit.
And the reason why he is confused is because “the team is really going to do more difficult things and things with higher ceilings next” and explore things that can span from 1 to N.
Earlier, many voices paying attention to Monica.im assumed that this “something more difficult and with a higher ceiling” refers to the AI browser that has been rumored for a long time but has not been released by the team. Looking at it now, it is true that I guessed wrong.
This more difficult exploration is actually:Abandoning the AI browser that has reached release status, looking for the next “ChatGPT moment” AI product, finding the goal of a universal agent, and creating the latest release of Manus.im.
To what extent Manus is innovative and what level it can achieve in the future is now a hot topic. But what is worth watching is still the direction found in “things go against expectations” and the process of finding the direction. Manus.im may not be able to enable this team to accomplish things from 1 to N, or even replicate the momentum of Monica.im, but just like the name of this company - “Butterfly Effect”, many small actions and decisions inadvertently have a profound impact on the future, “Connect the Dots”, the road to tomorrow will be hidden in today’s experience.
Since the middle to late last year, the “Butterfly Effect” team’s AI browser has become a “semi-public” secret in the industry. The product that was officially unveiled to the public was Manus, which attracted uncontrollable attention.
If you have personally experienced Manus or watched the demonstration video, you will feel that it has a significant difference compared with chatbots or some agent-like applications:Manus can execute tasks asynchronously and in parallel.
When you open an app like Doubao, Kimi, or something like Computer Use and send it a question, you have to wait for it to reply. Otherwise, if you talk to it while it is replying or doing a task, the previous reply/task will be interrupted, and you can only have an A-B-A-B relay conversation with it.
However, in Manus.im, although it still looks like a chatbot product, you can ask 20 questions for it to perform tasks simultaneously. You can do anything else on the computer aside, watch videos, write documents, play games, etc., without delaying its work. Manus can notify you once these tasks are completed or problems are encountered during execution. If you see deviations in its thinking during the execution of a task, you can add prompt words to the dialog box at any time, and it will continue to think and execute the task with the new context.
The experience is asynchronous and can be parallelized, and it really feels like having a team of real interns who can help you work.
In fact, Manus’s product architecture design for asynchronous experience originated from a lesson the team learned in its previous undisclosed product, the AI browser. At the same time, this is also the reason why the team invested a lot of energy but decided to stop working on the browser in October last year.
The Browser Company announced on October 25, 2024 that it would stop developing new features for the Arc browser and decided to transfer resources to a new browser Dia, aiming to create a simpler and easier-to-use AI browser. |Source: Arc official website
“In the AI browser, AI is constantly interrupting the user.” Because a browser is designed for a single user, once the AI is using it, you cannot. When the AI starts working, you can only watch, and it is hard to step in. As the AI takes over your mouse and computer, you do not dare take them back, and you also worry that accidentally touching the keyboard or mouse will make the whole process collapse and force you to start over.
This led the team to make two judgments:
In an interview with Zhang Xiaojun of Tencent Technology, Xiao Hong mentioned that when the team was summarizing the product forms from Jasper to ChatGPT to Monica to Cursor to Devin, they found that the “human programmer” Devin was very suitable for this asynchronous experience architecture.
This is unlike Windsurf, which sometimes asks you to confirm whether a library should be installed on your computer, or runs a command-line operation and asks you to answer yes or no because the operation might really damage your computer or conflict with something. It needs your “yes” to proceed to the next step, which in effect shifts the responsibility onto you.
Therefore, in the view of the Manus team, “Chatbot should have a computer in the cloud, and the code it writes and the things to be checked through the browser are executed on that computer. Because it is a virtual server, it doesn’t matter if it breaks down, you can get another one. It can even release the server after the current task is completed.”
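The “throwaway computer” idea can be sketched on a single machine. This is only an illustration under loud assumptions: a temporary directory plus a child process stands in for the cloud virtual server, it provides no real isolation, and `run_in_throwaway_workspace` is a hypothetical name, not part of any Manus API.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_throwaway_workspace(code: str) -> str:
    """Run generated code in a scratch directory that is deleted afterwards.

    Assumption: a temp directory is a stand-in for a disposable cloud VM.
    It is NOT a security sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code)
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=30,
        )
        return result.stdout  # the workspace is released when the block exits


if __name__ == "__main__":
    print(run_in_throwaway_workspace("print(1 + 1)"))
```

Because the workspace is created per task and discarded at the end, nothing the generated code does there needs the user’s confirmation, which is the point the team makes about not asking for a “yes” at every step.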
It is worth noting that while Devin chose vertical fields and hard-core engineers, the Manus team chose general-purpose, consumer-level AI assistants, including Web and App. It is a general-purpose AI assistant that can call tools and complete various tasks in work and life according to instructions. In the future, it will also deliver task results at an affordable price for consumers.
With a clear idea and goal, the next step is to realize the idea. How did Manus do it?
According to its product partner Zhang Tao, this requires equipping the large model with a computer, giving it system permissions (access to private APIs such as code repositories and professional data query websites), and providing it with a certain amount of training.
With this computer, the AI can open a browser and call tools, observe the effect of its actions on the real world through the feedback the tools return, think about the next step, act again, observe again, and so on. This is how the AI completes tasks through exploration and research. Along the way, Manus also comes to understand your requirements better under your “training”. In the future, even if you do not state your requirements clearly, it should be able to “read your mind” based on the knowledge accumulated in each task.
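The act, observe, think cycle described above can be written as a short loop. This is a generic sketch, not Manus’s implementation: `run_agent`, `toy_decide`, and `toy_tools` are hypothetical, and a hard-coded function stands in for the model that would normally choose the next action.

```python
from typing import Callable, Optional

Tool = Callable[[str], str]
Step = tuple[str, str, str]  # (tool name, argument, observation)


def run_agent(
    goal: str,
    decide: Callable[[str, list[Step]], Optional[tuple[str, str]]],
    tools: dict[str, Tool],
    max_steps: int = 10,
) -> list[Step]:
    """Repeat think -> act -> observe until the policy stops or steps run out."""
    history: list[Step] = []
    for _ in range(max_steps):
        action = decide(goal, history)  # think: pick the next tool call, or None
        if action is None:
            break
        tool_name, arg = action
        observation = tools[tool_name](arg)  # act, then observe the feedback
        history.append((tool_name, arg, observation))
    return history


def toy_decide(goal: str, history: list[Step]) -> Optional[tuple[str, str]]:
    """Hypothetical policy: search first, then summarize, then stop."""
    if not history:
        return ("search", goal)
    if len(history) == 1:
        return ("summarize", history[-1][2])
    return None


toy_tools: dict[str, Tool] = {
    "search": lambda query: f"results for {query}",
    "summarize": lambda text: text.upper(),
}

if __name__ == "__main__":
    for step in run_agent("penguins", toy_decide, toy_tools):
        print(step)
```

Each observation is appended to the history that the next decision sees, which is what lets the loop correct itself or incorporate new context mid-task.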
Li Bojie, Huawei’s young genius and founder of Logenic AI, believes that Manus has a unique feature that makes it different from other products: it solves problems in the way of geek programmers. |Image source: WeChat screenshot
The philosophy behind Manus gradually became clear through the team’s product practice: Less Structure, More Intelligence.
It was also behind the team’s “Aha, wait!” moments. One example happened in January this year:
Manus was asked to try a question from the GAIA test set: in a YouTube video in the style of National Geographic, various penguins move in and out of the frame, and the task is to find the maximum number of penguin species that appear in a single frame at the same time.
Then, something magical happened.
Manus first opened the video link, and its first action was “Press K”. It then took screenshots one by one to record which species of penguin appeared in which frame, and concluded that the fullest frame contained 3 species. Next, Manus went back to check, and its next action was “Press 3”… After the final check, the answer was 3.
The people who built Manus ought to know the boundaries of its capabilities, but for the team the reality is that “there are always surprises.” The surprise was not only that Manus got the question right: even people who have used computers and YouTube for many years may not know what the “K” and “3” keys do.
Looking at the somewhat baffling scene in front of them, the team retraced Manus’s steps. “K” is the play/pause shortcut, which let Manus pause the video and take screenshots one by one to record which penguin appeared in which frame. “3” is also a shortcut: the keys 0 to 9 jump to 0% to 90% of the progress bar, so 3 jumps to 30%. With it, Manus could locate that exact moment of the video and then report how many species of penguins were in the picture.
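The arithmetic behind the two steps above is simple enough to write down. The sketch below is illustrative only: `seek_position` models the digit-key behaviour the team describes (0 to 9 mapping to 0% to 90% of the video), and `max_species_in_one_frame` assumes the screenshots have already been labelled with species names, which is a hypothetical input format.

```python
def seek_position(digit: int, duration_s: float) -> float:
    """Timestamp reached by pressing a digit key: digit * 10% of the duration."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return duration_s * digit / 10


def max_species_in_one_frame(frames: list[list[str]]) -> int:
    """Largest number of distinct species seen together in any single frame."""
    return max((len(set(frame)) for frame in frames), default=0)


if __name__ == "__main__":
    # Hypothetical labels for three paused screenshots.
    frames = [
        ["emperor"],
        ["emperor", "adelie", "emperor"],
        ["emperor", "adelie", "rockhopper"],
    ]
    print(seek_position(3, 200.0))
    print(max_species_in_one_frame(frames))
```

Note that duplicates within a frame are collapsed with `set`, because the question asks for species, not individual birds.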
“This process is different from a traditional chatbot. First, it can look at the YouTube video frames rather than the subtitles. Second, we even found that it was using YouTube shortcut keys. We were shocked by how it answered this question.” Xiao Hong also mentioned this scene in a previous interview with Tencent Technology.
It suddenly became clear that Manus was not only better at programming than people, but that its knowledge of the Web and of the apps people use every day far exceeded expectations. Like an omniscient assistant, it seems to know every way of using a tool and then chooses the best one.
This once again allowed the team to feel “Less Structure, More intelligence” - minimizing artificial restrictions on AI and allowing AI to function through its own evolution rather than teaching it what to do.
At the very bottom of the Manus official website, the most important discovery behind Manus is quietly presented: “Less Structure, More intelligence”. |Screenshot source: Manus
On the day Manus launched, Peak, co-founder and chief scientist of “Butterfly Effect”, explained and expanded on the most important first principle behind the product, “Less Structure, More Intelligence”:
When your data is of high quality, your model is smart enough, your architecture is flexible enough, and your engineering is solid enough, concepts such as Computer Use, Deep Research, and Coding Agent will change from product features to naturally emerging capabilities.
Returning to first principles also gives us a new way of thinking about product form:
· AI browser does not add AI to the browser, but makes a browser for AI;
· AI search does not recall and summarize from the index, but allows AI to obtain information with user permissions;
· Operating the GUI does not snatch control of the user’s device, but allows the AI to have its own virtual machine;
· Writing code is not the end goal, but a general medium for solving various problems;
· The difficulty in generating a website is not to build a framework, but to make the content meaningful;
· Attention is not all you need. Only by liberating users’ attention can DAU be redefined;
Through the discovery and practice of “Less Structure, More intelligence” time after time, Manus has produced results beyond expectations, including the pass@1 score in the GAIA benchmark exceeding the score of OpenAI Deep Research under cons@64; at the same time, in internal tests, Manus was also able to directly cover 76% of the scenarios of dedicated agent products in Y Combinator W25.
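The two metrics in this comparison are not defined in the article. A common reading, assumed here, is that pass@1 scores a single attempt per question, while cons@64 scores the majority answer among 64 sampled attempts. Under that assumption the difference can be sketched as follows; the function names and toy data are hypothetical.

```python
from collections import Counter


def pass_at_1(first_answers: list[str], truths: list[str]) -> float:
    """Fraction of questions answered correctly on the single first attempt."""
    return sum(a == t for a, t in zip(first_answers, truths)) / len(truths)


def cons_at_k(samples: list[list[str]], truths: list[str]) -> float:
    """Fraction of questions where the most common of k sampled answers is correct."""
    correct = 0
    for answers, truth in zip(samples, truths):
        majority, _count = Counter(answers).most_common(1)[0]
        correct += majority == truth
    return correct / len(truths)


if __name__ == "__main__":
    truths = ["3", "7"]
    print(pass_at_1(["3", "5"], truths))
    print(cons_at_k([["3", "3", "2"], ["5", "7", "7"]], truths))
```

Majority voting over many samples usually helps a model, so matching or beating a cons@64 score with a single attempt is the harder side of the comparison.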
Now, the value of these insights is being discussed on a larger scale:
Clement Delangue, founder and CEO of Hugging Face, cited one of Peak’s findings: some open-source base models are simply trained to “answer all questions in one round regardless of the complexity of the questions.” That is a requirement of the chatbot scenario, however, and just doing some post-training on agent trajectories can make a huge difference immediately. |Screenshot source: X
Manus does not introduce MCP (Model Context Protocol), but allows AI to write its own code to call APIs to handle various long-tail tasks. |Screenshot source: X
In discussions about Manus over the past few days, one of the most common questions I’ve heard is: Is a “universal AI Agent” feasible? Where is the boundary?
In Peak’s view, people interact with the world in a fairly standard way, through eyes, hands, and ears. So if the action space is well defined, it should be possible to embed an agent into a step that was originally performed by a human.
Since people can use various tools to complete deep operations in vertical fields, if an agent itself has good enough knowledge, has been properly trained, and has a good interface for interacting with the world, it should be able to work like a person, and even let the agent use a certain SaaS product. For example, a house-hunting case presented on the official website of Manus.im actually involves letting AI work with a SaaS product dedicated to the real estate field.
He believes that what should be clearly defined is the boundary of the agent’s use of tools, rather than which group of people it serves. Manus is not simulating a person who does specific things, nor is it a role agent divided by R&D, product manager, etc.; it is simulating a person who can do things, and simulating how an intern works.
Manus’s multi-agent system refers to the separation of planning and execution.
For the executor, Manus uses Claude, which currently leads in programming, long-horizon planning, and step-by-step problem solving, and it also uses a series of post-trained Qwen models.
Yesterday, Manus also reached a strategic cooperation with Alibaba Tongyi Qianwen, committed to realizing all the functions of Manus on domestic models and computing power platforms. |Image source: Manus
In the planner part, Manus has done a lot of work.
The off-the-shelf APIs and models currently on the market are essentially aligned for chatbot scenarios. During training, no matter how complex the user’s question is, the optimization goal is to answer it clearly in a single reply, which is the complete opposite of the planning an agent requires.
So if an existing model on the market is used directly in an agent scenario without this “alignment”, it will always rush for a quick result and give a muddled answer within a single round of dialogue, much like the many bullet-point summaries.
“The alignment methods should be different. Our team believes that different data are needed to perform special alignment,” Xiao Hong said.
In October last year, Peak also recorded on Zhihu the progress and setbacks of a hobby project that attempted to reproduce OpenAI o1: the open-source Steiner model. In fact, this project was pre-research for the step-by-step planning part of the Manus planner.
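The separation of planning and execution described above can be sketched as two swappable components. This is a structural illustration only: the `Agent` class and the toy planner and executor are hypothetical, and plain functions stand in for the models (a planner model and an executor such as Claude or Qwen) that the article mentions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    planner: Callable[[str], list[str]]  # goal -> ordered list of steps
    executor: Callable[[str], str]       # one step -> result of that step
    log: list[tuple[str, str]] = field(default_factory=list)

    def run(self, goal: str) -> list[tuple[str, str]]:
        """Plan once over several steps, then hand each step to the executor."""
        for step in self.planner(goal):
            self.log.append((step, self.executor(step)))
        return self.log


def toy_planner(goal: str) -> list[str]:
    """Hypothetical planner: a real one would be a model aligned for planning."""
    return [f"collect sources for {goal}", f"write summary of {goal}"]


def toy_executor(step: str) -> str:
    """Hypothetical executor: a real one would call tools or write code."""
    return f"finished: {step}"


if __name__ == "__main__":
    for step, result in Agent(toy_planner, toy_executor).run("penguin report"):
        print(step, "->", result)
```

Keeping the two roles behind separate interfaces means either can be replaced independently, for example swapping the executor model without retraining the planner.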
Generally speaking, Manus is simulating a person who does things. This is the team’s product definition of Manus as a general-purpose AI assistant. As for thinking about its boundaries, the team is probably still exploring it and needs more user use cases.
In an interview with Tencent Technology published before the launch of Manus, Xiao Hong shared his initial thoughts on the versatility of Manus. “A very core issue, or a very important responsibility of product managers, is to manage user expectations. Suppose users assume it can do everything in the world and ask, for example: How do I make $1 million? That is not something an agent should be asked to carry out. But if we can give more specific examples to make everyone’s expectations more reasonable, everyone will use it more smoothly.”
In the early morning of February 27, Manus product partner Zhang Tao and chief scientist Ji Yichao (Peak) shed tears when they saw the Manus.im ranking results. Manus’s performance on the GAIA Benchmark exceeded that of OpenAI’s Deep Research, and it achieved this unexpected result at about 1/10 the cost ($2/task) of OpenAI’s benchmark.
Image source: Manus.im
At a time when the whole industry had agreed that agents were the next battleground, a team of a few dozen people became one of the first to ship a universal agent product. The team also stands out in product engineering and front-end interaction design.
Positive feedback from shipping something is better than anything else; there is no better incentive for a startup team. But before that, how did Manus come about? Why was it this team that made it?
“Today’s models are capable of completing some complex, multi-step tasks. But there are no such products, so nobody can feel it.” This insight, which Xiao Hong shared in previous interviews with Tencent Technology, helps explain the question.
At the same time, “Not many teams have the opportunity to try building agent products, because it requires a combination of many abilities. You need to have worked on chatbots, on AI programming, and on browsers, because the agent has to call a browser; and you need a good sense of the boundaries of LLMs: what level they have reached today and what level they will reach next. First of all, not many companies have all of these capabilities at once, and those that do may be busy with a very specific business. Some of our colleagues happened to have time to do these things together.”
“Happened to.”
The “Butterfly Effect” team had gathered all the elements needed to build such a universal agent, so there is now a universal agent with a relatively high degree of completion by industry standards.
When asked what the decisive moment was when he wanted to start Manus, Peak filled in more details. He said, “There is actually no ‘clean’ pivot in entrepreneurship.” Everything is continuous, with no clear boundaries.
“When making a product, I also frequently pay attention to what is happening outside. There were a few things at that time. First, when I was making the browser, I built an on-device model. Later I found that the browser had to cover a very wide range of scenarios with different characteristics. In the process, I discovered that base models were getting stronger at an accelerating rate, and that the gap between them and an agent might be an alignment problem, even though the outside world may have felt that large language models had gradually converged and hit a wall.”
At the same time, the outside world was also changing. Cursor took off early last year, followed by Windsurf and Devin. They all follow the same thread: agents became popular in programming first, and the path was progressive. Cursor is a copilot for programmers that improves programming efficiency. Windsurf began to introduce automated processes, giving you stronger automation on your local machine. Devin reached a new level of automation.
Venture capital trends point the same way. For example, over the past two years YC invested in two types of companies: cloud browsers, such as Browserbase, and lightweight AI sandbox virtual machines similar to E2B.
This shows that “models are maturing rapidly, and the supporting infrastructure is maturing rapidly as well. In addition, seeing that external products were gradually gaining acceptance, we felt that this was a direction worth going all-in on. It was a very gradual and smooth process. Moreover, the infrastructure accumulated while developing the browser, such as Chromium, could be migrated over seamlessly, which is why we dared to run browsers in the cloud.”
In summary, Manus grew out of the keen sense of user needs and of models, and the experience, accumulated in the so-called “shell”. Many of Monica’s scenarios required post-training of models. At the same time, the most important lesson, “less structure, more intelligence”, was reinforced in the practice of building the AI browser. The team found that the model’s ability had reached the level needed for an agent, and that the remaining problem was alignment. What followed was three months of rapid evolution for Manus.
Previously, the “Butterfly Effect” team was questioned about the value of building a “shell”. It built Monica by integrating existing large models rather than developing its own. Monica combined functions such as chat, search, reading, writing, and translation, and integrated many task-execution scenarios one by one through APIs. By the end of last year, it had tens of millions of users.
Now, when Doubao, Quark, and Yuanbao are all vigorously promoting their own Monica-like products, and when a small team has used existing technology to create the first general consumer-level agent, it is time to re-understand the “shell”.
What exactly is a “shell”, and what does it mean to build one?
In Xiao Hong’s view, all breakthroughs are brought about by models; progress is basically model-driven and model-first. A shell presents the model’s technical innovations in a way users can perceive, packaging the model’s new capabilities in the form users can best feel.
By this definition, the DeepSeek App (including its display of the chain of thought) is a shell of DeepSeek-R1, Cursor is a shell of Anthropic’s Claude 3.5 Sonnet, Perplexity is a shell of GPT-4, and ChatGPT is a shell of InstructGPT.
As model capabilities evolve rapidly, “that shell” also needs to evolve. After each generation of model capabilities arrives, it is not necessarily the original vendor that presents the user-perceivable value; it may be a third party, just as Cursor brought user-perceivable value to Claude 3.5 Sonnet.
March 5 marked the second anniversary of Monica.im’s release. The reason these few dozen people achieved a product experience exceeding that of the various Deep Research products and OpenAI’s Operator lies in their understanding and practice of the shell.
How to make the best shell for a new model that can be used as an agent?
As the builder of Manus, Zhang Tao believes, “Looking at its entire architecture from the background, we see that there is a lot of unfinished work to be done in every place, and each of those places is the key to success, and they are all places that make the product surface different.”
From the team’s perspective, the most important advantage is the pace of innovation. Both applications and models have now reached a state of relative saturation, while the “data flywheel” and “network effects” have not yet been verified, so the only real core capability in the end is to run fast.
“In a brand new field, everything is uncertain and unknown. The most important thing is the speed of innovation. What we strive for is exploration, trial and error in various directions, and quickly finding the right path.” The Manus team is flexible in its management philosophy, organizational structure, and workflows. When a new opportunity arises, it can mobilize the whole company despite limited resources, make decisions very quickly, and adjust based on feedback from its mistakes.
From left to right are “Butterfly Effect” chief scientist Peak, CEO Xiao Hong, and product partner Zhang Tao | Image source: Internet
As for expectations of Manus, Xiao Hong believes that “even if there is only a window of time, it is worth giving it a try.” Over the past year, his thinking has changed dramatically. For example, he now believes that “when you realize that you are ahead, you should be more aggressive, super aggressive. Looking back today, I feel that Monica in 2023 was not aggressive enough.” “If you know that you are innovating and you are leading, you should be aggressive.”
I don’t know whether Manus can give Xiao Hong and his team the experience of leaping from 1 to N, but this team, which understands the “shell” best, believes in creating with mind and hand as one, and in the butterfly effect that creation brings. The name Manus comes from MIT’s motto, Mens et Manus, which emphasizes the unity of mind and hand. Thinking alone is not enough; you have to act, and only what can have an impact on the real world is real knowledge.
In the future, as more of the accumulated work behind Manus is open-sourced, a wider butterfly effect will be set in motion.
This article is reproduced from [GEEEKPARK], and the copyright belongs to the original author [Wan Chen], if you have any objection to the reprint, please contact Gate Learn team, the team will handle it as soon as possible according to relevant procedures.