ByteDance’s UI-TARS can take over your PC, outperforms GPT-4o and Claude



A new AI agent has emerged from TikTok’s parent company to take control of your computer and perform complex workflows.

Much like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step actions.

Trained on approximately 50B tokens and available in 7B and 72B parameter versions, the PC/MacOS agents achieve state-of-the-art (SOTA) performance on more than 10 GUI benchmarks spanning performance, perception, grounding and overall agent capabilities, consistently beating OpenAI’s GPT-4o, Anthropic’s Claude and Google’s Gemini.

“Through iterative learning and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to contingencies with minimal human intervention,” researchers from ByteDance and Tsinghua University write in a new research paper.


How UI-TARS explains its thinking

UI-TARS works in desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments.

Its interface features two tabs: one on the left showing its step-by-step “thinking” and a larger one on the right where it pulls up files, websites and apps and automatically takes action.

For example, in a demo video posted today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and return on the 10th of the next month and filter by price in ascending order.”

In response, UI-TARS navigates to the Delta Airlines website, fills in the “from” and “to” fields, clicks on the relevant dates, and sorts and filters by price, explaining each step in its thinking box before taking action.

Another scenario instructs the model to install the autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes the task (a conceptual sketch of this loop appears after the list):

  • It first reports that “to proceed with installing the autoDocstring extension in VS Code, I must first open the VS Code application.”
  • Acknowledging that the VS Code window is not fully loaded, the model notes that it “should wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all necessary functions are available for the next steps.”
  • It then explains that it needs to access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar.
  • Running into a small snag, it reasons: “The previous attempt to click the Extensions tab may not have been precise enough, so I’ll try clicking it again to make sure the Extensions view opens successfully.”
  • Once in the Extensions view, it types in “autoDocstring” before concluding: “Since the installation process has already started, no further action is required at this time. I will wait for the installation to complete and confirm its success once the process is complete.”
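The paper does not publish the agent’s control code, but the behavior in these demos suggests a simple perceive-think-act loop: capture the screen, have the model produce a thought and an action, execute the action, and repeat until the model declares the task finished. Below is a minimal, hypothetical Python sketch of that loop; `capture_screen`, `model.generate_step` and `execute` are illustrative stand-ins, not functions from any published UI-TARS API.

```python
# Hypothetical sketch of the perceive-think-act loop implied by the demos.
# capture_screen(), model.generate_step() and execute() are illustrative
# stand-ins, not part of a published UI-TARS API.
from dataclasses import dataclass

@dataclass
class Step:
    thought: str   # the "thinking" shown in the left-hand tab
    action: str    # e.g. 'click(element="Extensions tab")' or 'type("autoDocstring")'

def run_task(instruction: str, model, capture_screen, execute, max_steps: int = 25) -> list[Step]:
    history: list[Step] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                                  # observe the GUI
        step = model.generate_step(instruction, screenshot, history)   # reason, then pick an action
        history.append(step)
        if step.action == "finished":                                  # model judges the task complete
            break
        execute(step.action)                                           # click, type, scroll, wait, ...
    return history
```

Keeping the full history of thoughts and actions in the prompt is what lets the model notice, as in the example above, that a previous click “may not have been precise enough” and retry.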

Outdoing the competition

Across various benchmarks, the researchers report that UI-TARS consistently outperforms OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and multiple academic models.

For example, on VisualWebBench, which measures a model’s ability to ground web elements including webpage question answering (QA) and optical character recognition, UI-TARS-72B scored 82.8%, beating GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also performed significantly better on the WebSRC (understanding semantic content and layout in web contexts) and ScreenQA-short (understanding complex mobile screen layouts and web structure) benchmarks. UI-TARS-7B achieved a leading score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, beating Qwen, Gemini, Claude 3.5 and GPT-4o.

“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers wrote. “Such perceptual ability lays the foundation for agent tasks where accurate understanding of the environment is critical to task performance and decision-making.”

UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which evaluate a model’s ability to understand and localize GUI elements. The researchers additionally tested its ability to plan multi-step actions and carry out low-level tasks in mobile environments, and benchmarked it on OSWorld (which evaluates open-ended computer tasks) and AndroidWorld (which evaluates autonomous agents on 116 programmatic tasks across 20 mobile apps).


Under the hood

To help it take step-by-step actions and recognize what it sees, UI-TARS was trained on a large-scale dataset of screenshots annotated with metadata including element type and description, visual description, bounding boxes (position information), element function and text content, drawn from a variety of websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only the elements but also their spatial relationships and the overall layout.
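The paper does not publish a concrete schema for these annotations, so the dataclasses below are only an illustrative guess at what one annotated screenshot record might contain; all field names are assumptions, not the paper’s actual format.

```python
# Illustrative structure for one annotated screenshot; field names are
# assumptions based on the description above, not UI-TARS's actual schema.
from dataclasses import dataclass

@dataclass
class ElementAnnotation:
    element_type: str          # e.g. "button", "text_field", "tab"
    description: str           # functional description ("opens the Extensions view")
    visual_description: str    # appearance ("square icon in the left sidebar")
    bounding_box: tuple[int, int, int, int]  # (x1, y1, x2, y2) position on screen
    text: str                  # visible text content, if any

@dataclass
class ScreenshotSample:
    image_path: str
    platform: str              # "windows", "macos", "android", "web", ...
    elements: list[ElementAnnotation]
    layout_summary: str        # overall layout and spatial relationships
```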

The model also uses state transition labels to identify and describe the differences between two consecutive screenshots and to determine whether an action, such as a mouse click or keyboard input, has taken place. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.
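Set-of-mark prompting is commonly implemented by drawing an indexed label on each candidate region before the image is sent to the model, so the model can refer to regions by mark rather than by raw coordinates. The sketch below is a generic illustration using Pillow, not ByteDance’s implementation.

```python
# Generic illustration of set-of-mark (SoM) style prompting: overlay an index
# on each element's bounding box so the model can refer to regions by mark.
# This is not UI-TARS's actual code.
from PIL import Image, ImageDraw

def overlay_marks(image_path: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)  # box the element
        draw.text((x1 + 2, y1 + 2), str(i), fill="red")           # numeric mark
    return img
```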

The model is equipped with both short-term and long-term memory to handle current tasks while preserving historical interactions to improve later decision-making. The researchers trained the model to perform both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-stage decision-making, “reflective” thinking, stage recognition and error correction.
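The paper describes this memory split only at a high level. As a toy illustration (an assumption about structure, not the actual architecture), short-term memory can be modeled as a bounded buffer of recent steps in the current task, with long-term memory holding summaries of past interactions.

```python
# Toy illustration of the short-term / long-term memory split described above;
# this is an assumption about structure, not UI-TARS's implementation.
from collections import deque

class AgentMemory:
    def __init__(self, short_window: int = 5):
        # Short-term memory: the most recent (thought, action) pairs of the current task.
        self.short_term: deque[tuple[str, str]] = deque(maxlen=short_window)
        # Long-term memory: summaries of previously completed interactions.
        self.long_term: list[str] = []

    def record_step(self, thought: str, action: str) -> None:
        self.short_term.append((thought, action))

    def archive_task(self, summary: str) -> None:
        self.long_term.append(summary)
        self.short_term.clear()  # start the next task with a fresh working context
```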

The researchers emphasized that it is critical for the model to maintain consistent goals and engage in trial and error to hypothesize, test and evaluate potential actions before completing a task. They introduced two types of data to support this: error-correction data and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they simulated recovery steps.
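The paper describes these two data types conceptually; the records below are only one plausible way to represent them, with field names that are illustrative rather than taken from the paper.

```python
# Illustrative records for the two data types described above; structure and
# field names are assumptions, not the paper's published format.
from dataclasses import dataclass

@dataclass
class ErrorCorrectionSample:
    state: str               # screenshot / context in which the mistake happened
    erroneous_action: str    # the action the agent actually took
    corrected_action: str    # the action a labeler marked as correct

@dataclass
class PostReflectionSample:
    state_after_error: str       # the situation the mistake produced
    reflection: str              # the agent's analysis of what went wrong
    recovery_actions: list[str]  # simulated steps that get the task back on track
```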

“This strategy ensures that the agent not only learns to avoid errors, but also dynamically adapts when they occur,” the researchers wrote.

The researchers state that Claude Computer Use “performs strongly on web-based tasks, but struggles significantly in mobile scenarios, indicating that Claude’s GUI capability does not transfer well to the mobile domain.”

In contrast, “UI-TARS shows excellent performance in both the website and mobile domains.”

Clearly, UI-TARS shows impressive capabilities, and it will be interesting to see its evolving use cases in the increasingly competitive AI agent space. As the researchers note: “Looking forward, while native agents represent a significant leap forward, the future lies in integrating active learning and lifelong learning, where agents autonomously manage their own learning through continuous real-world interactions.”


 