How Apple could transform your iPhone forever

  • AI
  • September 27, 2024


Joe Maring / Digital Trends
Over the past few months, Apple has steadily released research papers detailing its work on generative AI. Even so, the company has been tight-lipped about what exactly is brewing in its research labs, while rumors suggest it is in talks with Google to license the Gemini AI for iPhones.

There have been a few glimpses of what to expect. In February, an Apple research paper described an open-source model called MLLM-Guided Image Editing (MGIE) that can perform media editing using natural language instructions from users. Now, another research paper on Ferret UI has set the AI community abuzz.

The idea is to deploy a multimodal AI (one that understands both text and multimedia assets) to better understand the elements of a mobile user interface. The most crucial goal is to provide actionable guidance. This is a significant milestone as engineers strive to move AI past its current “parlor trick” status and make it genuinely useful for the average smartphone user.

In this pursuit, the major push is to decouple generative AI capabilities from the cloud, eliminate the need for an internet connection, and execute everything on the device itself, making it faster and more secure. For example, Google’s Gemini Nano already runs locally on Google Pixel and Samsung Galaxy S24 series phones (with OnePlus phones to follow), performing tasks like summarization and translation.

What is Apple’s Ferret UI?


Apple
With Ferret UI, Apple aims to combine the intelligence of a multimodal AI model with iOS. Currently, the focus is on more “elementary” tasks such as “icon recognition, finding text, and listing widgets.” However, the goal is not just to recognize what is displayed on an iPhone’s screen, but to comprehend it logically and use that reasoning to answer contextual queries posed by users.

The easiest way to describe Ferret UI’s capabilities is as an intelligent optical character recognition (OCR) system powered by AI. “After training on curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the ability to execute open-ended instructions,” notes the research paper. The team behind Ferret UI has fine-tuned it to accommodate “any resolution.”

You can ask questions like “Is this app safe for my 12-year-old kid?” while browsing the App Store. In such situations, the AI will read the app’s age rating and answer accordingly. How the answer would be delivered, whether in text or audio, isn’t specified, as the paper doesn’t mention Siri or any virtual assistant.
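To make that interaction concrete, here’s a minimal sketch of what querying such a model from code might look like. Everything in it (the ScreenUnderstandingModel protocol, GroundedRegion, and ScreenAnswer) is hypothetical; the paper describes the model’s tasks, not a public API.

```swift
import UIKit

// Hypothetical sketch only: Apple has not published an API for Ferret UI.
// This illustrates the *shape* of a screen-grounded question and answer.

/// A region of the screenshot the model grounded its answer in.
struct GroundedRegion {
    let label: String        // e.g. "age rating badge"
    let boundingBox: CGRect  // location within the screenshot
}

/// The model's reply: free-form text plus the regions it refers to.
struct ScreenAnswer {
    let text: String
    let regions: [GroundedRegion]
}

/// Hypothetical interface to a Ferret UI-style multimodal model.
protocol ScreenUnderstandingModel {
    func answer(question: String, screenshot: UIImage) async throws -> ScreenAnswer
}

/// Usage sketch: ask about the App Store page currently on screen.
func checkAgeRating(model: ScreenUnderstandingModel, screenshot: UIImage) async throws {
    let reply = try await model.answer(
        question: "Is this app safe for my 12-year-old kid?",
        screenshot: screenshot
    )
    print(reply.text) // e.g. "The app is rated 12+, so it should be suitable."
    for region in reply.regions {
        print("\(region.label) at \(region.boundingBox)") // where the rating was read
    }
}
```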

Apple didn’t fall too far from the GPT tree


Apple
But the ideas go well beyond basic recognition. Ask it “How can I share the app with a friend?” and the AI will highlight the “share” icon on the screen. It will give you an overview of what’s on the screen, but at the same time, it will logically analyze the visual assets it contains, such as boxes, buttons, pictures, and icons. This is a significant accessibility improvement.

If you want the technical terms, the paper refers to these capabilities as “perception conversation,” “functional inference,” and “interaction conversation.” One passage neatly sums up Ferret UI’s potential, describing it as “the first MLLM designed to execute precise referring and grounding tasks specific to UI screens, while adeptly interpreting and acting upon open-ended language instructions.”

As a result, it can describe screenshots, tell what a particular asset does when tapped, and determine whether something on the screen responds to touch input. Ferret UI is not a standalone project: for the reasoning and description aspects, it relies on OpenAI’s GPT-4, the same technology that powers ChatGPT and a host of other conversational products.
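Framed as code, the task families described in the paper might look something like the hypothetical enum below. None of these names come from a real Apple API; the paper defines the tasks, not an interface.

```swift
import CoreGraphics

// Hypothetical task taxonomy mirroring the paper's elementary and advanced
// tasks: describing the screen, locating text, explaining what a tap does,
// and checking whether an element is interactive.
enum ScreenTask {
    case describeScreen              // overall screenshot description
    case findText(String)            // locate a string on screen
    case explainTap(at: CGPoint)     // what happens if the user taps here?
    case isInteractive(at: CGPoint)  // does this point respond to touch?
}

/// What a response to any of these tasks might carry.
struct ScreenTaskResult {
    let explanation: String  // natural-language answer
    let regions: [CGRect]    // screen regions relevant to the task, if any
}
```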

Notably, the specific version proposed in the paper is suitable for multiple aspect ratios. In addition to its on-screen analysis and reasoning capabilities, the research paper also describes a few advanced capabilities that are fascinating to imagine. For example, in the screenshot below, it seems capable of not only analyzing handwritten text, but also predicting the intended word from the user’s misspelled scribble.

It is also able to accurately read text that is cut off at the top or bottom edge and would otherwise require a vertical scroll. However, it’s not perfect. On occasion, it misidentifies a button as a tab and misreads assets that combine images and text into a single block.

When compared to OpenAI’s GPT-4V model, Ferret UI delivered impressive conversation interaction outputs when asked questions related to the on-screen content. As can be seen in the image below, Ferret UI prefers more concise and straightforward answers, while GPT-4V writes more detailed responses.

The choice is subjective, but if I were to ask an AI, “How do I buy the slipper appearing on the screen?”, I would prefer it to simply give me the right steps in as few words as possible. Ferret UI performed admirably on both conciseness and accuracy: in that task, it scored 91.7% on conversation interaction outputs, while GPT-4V came in only slightly ahead at 93.4%.

A universe of intriguing possibilities


Apple
Ferret UI marks a remarkable debut for AI that can understand on-screen content. Before we get too excited about the possibilities, though, it’s worth noting the uncertainties. Bloomberg recently reported that Apple knows it is behind in the AI race, as evidenced by the lack of native generative AI products in its ecosystem.

First, the rumors that Apple is even considering licensing deals with Google (for Gemini) or OpenAI suggest that its own work is not yet on par with the competition’s. In such a scenario, leveraging the work Google has already done with Gemini (which is now poised to replace Google Assistant on phones) would be a wiser choice than pushing a half-baked AI product on iPhones and iPads.

Apple clearly has ambitious ideas and is continuously working on them, as demonstrated by the experiments detailed in multiple research papers. However, even if Apple manages to fulfill Ferret UI’s promises within iOS, it would still amount to a superficial implementation of on-device generative AI.

Functional integrations, even if limited to in-house preinstalled apps, could yield amazing results. For example, let’s say you are reading an email while the AI has already analyzed the on-screen content in the background. As you read the message in the Mail app, you can ask the AI with a voice command to create a calendar entry from it and save it to your schedule.
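As a sketch of the calendar half of that flow: once the model has (hypothetically) extracted a title and a time from the on-screen email, saving the entry is standard EventKit. Only the extraction step is imagined here; the calendar calls below are real iOS APIs.

```swift
import EventKit

/// Details a Ferret UI-style model might extract from an on-screen email.
/// The extraction itself is the hypothetical part.
struct ExtractedEvent {
    let title: String
    let start: Date
    let end: Date
}

func saveToCalendar(_ extracted: ExtractedEvent) {
    let store = EKEventStore()
    // On iOS 17+, requestFullAccessToEvents(completion:) is the newer call.
    store.requestAccess(to: .event) { granted, error in
        guard granted, error == nil else { return }

        let event = EKEvent(eventStore: store)
        event.title = extracted.title
        event.startDate = extracted.start
        event.endDate = extracted.end
        event.calendar = store.defaultCalendarForNewEvents

        do {
            try store.save(event, span: .thisEvent)
        } catch {
            print("Could not save event: \(error)")
        }
    }
}
```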

It doesn’t necessarily have to be a complex multistep chore involving multiple apps. Say you are looking at a Google Search knowledge page for a restaurant, and simply saying “call the place” causes the AI to read the on-screen phone number, copy it to the dialer, and start a call.
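At the system level, the “call the place” step could reduce to something as plain as opening a tel: URL. Reading the number off the screen is the hypothetical Ferret UI part; dialing it is a standard iOS capability.

```swift
import UIKit

/// Dial a phone number the model has (hypothetically) read from the screen.
func call(numberReadFromScreen: String) {
    // Keep only digits and a leading "+" so the URL is well-formed.
    let digits = numberReadFromScreen.filter { "0123456789+".contains($0) }
    guard let url = URL(string: "tel://\(digits)") else { return }
    UIApplication.shared.open(url) // hands off to the system dialer
}
```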

Or, let’s say you read a tweet about a film premiering on April 6 and tell the AI to create a shortcut to the Fandango app. Or a post about a beach in Vietnam inspires your next solo trip, and simply saying “book me a ticket to Con Dao” takes you to the Skyscanner app with the search fields already filled in.

But all of this is easier said than done and depends on multiple variables, some of which may be beyond Apple’s control. For example, webpages full of pop-ups and intrusive ads would make it extremely difficult for Ferret UI to perform its job. However, on the positive side, iOS developers strictly adhere to the design guidelines set by Apple, so it is likely that Ferret UI will work more effectively on iPhone apps.

This would still be a significant achievement. And since we are talking about an on-device implementation tightly integrated at the OS level, it is unlikely that Apple will charge for the convenience, unlike mainstream generative AI products such as ChatGPT Plus or Microsoft Copilot Pro. Will iOS 18 finally give us a glimpse of a reimagined, AI-powered iOS? We’ll have to wait until Apple’s Worldwide Developers Conference 2024 to find out.
