With a score of 90.0% on the MMLU (massive multitask language understanding) test, it’s the first model to outperform human experts (89.8%), as well as GPT-4 (86.4%), on knowledge and problem-solving tasks across 57 subjects including math, physics, history, law, medicine and ethics. That’s experts, not the average human.
Gemini is multimodal from the ground up – meaning that its original training data set contained a ton of other media in addition to text. Thus, you could say it’s as fluent in visual and auditory “understanding” as it is with text. Where other language models have tended to “think” in textual terms when looking at video and images, Gemini retains all the tone and nuance of the original video, audio and image sources.
While the video below is a slick product demo, and thus should be taken with a large grain of salt, it’s worth watching to give you a sense of what this multimodality really means.
What’s the upshot here? Well, AIs are being trained with wider and wider sensory datasets, to mimic the processes by which humans learn to interact with the world. With next-level visual and auditory understanding, Gemini’s perception and reasoning take a step forward. Once this thing lands in Google devices – beginning with the next Pixel phones – it’ll be able to help with all sorts of daily tasks.
And as Google DeepMind CEO Demis Hassabis tells Wired, this will soon extend into the next logical sensory realm: touch and tactile feedback. Google is already a major player in AI robotics, but embedding a super-knowledgeable model like Gemini with the ability to understand the world through touch will take robotics – humanoid and otherwise – into uncharted territory.
Multimodality is far from the only banner feature here, but as with GPT-4, Gemini is such an anything machine that it’s hard to know where to start. Perhaps with the contributions it could make to science? In the video below, DeepMind scientists demonstrate how Gemini is able to generate its own code to read and interpret 200,000 scientific studies, filtering them for relevance using its own reasoning capabilities, and then collate data and effectively create new meta-knowledge. The team says it did this all over their lunch break, and that the approach will be relevant to other fields, like law, in which huge datasets need to be examined.
Speaking of coding, Gemini is fluent in Python, Java, C++ and Go programming. Indeed, Google is already showing off how it can create websites that dynamically code themselves as you use them, in response to what you seem to want from them. This feels like a whole new approach to the internet; you go to a single page that grows into what you need as soon as it figures out what that is.
The demo video here shows a pretty lightweight use case: planning a kid’s birthday party. But you can see the extraordinary power it encapsulates, and imagine how it might create graphical user interfaces for nearly any task. This is the sort of thing only AI can do; it’s like having a web app programmer sitting right next to you, but capable of working hundreds of times faster.
And as with any AI tool, it’s super interactive; if it’s not giving you exactly what you want, you can just tell it, and it’ll adjust itself to fit your desires, or engage in a conversation about the best way to proceed. Stunning stuff, and a glimpse into how our interactions with technology are fundamentally shifting.
On the topic of coding, DeepMind has done some other interesting work with Gemini in a project called AlphaCode 2 (warning: link is a PDF technical report), which takes several different Gemini models and trains them specifically in different parts of the programming process.
In essence, AlphaCode 2 creates a swarm of programming agents, and gets them to generate up to a million different chunks of code to solve a problem. It then uses a separate Gemini model to examine these code samples, check if they compile, and rank them on how well they do their portion of the overall coding work, discarding around 95% of the samples created.
Then, another Gemini model develops a code-testing regime and sample test data, and runs a thorough testing process on all the remaining code samples, ranking them on “correctness,” to find the top pieces of code. Effectively, DeepMind has split Gemini into a multifunctional software team, with specialist AIs working on requirements analysis, system design, testing, deployment and maintenance as well as a giant army of coders.
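To make that generate, filter and rank loop a little more concrete, here's a toy sketch in Python. It is emphatically not AlphaCode 2's code: the "models" are replaced by a hard-coded pool of candidate programs, the task (sum a list) is trivial, and the ranking step simply favors candidates whose outputs agree with the majority, which is an assumption standing in for the learned scoring described above. Every name in it (`CANDIDATE_POOL`, `generate`, `load`, `rank`, `pipeline`) is hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the code-writing models: a fixed pool of
# candidate programs for the toy task "return the sum of a list".
CANDIDATE_POOL = [
    "def solve(xs): return sum(xs)",
    "def solve(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t",
    "def solve(xs): return max(xs)",         # wrong on most inputs
    "def solve(xs): return xs[0] + xs[-1]",  # right only by coincidence
    "def solve(xs) return sum(xs)",          # syntax error, will not compile
]

PUBLIC_TESTS = [([1, 2], 3)]                    # example shipped with the problem
EXTRA_INPUTS = [[1, 2, 3], [4, 0, 6, 1], [7]]   # assumed "generated" test inputs


def generate(n):
    """Produce n candidates by cycling through the pool (no real model here)."""
    return [CANDIDATE_POOL[i % len(CANDIDATE_POOL)] for i in range(n)]


def load(source):
    """Compile a candidate and return its solve() function, or None on failure.

    exec() on untrusted code is unsafe; it is acceptable only in this toy.
    """
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
    except SyntaxError:
        return None
    return namespace.get("solve")


def passes_public_tests(fn):
    """True if the candidate reproduces every public example."""
    try:
        return all(fn(list(xs)) == want for xs, want in PUBLIC_TESTS)
    except Exception:
        return False


def behaviour(fn):
    """Record what a candidate outputs on each extra input."""
    outputs = []
    for xs in EXTRA_INPUTS:
        try:
            outputs.append(fn(list(xs)))
        except Exception:
            outputs.append(None)
    return tuple(outputs)


def rank(sources):
    """Order distinct candidates by how many samples share their behaviour."""
    signatures = {src: behaviour(load(src)) for src in sources}
    votes = Counter(signatures[src] for src in sources)
    unique = list(dict.fromkeys(sources))  # de-duplicate, keep first-seen order
    return sorted(unique, key=lambda src: -votes[signatures[src]])


def pipeline(n):
    samples = generate(n)
    compiled = [s for s in samples if load(s) is not None]
    survivors = [s for s in compiled if passes_public_tests(load(s))]
    return rank(survivors)


if __name__ == "__main__":
    for position, source in enumerate(pipeline(10), start=1):
        print(position, repr(source))
```

Even at this scale you can see the shape of the idea: most of the work is throwing candidates away cheaply, so the expensive judgment only has to be applied to the few that survive.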
How does it perform? Well, in a coding competition against humans, it beat 87% of other entrants, ranking it “just between the ‘Expert’ and ‘Candidate Master’ categories on Codeforces.” As DeepMind scientists explain in the video below, these kinds of contests require a ton more than just coding skills – they require extraordinary degrees of rational understanding and creative use of the available software tools.
Mind you, AlphaCode 2 isn’t going to be available to the public immediately, or indeed ever in its current form. Generating a million code snippets, as you might imagine, burns a ton of computing power and is way too expensive for general release. But what’s interesting here is that the success rate doesn’t appear to have tapered off at a million snippets – indeed, it seems that AlphaCode 2 would continue to improve its results if it went well into the billions, or trillions. That’s an incredibly inefficient way to do things, but with the blinding speed of progress in this area, a smarter way is sure to come along very soon.
DeepMind says it’s looking at how a streamlined version can be brought into the public models.
There’s more; there’s a ton more. But this should give you a sense of what Google is promising here. Google is planning to release Gemini in three model sizes: Gemini Nano, built for installation right on board mobile devices; Gemini Pro, a rough equivalent of GPT-3.5, which will be the main workhorse model for most tasks; and Gemini Ultra, the largest model, which Google says beats GPT-4 handily across a broad swathe of benchmark tests – gapping it even more substantially on multimodal testing than on text-based challenges.
Gemini Ultra is scheduled for public launch next year, once it’s been more thoroughly vetted for safety and alignment issues. That’s when we’ll start getting a proper sense for where it outshines GPT and where it’s just not up to snuff. Gemini Nano, meanwhile, is already available on the Pixel 8 Pro smartphone, and will begin rolling out on others.
Gemini Pro, though, is available right now, for free, to anyone with a Google account through the Google Bard service. It’s a slimmed-down version, unfortunately, with only the ability to upload images rather than documents, audio or video, but Google says it’ll gain new capabilities soon. It’s already got access, with your permission, to operate on your Gmail, Google Drive and Google Docs, as well as flight and hotel bookings, Google Maps, and YouTube, where it allows you to interact with and ask questions about videos.
And yep, Google is working to integrate the Gemini model into pretty much every product it makes.
Buckle up, y’all, this roller coaster only knows how to accelerate.