NVIDIA
NVIDIA, a dominant player in the AI industry known for building some of the most sought-after GPUs, has made a significant announcement: it has released an open-source large language model that is reportedly on par with leading proprietary models from giants like OpenAI, Anthropic, Meta, and Google. In a recently released white paper, the company introduced its new NVLM 1.0 family, led by the 72-billion-parameter NVLM-D-72B model. The researchers wrote, “We introduce NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models.”
Introducing NVLM 1.0, a family of frontier-class multimodal LLMs that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., InternVL 2). Remarkably, NVLM 1.0 shows improved text-only… pic.twitter.com/yKGyOqHnsp
— Wei Ping (@_weiping) September 18, 2024
The new model family is said to already offer “production-grade multimodality,” delivering strong performance across a wide range of vision and language tasks, along with improved text-based responses compared to its underlying base LLM. “To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities,” the researchers explained. The result is an LLM that can effortlessly explain why a meme is funny or solve a complex mathematics equation, step by step. Thanks to this multimodal training approach, Nvidia also managed to increase the model’s text-only accuracy by an average of 4.3 points across common industry benchmarks.
Nvidia seems serious about ensuring that this model meets the Open Source Initiative’s newest definition of “open source,” not only making its model weights available for public review but also promising to release the model’s source code in the near future. This is a marked departure from the practices of rivals like OpenAI and Google, which closely guard their LLMs’ weights and source code. In doing so, Nvidia has positioned the NVLM family not to compete directly with GPT-4o and Gemini 1.5 Pro, but rather to serve as a foundation for third-party developers to build their own chatbots and AI applications.