Just in time for Halloween 2024, Meta has announced Meta Spirit LM, the company’s first open source multimodal language model that seamlessly integrates text and speech input and output.
As such, it competes directly with other multimodal models such as OpenAI’s GPT-4o (also natively multimodal) and Hume’s EVI 2, as well as dedicated text-to-speech and speech-to-text offerings such as ElevenLabs.
Designed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to address the limitations of existing AI voice experiences by delivering more expressive and natural-sounding speech generation, while learning tasks across modalities such as automatic speech recognition (ASR), text-to-speech (TTS), and speech classification.
Unfortunately for entrepreneurs and business leaders, the model is currently available only for non-commercial use under Meta’s FAIR Non-Commercial Research License. The license grants users the right to use, reproduce, modify, and create derivative works of the Meta Spirit LM models, but only for non-commercial purposes. Any distribution of these models or their derivatives must also comply with the non-commercial restriction.
A new approach to text and audio
Traditional voice AI models rely on automatic speech recognition to transcribe spoken input, pass the transcript to a language model, and then convert the model’s text output back into speech using text-to-speech technology.
Although this process is effective, it often sacrifices the expressive qualities inherent in human speech, such as tone and emotion. Meta Spirit LM introduces a more advanced solution by incorporating phonetic, pitch, and tone tokens to overcome these limitations.
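To see why the cascaded approach loses expressiveness, consider a minimal Python sketch. It is a toy illustration only: `Utterance`, `toy_asr`, `toy_language_model`, and `toy_tts` are hypothetical stand-ins invented for this example, not part of Spirit LM or any real speech library. The point is that once speech is reduced to a transcript, two clips with the same words but different emotions become indistinguishable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for an audio clip: the words plus how they were said."""
    words: str
    emotion: str  # hypothetical label such as "excited" or "sad"


def toy_asr(audio: Utterance) -> str:
    # A transcript keeps only the words; tone and emotion are dropped here.
    return audio.words


def toy_language_model(prompt: str) -> str:
    # Placeholder for a text-only language model.
    return f"You said: {prompt}"


def toy_tts(text: str) -> Utterance:
    # The synthesizer sees only text, so it falls back to a neutral delivery.
    return Utterance(words=text, emotion="neutral")


def cascaded_pipeline(audio: Utterance) -> Utterance:
    # ASR -> language model -> TTS, with plain text as the only link between stages.
    return toy_tts(toy_language_model(toy_asr(audio)))


excited = Utterance("we won the game", "excited")
sad = Utterance("we won the game", "sad")

# Both inputs collapse to the same transcript, so the replies are identical.
same_output = cascaded_pipeline(excited) == cascaded_pipeline(sad)
```

In this sketch `same_output` is `True`: the emotion of the speaker never reaches the language model, which is the limitation that pitch and tone tokens are meant to address.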
Meta has released two versions of Spirit LM.
• Spirit LM Base: Processes and generates speech using phonetic tokens.
• Spirit LM Expressive: Includes additional tokens for pitch and tone, allowing the model to capture more subtle emotional states such as excitement or sadness and reflect them in the generated audio.
Both models are trained on a combination of text and speech datasets, which allows Spirit LM to perform cross-modal tasks such as speech-to-text and text-to-speech while maintaining the natural expressiveness of speech in its output.
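The difference between the two versions can be pictured with a small Python sketch. This is an assumption-laden toy, not Meta’s tokenizer: the token spellings (`[UNIT12]`, `[PITCH3]`, `[TONE:excited]`), the `Frame` layout, and both encoder functions are hypothetical and chosen only to show how extra pitch and tone tokens could sit alongside the speech-unit tokens.

```python
from typing import List, Tuple

# Each toy frame is (speech_unit_id, pitch_bucket, tone_label).
# This layout is hypothetical and only serves the illustration.
Frame = Tuple[int, int, str]


def encode_base(frames: List[Frame]) -> List[str]:
    # Base-style stream: only the speech-unit tokens are kept.
    return [f"[UNIT{unit}]" for unit, _pitch, _tone in frames]


def encode_expressive(frames: List[Frame]) -> List[str]:
    # Expressive-style stream: pitch and tone tokens are added to the units.
    tokens: List[str] = []
    last_tone = None
    for unit, pitch, tone in frames:
        if tone != last_tone:
            # Emit a tone token only when the tone changes.
            tokens.append(f"[TONE:{tone}]")
            last_tone = tone
        tokens.append(f"[PITCH{pitch}]")
        tokens.append(f"[UNIT{unit}]")
    return tokens


frames = [(12, 3, "excited"), (7, 4, "excited")]
base_tokens = encode_base(frames)
expressive_tokens = encode_expressive(frames)
```

Here `base_tokens` contains two unit tokens, while `expressive_tokens` carries the same units plus the pitch and tone information a model would need to reproduce the speaker’s delivery.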
Open source non-commercial — available for research purposes only
In line with Meta’s commitment to open science, the company has made Spirit LM fully open source, providing researchers and developers with the model weights, code, and supporting documentation to build upon.
Meta hopes that the open nature of Spirit LM will encourage the AI research community to explore new ways to integrate speech and text in AI systems.
The release also includes a research paper that describes the model’s architecture and capabilities in detail.
Mark Zuckerberg, CEO of Meta, is a strong supporter of open source AI. In a recent open letter, he said that AI has the potential to “improve human productivity, creativity, and quality of life” and the power to accelerate progress in fields such as medical research and scientific discovery.
Applications and future prospects
Meta Spirit LM is designed to learn new tasks across a variety of modalities, including:
• Automatic speech recognition (ASR): Convert spoken words into written words.
• Text-to-speech (TTS): Generate spoken words from written text.
• Speech classification: Identify and classify speech based on content and emotional tone.
The Spirit LM Expressive model goes a step further by incorporating emotional cues into its speech generation.
For example, emotional states such as anger, surprise, and joy can be detected and reflected in the output, making interactions with AI more human-like and engaging.
This has significant implications for applications such as virtual assistants, customer service bots, and other conversational AI systems where more nuanced and expressive communication is essential.
Broader initiatives
Meta Spirit LM is part of a broader set of research tools and models that Meta FAIR is making available to the public. The set also includes an update to Meta’s Segment Anything Model, SAM 2.1, for image and video segmentation, which is used across fields such as medical imaging and meteorology, as well as research into improving the efficiency of large language models.
Meta’s overarching goal is to achieve advanced machine intelligence (AMI) with a focus on developing powerful and accessible AI systems.
The FAIR team has been sharing its research for more than a decade, aiming to advance AI in ways that benefit not only the technology community but society as a whole. Spirit LM is a key component of this effort, pushing the boundaries of what AI can achieve with natural language processing while supporting open science and reproducibility.
What’s next for Spirit LM?
With the release of Meta Spirit LM, Meta takes a major step forward in the integration of voice and text in AI systems.
By offering a more natural and expressive approach to AI-generated speech and open-sourcing the model, Meta is enabling the broader research community to explore new possibilities for multimodal AI applications.
Whether in ASR, TTS, or beyond, Spirit LM represents a promising advance in the field of machine learning and has the potential to power a new generation of more human-like AI interactions.