The Transformer architecture powers today’s most popular public and private AI models. So what comes next? Is this the architecture that leads to better inference? Today, building intelligence into models requires large amounts of data, GPU compute power and rare talent, which makes models expensive to build and maintain.
The adoption of AI began with making simple chatbots more intelligent. Since then, startups and large enterprises have figured out how to package intelligence as a copilot that augments human knowledge and skills. The natural next step is to package multi-step workflows, memory and personalization in the form of an agent that can solve use cases across multiple functions, including sales and engineering. The hope is that from a simple user prompt, the agent can classify the intent, break the goal into steps and complete the task, whether that means searching the internet, authenticating into multiple tools or learning from past repetitive behavior.
Applying these agents to consumer use cases, we can see a future where everyone has a personal agent like Jarvis on their phone that understands them. Want to book a trip to Hawaii, order food at your favorite restaurant, or manage your personal finances? A future where you can use a personalized agent to securely manage these tasks is possible, but from a technology perspective, we are still a long way from that future.
Is the Transformer architecture the final frontier?
The self-attention mechanism in the Transformer architecture allows a model to weigh the importance of each input token against every other token in the input sequence simultaneously. This improves the model’s understanding of language and images by capturing long-range dependencies and complex token relationships. However, long sequences (DNA, for example) drive up computational complexity, which leads to poor performance and high memory consumption. Solutions and research approaches to the long-sequence problem include:
- Transformer improvements on hardware: The most promising technique here is FlashAttention. The paper argues that Transformer performance can be improved by carefully managing reads and writes across the GPU’s fast and slow memory tiers. This is achieved by making the attention algorithm IO-aware, reducing the number of reads and writes between the GPU’s high-bandwidth memory (HBM) and its on-chip static random access memory (SRAM).
- Approximate attention: Self-attention has O(n^2) complexity, where n is the length of the input sequence. Can that quadratic cost be brought down to something closer to linear so that Transformers handle long sequences better? Optimizations here include techniques such as Reformer, Performer, Skyformer and others (a toy illustration of the quadratic cost follows this list).
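To see where the quadratic cost comes from, here is a minimal NumPy sketch of vanilla self-attention (not FlashAttention or any of the approximate methods above; the shapes and random inputs are purely illustrative). The (n, n) score matrix it materializes is exactly what IO-aware kernels avoid writing to slow memory and what linear-attention methods avoid computing in full.

```python
# Minimal sketch of vanilla self-attention. The (n, n) score matrix is the
# quadratic term: doubling the sequence length quadruples its size.
import numpy as np

def naive_self_attention(x, w_q, w_k, w_v):
    """x: (n, d) token embeddings; w_q, w_k, w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # shape (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = naive_self_attention(x, w_q, w_k, w_v)
print(out.shape, f"score matrix entries: {n * n:,}")
```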
In addition to these optimizations to reduce the Transformer’s complexity, several alternative models are challenging the Transformer’s dominance (although most are still in their infancy).
- State Space Models: These are a class of models related to recurrent (RNN) and convolutional (CNN) neural networks that compute with linear or near-linear complexity on long sequences. State space models (SSMs) such as Mamba handle long-range relationships well, but their overall performance still trails Transformers (a simplified SSM recurrence is sketched after this list).
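For contrast, below is a heavily simplified sketch of the linear recurrence at the heart of SSMs. Real models such as Mamba use structured, input-dependent parameters and a hardware-friendly parallel scan, none of which appear here; the only point is that each token updates a fixed-size hidden state, so cost grows linearly with sequence length and no n-by-n matrix is ever built.

```python
# Toy linear state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# Each step touches only a fixed-size state, so the scan is O(n) in sequence length.
import numpy as np

def ssm_scan(x, A, B, C):
    """x: (n, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:
        h = A @ h + B @ x_t          # constant-size state update per token
        outputs.append(C @ h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
n, d_in, d_state, d_out = 4096, 16, 32, 16
y = ssm_scan(rng.standard_normal((n, d_in)),
             0.9 * np.eye(d_state),                       # stable toy transition matrix
             0.1 * rng.standard_normal((d_state, d_in)),
             0.1 * rng.standard_normal((d_out, d_state)))
print(y.shape)  # (4096, 16), with memory independent of n^2
```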
These research approaches have now left university labs and are publicly available as new models for anyone to try. The latest model releases also signal the state of the underlying technology and which alternatives to the Transformer look viable.
Notable model releases
Familiar contributors like OpenAI, Cohere, Anthropic and Mistral continue to release their latest and greatest models. Meta’s foundation model focused on compiler optimization is notable for how effective it is at code and compiler optimization tasks.
In addition to the mainstream Transformer architecture, production-grade state space models (SSM), hybrid SSM-Transformer models, mixture-of-experts (MoE) and composition-of-experts (CoE) models are now emerging, and they appear to perform better across multiple benchmarks than state-of-the-art open-source models.
- Databricks open-sourced DBRX: This MoE model has 132B parameters in total, spread across 16 experts of which four are active at any time during inference or training (see the routing sketch after this list). It supports a 32K context window and was trained on 12T tokens. Other interesting details: it took three months, roughly $10M and 3,072 Nvidia GPUs connected over 3.2Tbps InfiniBand to complete pre-training, post-training, evaluation, red-teaming and model refinement.
- SambaNova Systems released Samba-CoE v0.2: This CoE model consists of five 7B-parameter experts, only one of which is active during inference. The experts are all open-source models, and alongside them the model includes a router that determines which expert is best suited to a given query and forwards the request to it. It is very fast, generating 330 tokens per second.
- AI21 Labs released Jamba: This hybrid Transformer-Mamba MoE model is the first production-grade Mamba-based model with elements of the traditional Transformer architecture. “The Transformer model has two drawbacks. First, its high memory and compute requirements prevent it from handling long contexts, making the key-value (KV) cache size a limiting factor. Second, since there is no single summary state, each generated token performs computation across the entire context, slowing down inference and reducing throughput.” SSMs such as Mamba handle long-range relationships better but lag behind Transformers in quality. Jamba compensates for the inherent limitations of a pure SSM model, offering a 256K context window and fitting 140K tokens of context on a single GPU.
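The expert-routing idea behind both DBRX and Samba-CoE can be illustrated with a toy sketch. This is not either vendor’s implementation; the router weights, expert matrices and the 4-of-16 choice below are made up for illustration. It simply shows how a gating network picks a few experts per input so that only a fraction of the total parameters do work on any given token.

```python
# Toy sparse Mixture-of-Experts routing: score all experts, run only the top-k.
import numpy as np

def moe_forward(x, router_w, experts, k=4):
    """x: (d,) token representation; router_w: (n_experts, d); experts: list of (d, d) matrices."""
    logits = router_w @ x                               # one score per expert
    top = np.argsort(logits)[-k:]                       # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                                # softmax over selected experts only
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 16
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, d)) / np.sqrt(d)
out = moe_forward(rng.standard_normal(d), router_w, experts, k=4)   # 4 of 16 active, as in DBRX
print(out.shape)
```

With k=1 the same routing pattern becomes the composition-of-experts setup, where the router forwards the whole request to the single best-suited model.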
Challenges for corporate adoption
While there is great excitement around the latest research and model releases pushing this next frontier, we must also consider the technical challenges that are preventing enterprises from realizing the benefits.
- Lack of enterprise features: Imagine pitching a CXO a product that lacks basic features like role-based access control (RBAC), single sign-on (SSO) and access to logs (both prompts and outputs). Today’s models may not be enterprise-ready, but enterprises are carving out dedicated budgets to make sure they don’t miss out on the next big thing.
- Breaking what used to work: AI copilots and agents complicate the job of securing data and applications. Consider a simple use case: the video-conferencing app you use every day adds an AI summarization feature. As a user, you might appreciate getting a transcript after the meeting, but in a highly regulated industry this enhancement can suddenly become a nightmare for the CISO. In effect, something that worked fine before now breaks and must go through additional security review. Enterprises need guardrails that ensure data privacy and compliance when SaaS apps introduce features like this.
- The ongoing RAG vs. fine-tuning debate: It is possible to deploy retrieval-augmented generation (RAG) and fine-tuning together, or either one on its own, without sacrificing much. RAG is seen as the way to keep responses factually grounded and the information up to date, while fine-tuning generally delivers the best model quality. Fine-tuning is hard, and some model vendors discourage it; it also brings challenges such as overfitting, which hurts model quality. Fine-tuning is under pressure from multiple sides: as model context windows widen and token costs fall, RAG may become the better deployment option for enterprises. In the RAG context, the recently released Command R+ model from Cohere is the first open-weights model to beat GPT-4 in the chatbot arena. Command R+ is a state-of-the-art RAG-optimized model designed to power enterprise-grade workflows (a bare-bones RAG sketch follows this list).
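To make the RAG pattern concrete, here is a bare-bones, dependency-free sketch. It is not Cohere’s or any vendor’s pipeline: the word-overlap retriever and the sample documents stand in for a real embedding model and vector store, and the assembled prompt would be sent to whichever LLM API you use.

```python
# Minimal RAG sketch: retrieve the most relevant documents, then build a grounded prompt.
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query (a stand-in for embeddings)."""
    q_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Paste the retrieved context into the prompt so answers stay grounded and current."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Command R+ is optimized for retrieval-augmented generation workflows.",
    "DBRX is a mixture-of-experts model with 16 experts, 4 active per token.",
    "Jamba combines Mamba-style state space layers with Transformer blocks.",
]
print(build_prompt("Which model is optimized for RAG?", docs))
```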
I recently spoke with an AI leader at a major financial institution who claimed that the future doesn’t belong to software engineers, but to creative English/Arts students who can create effective prompts. There may be some truth in this comment; with a quick sketch and a multimodal model, even non-technical people can create simple applications without too much effort. Knowing how to use such tools is a superpower and can be useful to anyone looking to succeed in their career.
The same is true for researchers, practitioners and founders. There are now multiple architectures to choose from when trying to make the underlying models cheaper, faster and more accurate, and there are many ways to adapt a model to a specific use case, including fine-tuning techniques and newer breakthroughs such as direct preference optimization (DPO), an algorithm that can be seen as an alternative to reinforcement learning from human feedback (RLHF).
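As a rough illustration of why DPO is attractive, here is a minimal NumPy sketch of its loss on made-up numbers. It assumes you already have sequence log-probabilities for the chosen and rejected responses from both the policy being trained and a frozen reference model; the values and the beta setting below are illustrative only.

```python
# Sketch of the DPO objective: push the policy to prefer chosen over rejected
# responses more strongly than a frozen reference model does.
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs are per-example sequence log-probabilities with shape (batch,)."""
    chosen_margin = logp_chosen - ref_logp_chosen        # policy vs. reference on the preferred answer
    rejected_margin = logp_rejected - ref_logp_rejected  # policy vs. reference on the rejected answer
    logits = beta * (chosen_margin - rejected_margin)
    return float(-np.log(1.0 / (1.0 + np.exp(-logits))).mean())  # mean of -log(sigmoid(logits))

# Toy batch of two preference pairs where the policy already leans the right way.
loss = dpo_loss(np.array([-12.0, -9.5]), np.array([-14.0, -9.0]),
                np.array([-12.5, -10.0]), np.array([-13.5, -9.2]))
print(round(loss, 4))
```

The appeal is that, unlike RLHF, no separate reward model or reinforcement-learning loop is needed; the preference pairs are used to update the policy directly.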
There’s a lot of rapid change happening in the field of generative AI, and it can feel overwhelming for founders and buyers alike to prioritize, so I’m excited to see what those building something new come up with next.
Ashish Kakran is a principal at Thomvest Ventures, a firm focused on investing in early-stage cloud, data/ML and cybersecurity startups.