Moxin 7B, Trained for $160K, Challenges Llama and the 'Bigger Is Better' AI Philosophy | Open AGI Forum

Dialogue with Yanzhi Wang, Professor, Northeastern University
2025-6-24
Author | Monica He
Interview | Eric Wang
Produced | GOSIM
As competition among large language models intensifies and AI systems grow ever more complex, who will truly drive artificial intelligence toward greater efficiency, inclusivity, and reliability? Professor Yanzhi Wang of Electrical and Computer Engineering at Northeastern University offers the industry new pathways through his pioneering research in energy-efficient intelligence and fully reproducible AI systems.
Professor Wang has long focused on real-time, energy-efficient AI systems, edge computing, and generative large language models, covering cutting-edge research in deep learning, AI architecture optimization, and large language model system engineering. His research has been recognized or adopted by prominent institutions such as Google, Microsoft, and MIT, highlighting his significant academic influence and strong industry connections.
A prominent advocate in the open-source AI community, Professor Wang led his team in developing the Moxin-7B model, which achieves performance comparable to mainstream models like Llama and Qwen at a notably low training cost of $160,000. The model also embodies systematic innovation in open-source training data and code, lightweight deployment, and reinforcement learning-based self-optimization. Addressing industry challenges such as computational bottlenecks, data anxiety, and the prevalent myth that "bigger is always better," Moxin-7B demonstrates exceptional cost-effectiveness by combining small language models with reinforcement learning. This significantly broadens the possibilities for edge intelligence, embodied intelligence, and multimodal applications. The team's commitment to full open-source practice, including sharing training data and code, also marks an important step forward for transparency and collaboration in the AI research community.
At Open AGI Forum, co-organized by GOSIM and CSDN at GOSIM AI Paris 2025, Professor Yanzhi Wang returns to share his research achievements from the past year: the innovations behind Moxin-7B, his work on replicating and extending reinforcement learning frameworks, and his views on the evolving AI ecosystem.
Key Insights Preview:
- Moxin-7B is a fully open-source 7-billion-parameter model trained from scratch, with all training data and code publicly available for community replication and verification.
- Future AI chip development may shift toward larger-memory architectures. If extremely large MoE-style language models become mainstream, today's GPUs, with high compute but limited memory, could hit their limits.
- Enhancing small language models via fine-tuning typically involves distilling data from large language models, which requires manual annotation and verification. Therefore, directly improving small language models using reinforcement learning without manual data annotation could be revolutionary.
- DeepSeek and Qwen have made significant progress in openness compared to foreign counterparts like Llama, which despite being labeled "open-source," has restrictive licensing conditions. Large enterprises often require explicit authorization from Meta when using Llama-based models at scale.
- Llama remains one of the few U.S.-dominated large language models with open-source attributes. Thus, even if models like Qwen or DeepSeek outperform Llama 4, Llama models may still be used due to licensing constraints, creating a strategic opportunity for domestic models in cutting-edge applications such as embodied intelligence.
1. Moxin-7B: Open-Source Innovation and Reinforcement Learning Breakthroughs in Small Language Models
Echo Tang: What advancements has your research team made in the past year?
Yanzhi Wang: This year, our main focus was developing the Moxin-7B model. Unlike models such as Qwen or Llama, which don't fully open-source their training data and code, Moxin-7B was trained entirely from scratch with everything publicly available. This allows the research community to fully replicate our training process, enhancing security and transparency. Community users can also apply the model to their own datasets, performing optimizations like quantization and distillation. Having the training data available significantly improves accuracy when adapting the model in these ways.
When performing distillation and quantization of models like Qwen or DeepSeek, we also benefited from this foundation, seeing accuracy improvements of 0.5% to 0.8%. This underscores the importance of training a model from scratch. Currently, both our base model and chat model versions of Moxin-7B match or even outperform mainstream 7B versions such as Llama 3 and Qwen 2.5 in several metrics. It's noteworthy that the 7B model version of Qwen 3 didn't see substantial metric improvements, as their primary focus was on MoE models. Although the 7B models have certain limitations, our model shows clear advantages in some areas, especially compared to the Llama series.
Apart from model metrics, we made significant efforts in reinforcement learning, replicating a reinforcement learning framework and algorithms from DeepSeek.
DeepSeek's core innovation is their GRPO method, which enhances intermediate reasoning steps via reinforcement learning, especially in tasks with verifiable final outcomes, such as math or programming problems. Traditionally, other models rely on manually annotated intermediate reasoning steps. DeepSeek suggests this reinforcement learning approach is mainly effective for large language models, but we found that this is not entirely accurate. Small language models enhanced via fine-tuning typically require data distilled from large language models, which involves complex manual annotation. If small language models can instead improve themselves directly through reinforcement learning, the need for manual annotation disappears, enabling self-improvement in tasks like mathematics.
We implemented this framework and observed significant performance boosts in math and programming abilities. This approach is particularly valuable in scenarios lacking structured data, like specialized programming tasks (e.g., Rust programming) or embodied intelligence. Though difficult to obtain extensive training data in these areas, final results (e.g., successful code execution or robot task completion) remain verifiable. Hence, as long as outcomes are verifiable, intermediate processes can self-optimize through reinforcement learning, without manual intervention.
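To make the outcome-verification idea concrete, here is a minimal Python sketch (not the team's actual code) of a GRPO-style setup: a group of sampled completions is scored only on whether the final answer checks out, and each sample's advantage is computed relative to its own group rather than by a learned critic. The reward rule and helper names are illustrative.

```python
import re
import statistics

def verifiable_math_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 if the final numeric answer matches the reference, else 0.0.
    Only the outcome is checked; intermediate reasoning is left unsupervised."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sample's reward against its own group,
    avoiding a separate learned value/critic network."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Toy usage: a group of sampled completions for one math prompt.
completions = [
    "Step 1: 12 * 7 = 84. Answer: 84",
    "I think the answer is 72",
    "84",
    "The product is 84",
]
rewards = [verifiable_math_reward(c, "84") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Samples that verify correctly receive positive advantages and are reinforced; incorrect ones are pushed down, all without any hand-labeled reasoning steps.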
Replicating this reinforcement learning approach for small language models is meaningful, as DeepSeek's largest model is prohibitively expensive to fine-tune. Our method allows any small, edge-side model to continuously improve on its own. Crucially, even without relevant data at the start, models can generate it themselves, which is profoundly significant. We demonstrated a real-time edge-side translation application at GOSIM, capable of making decisions or processing each step within a second, significantly faster than previous implementations. Through quantization and fine-tuning, we achieved notable performance improvements. Compared to our previous demonstration in Beijing, which required NVIDIA platforms, we now accomplish this on a single laptop at much higher speed. In the future, our models could run on even smaller edge devices such as smartphones, embodied intelligence devices, or smart glasses, supported by model-specific optimization methods.
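As an illustration of the kind of edge-side compression involved, the following is a hedged sketch of loading a 7B chat model with 4-bit weight quantization via Hugging Face Transformers and bitsandbytes; the checkpoint identifier is assumed, and the team's actual deployment pipeline may use different tooling (e.g., llama.cpp or custom kernels).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16 for accuracy
)

model_id = "moxin-org/moxin-chat-7b"        # assumed Hub identifier, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on whatever GPU/CPU memory is available
)

inputs = tokenizer("Translate to French: good morning", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```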
Echo Tang: Your team achieved remarkably low training costs and replicated DeepSeek’s reinforcement learning work. Could you detail these achievements, especially regarding high performance, low cost, and full transparency?
Yanzhi Wang: We uphold the principle of full transparency. As later entrants, we leveraged existing research and improved upon existing model architectures, which reduced the extra innovation workload. Our $160,000 training cost represents actual computational expenses, excluding soft costs. Our goal isn't artificially low pricing but providing a transparent cost benchmark for the community. Replicating a similar 7B model like Llama 3 using our approach would cost roughly the same, with no special technical barriers involved.
Moreover, we avoided NVIDIA's most expensive GPUs, opting instead for cost-effective solutions, adapted and optimized using frameworks like Colossal-AI, a critical factor in cost management.
From a technical standpoint, our primary validation was that reinforcement learning significantly enhances model performance, notably through our implemented GRPO algorithm, achieving commendable experimental results.
Echo Tang: Why base your architecture on Mistral 7B instead of the Llama series?
Yanzhi Wang: Licensing considerations primarily guided this decision. Llama models have restrictive usage terms limiting modifications and redistributions. Mistral's architecture is more openly licensed, explicitly allowing reuse, prompting our structural adjustments to maintain performance and enhance accuracy.
Echo Tang: Could you elaborate on the GRPO algorithm's technical details?
Yanzhi Wang: Our RL optimization included critical improvements, notably tokenizer optimization. Domestic models often favor Chinese tokenizers, while Mistral’s tokenizer emphasizes English, underperforming in Chinese contexts. We adopted a multilingual tokenizer strategy similar to MoE, significantly improving bilingual processing and cross-language capabilities.
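One simple way to see why the tokenizer matters is to measure how many tokens each tokenizer spends per character of English versus Chinese text; an English-centric tokenizer fragments Chinese far more heavily. The sketch below is illustrative, and the model names are examples rather than the exact tokenizers the team compared.

```python
from transformers import AutoTokenizer

samples = {
    "english": "Reinforcement learning improves reasoning in small language models.",
    "chinese": "强化学习可以提升小语言模型的推理能力。",
}

# Compare tokenization efficiency of an English-centric vs. a bilingual tokenizer.
for name in ["mistralai/Mistral-7B-v0.1", "Qwen/Qwen2.5-7B"]:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in samples.items():
        n_tokens = len(tok(text)["input_ids"])
        print(f"{name:30s} {lang:8s} {n_tokens / len(text):.2f} tokens per character")
```

A lower tokens-per-character ratio on Chinese text means longer effective context and faster generation for bilingual workloads, which is the motivation behind the multilingual tokenizer strategy.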
Our greatest innovation in GRPO is developing a generalized reinforcement learning framework. Traditionally, reinforcement learning optimization focused mainly on specific tasks—for example, DeepSeek improved mathematical capabilities. However, our framework is scalable and transferable, suitable for various tasks including translation, code generation, embodied intelligence, and even new material discovery.
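The idea of a generalized, transferable framework can be illustrated with a small sketch in which each task registers its own verifier while the reinforcement learning loop only ever consumes a scalar reward; all names here are hypothetical, not the project's actual interface.

```python
from typing import Callable, Dict

RewardFn = Callable[[str, str], float]   # (model_output, reference_or_spec) -> reward

def math_reward(output: str, gold: str) -> float:
    return 1.0 if output.strip().endswith(gold) else 0.0

def code_reward(output: str, test_cmd: str) -> float:
    # In practice: write the output to a file, run the test command in a sandbox,
    # and return 1.0 on exit code 0. Stubbed here to keep the sketch self-contained.
    return 0.0

REWARD_REGISTRY: Dict[str, RewardFn] = {
    "math": math_reward,
    "code": code_reward,
}

def score(task: str, output: str, reference: str) -> float:
    """Dispatch to the task-specific verifier; the surrounding GRPO update only
    ever sees a scalar reward, so adding a new task requires no trainer changes."""
    return REWARD_REGISTRY[task](output, reference)

print(score("math", "The answer is 42", "42"))   # -> 1.0
```

Swapping in a verifier for translation quality, robot task completion, or material-property checks would follow the same pattern, which is what makes the framework transferable across domains.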
2."Pseudo-Open Source" or Technical Barrier? Revealing the True Difficulty of Fully Open-Sourcing Large language models
Echo Tang: When I first saw the release of the Moxin-7B open-source model, I felt strongly that the open-source landscape for large language models has persistent challenges. The community strongly desires truly open-source models, with complete openness in training code, datasets, and model weights. Although teams like DeepSeek claim open-source status, strictly speaking, they still withhold critical training mechanisms or core architectures. How difficult is it to achieve true open-source for large language models?
Yanzhi Wang: That's certainly the case. Yet, we must acknowledge that DeepSeek and Qwen have already made significant strides. In contrast, the Llama series from abroad, despite claiming to be "open-source," has notable licensing restrictions. For instance, large enterprises using Llama or its derivatives at scale need explicit permission from Meta. Thus, in terms of openness, DeepSeek is quite advanced. However, a key technology of DeepSeek—the large-scale MoE training—currently leads the industry, making it challenging for other teams to replicate quickly. This partly explains why their core training code isn't fully open-sourced, which is understandable.
Our team has only replicated parts of DeepSeek’s technologies, particularly those related to reinforcement learning, but not their MoE architecture. The reason is practical: MoE’s high memory consumption makes it unsuitable for devices like smartphones. Hence, we prioritize balancing lightweight and practical implementations.
Echo Tang: Could you elaborate on the challenges here, such as Llama’s difficulties with MoE, Qwen’s progress, and specifically why DeepSeek is so challenging to catch up with?
Yanzhi Wang: When we started the Moxin project, we noticed DeepSeek’s V2 release, which initially employed an MoE architecture with around 200-300 billion parameters but only activated about 25 billion. At that stage, DeepSeek didn't significantly outperform mainstream models.
However, when they scaled up to a total of 610B-671B parameters with 35B active parameters, the model capabilities substantially improved. Despite roughly half the computational resources compared to Llama or Qwen's 70B models, DeepSeek achieved remarkable performance improvements in accuracy and mathematical reasoning, proving the effectiveness of their MoE approach.
More specifically, training such large-scale MoE designs is extremely challenging, involving model parallelism, load balancing, and gradient stability issues. DeepSeek's success demonstrates substantial expertise in engineering and algorithm optimization.
This breakthrough made international teams, including Llama, anxious. Llama quickly pivoted towards MoE architectures but lacked DeepSeek’s training experience and engineering knowledge, leading to less impressive results and lukewarm industry responses.
Qwen has made better progress but still lags behind DeepSeek in large-scale MoE implementations. However, in small and medium-sized models, Qwen performs exceptionally well, representing one of the leading open-source models today.
Overall, DeepSeek and Qwen embody the most competitive technological paths within the open-source ecosystem, with DeepSeek significantly ahead in large-scale model architecture and engineering capabilities.
Echo Tang: Indeed, recent reports suggest DeepSeek’s advances have pressured Meta's Llama team. Do you have industry insights regarding large language model training?
Yanzhi Wang: While I don't have detailed internal information, many U.S. institutions face political and compliance restrictions influencing their choice of foundational models, typically preferring Llama. Major cloud providers like Amazon AWS, despite recognizing superior performance from DeepSeek or Qwen, still favor Llama 4 for compliance and safety.
This reality shows that Llama remains one of the few U.S.-dominated open-source large language models. Thus, even if surpassed by Qwen or DeepSeek, Llama might still be adopted out of necessity, creating market opportunities for domestic models in cutting-edge fields like embodied intelligence.
Replacing foundational models built on Llama with Qwen significantly boosts overall system accuracy and performance, positively impacting subsequent research and publications. Moving forward, DeepSeek or Qwen as base models provide clear advantages over the Llama series, offering greater innovation potential and practical effectiveness. This trend has advanced more rapidly than anticipated; a year ago, Qwen and Llama seemed evenly matched in benchmarks, but real-world usage reveals Qwen’s superior practicality and robustness.
3. Training Large Language Models Isn’t About "Stacking Data," but "Refining Data"
Echo Tang: Industry trends indicate data quality and strategy increasingly affect large language model effectiveness. Data bottlenecks remain core challenges, with low-quality data creating numerous problems and synthetic data becoming a hot discussion topic. Could you share your key observations on data issues and detail the data strategy for the Moxin-7B project?
Yanzhi Wang: For the Moxin project, we adopted a cautious, quality-first approach. Our main datasets include:
SlimPajama (based on RedPajama V2): After filtering out invalid content, only around 4% of the original 20GB remained usable, less than 1GB in total.
DCLM dataset: High-quality corpus from Apple, carefully screened for effectiveness.
TextCoder dataset: Primarily for code generation, significantly enhancing logical reasoning.
Overall, Moxin’s total training data is under 2GB, yet its quality ensures excellent model performance. Expanding data to 15GB with low-quality content wouldn’t significantly benefit the model.
We prioritized high-quality sources like Wikipedia during training, assigning greater weight to enhance fundamental knowledge. We limited synthetic data to specific reinforcement learning scenarios, verifying accuracy rigorously. Given current data availability, synthetic data remains unnecessary for smaller models but might become essential for larger ones. A paradox exists: generating high-quality synthetic data requires stronger models, which themselves depend on extensive quality data—a critical bottleneck.
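A common way to express this kind of source prioritization is weighted sampling over corpora, sketched below with purely illustrative weights rather than the Moxin project's actual data recipe.

```python
import random

# Each source: sampling weight. Higher weight = seen proportionally more often.
mixture = {
    "wikipedia": 3.0,       # up-weighted high-quality encyclopedic text
    "slimpajama": 1.0,
    "code": 1.5,
}

sources = list(mixture)
weights = [mixture[s] for s in sources]

def next_document(loaders):
    """Pick a source by weight, then draw the next document from that source's iterator."""
    src = random.choices(sources, weights=weights, k=1)[0]
    return src, next(loaders[src])

# Toy loaders standing in for real streaming datasets.
loaders = {s: iter(f"{s}_doc_{i}" for i in range(10)) for s in sources}
for _ in range(5):
    print(next_document(loaders))
```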
Echo Tang: Previously, larger investments meant better outcomes. With reduced training costs, how has model training changed?
Yanzhi Wang: My view remains tentative: GPUs like NVIDIA RTX 4090, powerful yet memory-constrained, may not align with future model trends. Devices with moderate computing power but larger memory and bandwidth could become essential if trends like DeepSeek persist. Consequently, technologies like NVLink might lose attractiveness.
AIPC may become a crucial trend, even in cloud testing environments. While current AIPC devices have limitations, we successfully deployed DeepSeek on some.
An ideal AIPC should locally deploy models like DeepSeek or Qwen’s largest versions, support various agents (dialogue systems, translation, embodied intelligence), and host comprehensive AI application ecosystems—similar to a local App Store without ongoing cloud-based costs, contrasting with services like OpenAI’s subscription model.
Echo Tang: Developers currently focus on model performance in code scenarios, especially backend development. What shortcomings exist in mainstream models, and how does your solution address them?
Yanzhi Wang: Mainstream models like Claude excel in frontend tasks (UI and web development) but struggle with complex backend development tasks (kernel programming, migrating C to Rust, writing render engines). These tasks demand precise language adherence, memory safety, and performance.
Our approach leverages reinforcement learning to continually optimize model behavior for Rust and similar systems languages, ensuring syntactic correctness alongside practical metrics like maintainability and efficiency. These goals are embedded in GRPO’s reward function, enabling steady quality improvements, something general-purpose models like Claude cannot offer, since they cannot be continuously and optimally iterated through reinforcement learning.
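As one possible shape for such a reward, the sketch below checks that a generated Rust snippet compiles with `cargo check` and adds a small bonus when `cargo clippy` reports no warnings; the scratch-project path and weights are assumptions for illustration, not the team's actual reward function.

```python
import pathlib
import subprocess

SCRATCH_DIR = pathlib.Path("/tmp/rust_reward_scratch")   # assumed pre-created `cargo new` project

def rust_reward(generated_code: str) -> float:
    """Composite reward: compiling is a hard requirement, lint-clean code earns a bonus."""
    (SCRATCH_DIR / "src" / "main.rs").write_text(generated_code)

    compiles = subprocess.run(
        ["cargo", "check", "--quiet"], cwd=SCRATCH_DIR, capture_output=True
    ).returncode == 0
    if not compiles:
        return 0.0                                        # syntactic or type errors: no reward

    clippy_clean = subprocess.run(
        ["cargo", "clippy", "--quiet", "--", "-D", "warnings"],
        cwd=SCRATCH_DIR, capture_output=True,
    ).returncode == 0
    return 1.0 + (0.2 if clippy_clean else 0.0)           # small bonus for lint-clean code

print(rust_reward('fn main() { println!("hello"); }'))
```

Because both signals are fully automatic, the model can be iterated through reinforcement learning on Rust tasks without any human-labeled intermediate steps.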
4. Collaboration Between Large and Small Language Models: From Edge Deployment to Multitask Hybrid Architectures
Echo Tang: Discussions around collaboration between large and small language models have surged recently, reversing previous neglect of small language models' utility. Could you elaborate based on your academic and industry experiences?
Yanzhi Wang: Previously, small language models were undervalued, with reliance solely on large language models. However, edge deployment offers distinct advantages, especially regarding privacy, security, and stability. Small language models effectively support edge tasks economically, emphasizing their essential role alongside large language models.
However, small language models cannot solve all problems on their own, making collaboration between large and small language models the ideal approach: small language models handle tasks they can manage at the edge, while complex tasks are delegated to large cloud-based models.
Previously, this collaboration was difficult because small language models lacked sufficient capabilities and accuracy for general applications. But thanks to technologies like DeepSeek, accuracy has improved significantly through knowledge distillation. For instance, small language models with just 150 million parameters can now effectively handle tasks like generating reports, performing remarkably well. Though these models might not excel in general tasks, they are highly effective in specialized domains such as translation, writing assistance, and report processing. Complex fields like mathematics might still pose some challenges, but overall, their performance is quite satisfactory.
The key to enhancing small language model performance lies in knowledge distillation techniques, especially from models with "Chain of Thought" reasoning capabilities. Small language models don't inherently possess this reasoning ability—it is distilled from larger models. Combining open-source large language models with open-source small language models enables teams and companies to leverage these capabilities using open technologies. This open-source ecosystem greatly accelerates progress, encouraging broader participation and promoting collaborative development between large and small language models.
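A minimal sketch of this kind of chain-of-thought distillation, under the assumption that teacher traces are kept only when their final answer verifies, might look as follows; the teacher call is stubbed and all names and prompts are illustrative.

```python
def generate_with_teacher(prompt: str) -> str:
    """Stub for a call to a large reasoning model (cloud API or local MoE)."""
    return "Step 1: 6 * 7 = 42.\nFinal answer: 42"

def answer_is_correct(trace: str, gold: str) -> bool:
    """Outcome check: keep a reasoning trace only if its final answer matches."""
    return trace.strip().endswith(gold)

problems = [("What is 6 * 7?", "42"), ("What is 9 + 5?", "14")]

sft_dataset = []
for question, gold in problems:
    trace = generate_with_teacher(f"Solve step by step:\n{question}")
    if answer_is_correct(trace, gold):                 # keep only verified traces
        sft_dataset.append({"prompt": question, "completion": trace})

print(len(sft_dataset), "verified traces ready for student fine-tuning")
```

The student then learns the teacher's step-by-step style through ordinary supervised fine-tuning on these verified traces, which is how reasoning ability transfers to models that never had it natively.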
This trend towards collaboration between large and small language models is clear and likely to shape future developments. Of course, technological evolution is uncertain, so this prediction isn't absolute, but I personally see strong potential in this approach.
Echo Tang: Could you share your latest progress and future plans in embodied intelligence?
Yanzhi Wang: We've recently achieved significant advances in embodied intelligence. By using our own models and training methods, coupled with optimizations to OpenVLA (a Stanford-proposed embodied intelligence model), we've greatly improved both speed and accuracy—specifically, increasing speed by more than 30 times without sacrificing accuracy.
This means real-time control for embodied intelligence is achievable even on resource-limited platforms such as mobile phones. Previously, such systems required server support, but now we can enable fully server-independent rapid responses. We are now exploring more complex tasks previously considered difficult, such as playing ping-pong or opening doors. Two key advantages of this work include extremely low latency response times and the ability to self-learn without extensive demonstration samples. For instance, in playing ping-pong, even if initial performance isn't perfect, the system quickly learns effective responses. This capability mirrors early reinforcement learning successes in Atari games—a crucial technical precursor to AlphaGo.
Echo Tang: So, you believe future AI models will increasingly focus on edge deployment, potentially leading to hybrid cloud-edge architectures?
Yanzhi Wang: Absolutely. The future will move towards cloud-edge integration, with edge devices primarily handling task distribution and local responses. Some tasks, such as translation, must be completed at the edge due to often unreliable network conditions encountered during travel or large events, making cloud reliance impractical.
Not all tasks must be edge-based, though; many still require cloud-based computing resources. The optimal strategy involves task-specific collaborative deployment, leveraging the strengths of each side.
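Such task-specific collaborative deployment can be sketched as a simple router that keeps confident, offline, or latency-critical requests on the local small model and escalates the rest to the cloud; the confidence measure, threshold, and stub calls below are assumptions for illustration.

```python
import math

def local_generate(prompt: str) -> tuple[str, float]:
    """Stub for the on-device small model. Returns (answer, mean token log-prob)."""
    return "Bonjour", -0.3

def cloud_generate(prompt: str) -> str:
    """Stub for the cloud-hosted large model (only called when escalating)."""
    return "Bonjour, comment puis-je vous aider ?"

CONFIDENCE_THRESHOLD = math.log(0.5)   # escalate if mean token probability drops below 0.5

def answer(prompt: str, network_available: bool) -> str:
    local_answer, mean_logprob = local_generate(prompt)
    if mean_logprob >= CONFIDENCE_THRESHOLD or not network_available:
        return local_answer                       # stay on-device: fast, private, works offline
    return cloud_generate(prompt)                 # delegate hard cases to the cloud

print(answer("Translate to French: hello", network_available=True))
```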
Echo Tang: Thank you very much, Professor Wang, for sharing your insights in this GOSIM Open AGI Forum episode.
5. From Llama to Moxin: Moving Beyond the "Big Model Only" Mindset
In this in-depth discussion on future AI trends, Professor Yanzhi Wang demonstrated profound academic expertise and keen industry foresight. Through the Moxin-7B project, he validated the vast potential of small language models in low-cost, high-performance edge deployment, further enhancing their capabilities through reinforcement learning and enabling "self-evolution," thereby reducing traditional reliance on ultra-large language models.
As Professor Wang emphasized, future AI development should not endlessly pursue model scaling but instead focus on efficiency, transparency, and reproducibility. His advocated architecture—combining cloud-edge integration with large-small language model collaboration—may become crucial for achieving widespread adoption and practical implementation of AI.
In this evolving era that increasingly moves away from the "big model only" mindset, Professor Yanzhi Wang is undoubtedly contributing visionary academic strength and technological conviction to the open-source AI ecosystem.