Scaling Remains the Most Crucial Factor in Model Training | Open AGI Forum


A Conversation with Baosong Yang, Scientist at Alibaba Tongyi Lab

2025-6-24

gosim, open-source, grassroots

Author & Interview | Echo Tang
Edited | Shiqi Wang, Monica He
Produced | GOSIM

In recent years, as large language model (LLM) technology has advanced rapidly around the world, "going global" has become an unavoidable topic for every AI company. In this fierce global competition, Qwen, developed by Alibaba's Tongyi Lab, stands out prominently. As one of the most downloaded models on Hugging Face, Qwen is even more popular with European and American users than with Chinese users. Behind this international success lie Tongyi Lab's strategic vision and its deep expertise in multilingual technology.

During the GOSIM AI Paris conference in France, Echo Tang, Executive Editor of CSDN's New Programmer, had an insightful conversation with Baosong Yang, scientist at Alibaba's Tongyi Lab. Yang highlighted that internationalization was a core strategy from the very beginning, prioritizing multilingual data optimization to ensure fair access to AI technology globally.

Yang pointed out that taking AI models global isn't simply about language translation. It involves navigating the complex cultural norms and taboos associated with each language. To tackle this, the Qwen team established a sophisticated cultural annotation system, covering thousands of granular categories, to ensure content safety, compliance, and effectiveness globally. Despite the considerable effort required, Yang views this as immensely valuable.

Discussing challenges in multilingual AI, Yang mentioned the widespread "multilingual reasoning dilemma," where models often blend multiple languages when responding to queries in less-supported languages—a problem yet unsolved in the industry. Tongyi Lab addresses this issue by using native languages for reasoning in well-supported languages and defaulting to English for lower-resource languages, ensuring stable and consistent outputs.

The conversation also touched on the trending "Scaling Law" topic. Yang believes continuously expanding model size and data quantity remains critical. However, when faced with constraints on computing power and available data, enhancing "knowledge density" and using high-quality synthetic data emerge as new paths forward. Accordingly, the Qwen team is actively experimenting with model-generated data to overcome current scaling bottlenecks.

The full dialogue transcript follows:

Echo Tang: Hello everyone, welcome to the Open AGI Forum hosted by GOSIM. I'm Echo Tang from CSDN. We're honored to have Baosong Yang, scientist at Alibaba's Tongyi Lab, joining us today at GOSIM AI Paris to share his experiences and the latest updates on Qwen. Welcome, Yang! Could you please introduce yourself first?

Baosong Yang: Hello everyone, I'm Baosong Yang from Alibaba's Tongyi Lab. I currently lead the team responsible for Qwen's multilingual capabilities, including machine translation.

Echo Tang: There's a funny debate about how Qwen should be pronounced in English: is it "Queen" or "Q wen"? Could you clarify this officially?

Baosong Yang: Honestly, everyone can call it whatever they prefer, but officially it's pronounced "Queen." Of course, "Q wen" also sounds nice. Domestically, everyone prefers calling it "Qianwen," which flows naturally.

(Note: In English, "Q" typically makes a /kw/ sound as in "queen" or "quick." Thus, non-Chinese speakers unfamiliar with pinyin naturally pronounce "Qwen" as "queen," leading to humorous comparisons among developers that "Llama is King, Qwen is Queen.")

Echo Tang: How do you feel attending GOSIM AI Paris? Any special insights?

Baosong Yang: Paris is beautiful, full of history and great food. This conference gave us a rare opportunity to meet open-source colleagues worldwide, including engineers from Hugging Face. All major open-source organizations attended. We also met many new-generation researchers and developers, like the author of Llama Factory, making this conference an excellent networking opportunity. In China, we usually focus heavily on research and rarely have direct global exchanges like this.

Echo Tang: Qwen's download numbers on Hugging Face are very high. Did you discuss this with their engineers?

Baosong Yang: Yes, the Hugging Face engineers think highly of Qwen, especially its derivatives. More than 100,000 models are now based on Qwen.

Echo Tang: Indeed, our statistics also show that Qwen is hugely popular worldwide, notably in North America and Silicon Valley. Can you talk about how Qwen is performing globally?

Baosong Yang: Interestingly, when we released Qwen Chat, English-speaking users topped our audience, while Chinese-speaking users ranked third or fourth. This reflects our global reach.

Echo Tang: Why isn’t Chinese among the top two? Is this strategic?

Baosong Yang: Exactly. Internationalization has been a core strategy from the start. We prioritized multilingual training data to ensure global usability, even if it somewhat reduced Chinese or English-specific performance.

Echo Tang: You've always led Qwen’s multilingual efforts. Is this related to your academic background?

Baosong Yang: Yes, I started in machine translation around 2013, when deep learning was emerging. Choosing AI research back then required courage, but by 2015-2016, the AI field boomed. My research has always centered on multilingual translation because I believe multilingual capabilities are essential for true "technology democratization." We don't want only China and the US to enjoy AI benefits. Multilingual AI is crucial for global equity and practical for global business operations. Personally, I envision real-time translation through smart glasses, making all foreign languages accessible—once a distant dream, now rapidly approaching reality.


1. Challenges and Progress in Multilingual AI

Echo Tang: Qwen’s multilingual success must have involved challenges. Could you share some?

Baosong Yang: The toughest challenges involve data quality and labeling standards, especially for low-resource languages. Cultural differences mean certain content can be acceptable in one language but taboo in another, requiring extensive labeling and annotation. We've built detailed annotation systems to address these complexities—time-consuming but essential.

Echo Tang: Qwen3 was recently released, and we followed it closely. Were there any particularly tough challenges you solved this time?

Baosong Yang: Multilingual reasoning was very challenging. Ideally, models reason internally in the same language as user queries. However, many models mix languages, confusing users and wasting computational resources. Our compromise is reasoning internally in the native language if proficient; otherwise, defaulting to English. Although not perfect, this provides consistency and stability.
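
(Note: the routing policy Yang describes can be sketched in a few lines of Python. The set of "well supported" languages below is purely illustrative and is not Qwen's actual configuration.)

```python
# Illustrative sketch of the reasoning-language routing policy described above.
# The supported-language set is an assumption for demonstration only.
WELL_SUPPORTED = {"en", "zh", "es", "fr", "de", "ja", "ko", "ru", "pt", "ar"}

def reasoning_language(query_language: str) -> str:
    """Reason in the user's language when it is well supported; otherwise fall back to English."""
    return query_language if query_language in WELL_SUPPORTED else "en"

print(reasoning_language("fr"))  # -> "fr"
print(reasoning_language("sw"))  # -> "en" (a lower-resource language falls back to English reasoning)
```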

Echo Tang: Indeed, I've experienced this issue with other models myself. Has any model solved this completely?

Baosong Yang: Currently, no model has fully solved it, and it remains a significant challenge. High-resource languages handle it better, but low-resource languages pose severe difficulties. Complete multilingual reasoning in every language remains a tough frontier.

Echo Tang: Another common issue with reasoning processes (Chain-of-Thought) is their token-heavy nature. Any design improvements?

Baosong Yang: We've introduced selectable modes in Qwen3: users can choose a detailed mode that shows the full reasoning, or a concise mode that skips the extended reasoning steps. Researchers are also exploring more efficient thought processes. I expect rapid progress, given how many teams are focused on optimizing this.
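
(Note: in the Hugging Face transformers interface, Qwen3 exposes this switch as an enable_thinking flag on its chat template, as described in the Qwen3 model cards. The snippet below is a minimal sketch; the checkpoint name and generation settings are illustrative.)

```python
# Minimal sketch: toggling Qwen3's detailed-reasoning vs. concise mode via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # any Qwen3 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the idea behind scaling laws in one sentence."}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # concise mode; set True to get the full reasoning trace
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```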

Echo Tang: From what you're saying, it seems like significant progress is likely soon?

Baosong Yang: Definitely. Many Ph.D. students, academic researchers, and corporate teams—including our own—are deeply engaged in this issue. With intense focus industry-wide, substantial improvements should come quickly. Soon, you probably won’t have to deal with unnecessarily long reasoning chains anymore.


2. Training Large Language Models: Beyond Stacking Data

Echo Tang: Let’s move on to Scaling Law and model size. Currently, the industry faces a ceiling regarding data and computational power. There's a lot of debate on whether the classical Scaling Law remains effective. You mentioned "knowledge density" during your talk at GOSIM. Could you explain the role of knowledge density in large models?

Baosong Yang: Simply put, knowledge density refers to the concentration of useful information within the data. It's a crucial area of research. Over the past half year, we've observed small models with around 4B parameters matching or surpassing the performance of last year's much larger 70B models, largely due to improvements in the knowledge density and quality of the training data. Previously, training focused more on the sheer volume of data and parameters; now, more attention goes to refining the data: removing noise and redundancy and improving overall information quality. This approach allows models to learn just as effectively, or even more effectively, from less data. Another direct benefit is higher training efficiency, which significantly reduces computational costs.
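
(Note: a concrete, if simplified, way to picture "raising knowledge density" is aggressive deduplication and quality filtering before pretraining. The heuristics and thresholds below are illustrative assumptions, not Tongyi Lab's pipeline.)

```python
# Illustrative sketch: two cheap heuristics that raise knowledge density by dropping
# near-duplicate and low-information documents before training.
import hashlib
import re

def fingerprint(text: str) -> str:
    """Normalize a document to its word sequence and hash it, to catch trivially duplicated pages."""
    words = re.findall(r"\w+", text.lower())
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()

def is_low_information(text: str, min_words: int = 50, max_repeated_lines: float = 0.3) -> bool:
    """Reject very short documents and documents dominated by repeated boilerplate lines."""
    if len(text.split()) < min_words:
        return True
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and 1 - len(set(lines)) / len(lines) > max_repeated_lines

def densify(corpus: list[str]) -> list[str]:
    """Keep one copy of each document and drop low-value ones."""
    seen, kept = set(), []
    for doc in corpus:
        if is_low_information(doc):
            continue
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```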

Echo Tang: Is improving knowledge density a main focus currently?

Baosong Yang: It's certainly one of the primary directions, although maybe not the only crucial one.

Echo Tang: Then, what's currently the most crucial direction, or the top priority for improving model capabilities?

Baosong Yang: From a foundational research perspective, I still believe scaling—making models larger and feeding them more data—is key. Even synthetic data efforts, in some ways, aim to support scaling by providing additional high-quality data. For me, expanding model size and data quantity remains the priority.

Echo Tang: Given current constraints in computational resources and raw data, based on your multilingual experience, what methods can help extend the Scaling Law? Many consider synthetic data an effective approach. What's your perspective?

Baosong Yang: I see two main directions in synthetic data. First, models creating new knowledge themselves—such as generating high-quality Q&A pairs or summaries from recent news—to enhance reasoning capabilities. This self-generated knowledge could expand beyond the limits of available original data, essentially enabling models to evolve further. The second, more mature direction involves increasing knowledge density by compressing redundant data and filtering low-value content to ensure cleaner, richer datasets. Both strategies are essential for continuing the Scaling Law.
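
(Note: the first direction, models generating new training knowledge, can be pictured with a small sketch. The `generate` callable stands in for any chat-style LLM call, local or hosted; the prompt wording and the naive parsing are illustrative assumptions.)

```python
# Hedged sketch: turn fresh source documents into Q&A pairs for later training.
from typing import Callable

def synthesize_qa(documents: list[str], generate: Callable[[str], str],
                  pairs_per_doc: int = 3) -> list[dict]:
    qa_pairs = []
    for doc in documents:
        prompt = (
            f"Read the passage below and write {pairs_per_doc} question-answer pairs "
            "that a careful reader could verify from the text alone.\n\n"
            f"Passage:\n{doc}\n\nFormat each pair as:\nQ: ...\nA: ..."
        )
        question = None
        for line in generate(prompt).splitlines():
            if line.startswith("Q:"):
                question = line[2:].strip()
            elif line.startswith("A:") and question:
                # A real pipeline would also verify the answer against the passage
                # before admitting the pair into a training set.
                qa_pairs.append({"question": question, "answer": line[2:].strip()})
                question = None
    return qa_pairs
```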

Echo Tang: Indeed, at NVIDIA GTC, Jensen Huang highlighted synthetic data's importance for scaling. However, synthetic data raises another issue: if everyone uses model-generated data, won't that content re-enter training datasets, leading to a recursive cycle where AI trains on its own output? Does this risk narrowing model knowledge?

Baosong Yang: This "data feedback loop" is indeed worth attention. But synthetic data isn’t new—machine-generated content existed long before ChatGPT. Previously, the internet was already filled with template-generated and machine-translated texts, which models have long used. Now, due to improved AI writing quality, there's heightened concern. However, the issue itself isn’t fundamentally new.

Echo Tang: That’s interesting—many AI concepts seem to revolve around longstanding issues. Reinforcement learning, synthetic data, hallucinations—all these existed long before today's LLM boom. Why do we see these "old" issues re-emerging now?

Baosong Yang: Technology often evolves through longstanding issues. Neural networks, proposed decades ago, only recently flourished due to advances in computing power and data availability. Older technologies revive when conditions mature. Today’s LLM excitement reflects the growth and expansion of longstanding principles and challenges.

Echo Tang: How should model training handle the vast influx of synthetic data?

Baosong Yang: First, synthetic data is inevitable and shouldn’t cause panic. Second, AI-generated content often matches or surpasses human-written quality. The key concern is filtering incorrect or hallucinated information. High-quality AI-generated content can still be valuable, provided it's balanced with diverse human-created data to avoid monotonous output.

Echo Tang: So synthetic data, once vetted for quality, can be included but controlled proportionally?

Baosong Yang: Exactly. We're open to high-quality synthetic data but maintain balanced proportions to ensure diversity. Explicit labeling of synthetic data is being explored, but distinguishing human from AI content isn't always crucial. Using different data sources at different training stages is usually sufficient.
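
(Note: "balanced proportions" can be made concrete with a simple stage-wise sampler. The ratio and stage size below are assumptions for illustration, not Qwen's actual training recipe.)

```python
# Illustrative sketch: cap the share of synthetic text in each training stage so that
# human-written data keeps the mixture diverse.
import random

def build_stage_mixture(human_docs: list[str], synthetic_docs: list[str],
                        synthetic_ratio: float = 0.2, stage_size: int = 10_000,
                        seed: int = 0) -> list[str]:
    """Sample one training stage containing at most `synthetic_ratio` synthetic documents."""
    rng = random.Random(seed)
    n_synth = min(int(stage_size * synthetic_ratio), len(synthetic_docs))
    n_human = min(stage_size - n_synth, len(human_docs))
    mixture = rng.sample(synthetic_docs, n_synth) + rng.sample(human_docs, n_human)
    rng.shuffle(mixture)
    return mixture
```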

Echo Tang: Regarding model sizes, various teams adopt different strategies. DeepSeek favors ultra-large models, while Qwen and many open-source models offer a range of sizes, including smaller models. What’s Qwen’s approach?

Baosong Yang: Our strategy grows out of the collaboration between Alibaba Cloud and Tongyi Lab: many customers need efficient, low-latency solutions for edge devices, so we prioritize smaller, efficient models that suit diverse deployment environments.

Additionally, smaller models reduce barriers for academia and developers, promoting widespread innovation and ecosystem growth.

Echo Tang: Indeed, many successful open-source models now mention fine-tuning based on Qwen. How does this validate your "small yet powerful" approach?

Baosong Yang: Seeing widespread adoption of Qwen-based models in various papers and projects validates our approach. Small models can indeed achieve substantial performance, making edge deployments—like on phones or smart glasses—more feasible.


3. The Future of AI Peripherals and System Integration

Echo Tang: Speaking of edge deployments, what's Qwen’s vision for AI peripherals like smart glasses?

Baosong Yang: AI peripherals like smart glasses and earbuds increasingly demand compact, efficient models for tasks like voice recognition and real-time translation. Qwen plans to offer optimized small-model versions for such deployments.

Echo Tang: Can you give a specific example of AI glasses? Take Meta's smart glasses, for instance—I was really excited when they were released, especially because they claimed to incorporate Llama model capabilities. So, I immediately bought a pair to try them out. However, when I attempted voice interaction, for some reason, the Llama integration didn't feel very effective. Right now, it’s more like a stylish wearable device rather than something genuinely smart.

My expectations were much higher. Ideally, I'd love to wear these glasses in the Louvre and have them explain the Mona Lisa directly to me in my language. Or when encountering an unfamiliar yet interesting painting, they could instantly tell me the artist’s name, historical context, and underlying symbolism. Similarly, when traveling abroad, they could translate conversations around me in real-time, letting me know what people are saying. In short, I envision smart glasses as a tool that enhances many aspects of everyday life and work. But as things stand today, their actual performance is still far from meeting these expectations.

Baosong Yang: Current smart glasses, like Meta's, are early-stage technologies. Early adoption involves trade-offs. Just as early smartphones evolved dramatically, AI peripherals will gradually mature in functionality and reliability. Hardware limitations like battery life and computing power must improve, alongside software advancements like multimodal integration and external interactions. Achieving real-time multilingual translation and instant contextual information remains challenging, but active research and development efforts promise substantial progress.

Echo Tang: Given Qwen’s capabilities at this point, how advanced can AI peripherals realistically become? What's your ideal goal?

Baosong Yang: Currently, many automotive companies, smartphone makers, and smart hardware firms, both domestic and international, are collaborating with Alibaba Cloud's Tongyi team, embedding Qwen’s model capabilities into their devices. Our lab previously released a Mobile-Agent system—not specifically from our multilingual group but part of Tongyi Lab's achievements. This system is essentially a vision-based smart agent capable of controlling smartphones visually. For example, if you instruct your phone to "order food," Mobile-Agent can simulate a person performing the ordering process on the screen. It already handles some straightforward tasks, and more complex operations should soon be achievable as development progresses.

For smart glasses or earbuds, targeted visual recognition and voice assistance functions are already possible. You can point your glasses at text, and it can display translated text within the lenses—basic OCR and translation are technologically feasible now. However, replacing the entire viewed text in real-time in its original context (such as converting French museum labels into Chinese directly in your vision) remains challenging. Techniques like rendering text within images and Diffusion-generated visuals are still not perfect, especially for non-Latin scripts. Recent attempts often produce awkward results, visually cluttered and practically unusable.

Still, I believe breakthroughs could happen within a year or two, given the intensive research in this area. Once effective methods emerge, practical applications will rapidly follow. Real-time simultaneous translation through earbuds, as you mentioned, is relatively mature technology today. For most major languages, we already have functional speech-to-speech translation and real-time text transcription. However, less common languages remain challenging due to limited data, highlighting ongoing multilingual data issues.

Echo Tang: We’ve discussed embedding models into various smart devices. I'm curious about integration at the system level. People are eager to see if phone and computer manufacturers might integrate large models directly into operating systems. Currently, most AI usage on phones involves separate apps like Tongyi, ChatGPT, or Gemini, distinct from native system experiences. How advanced is model-OS integration now?

Baosong Yang: Technically, most functions could integrate into systems, but fully open capabilities are still too risky. Large models sometimes produce hallucinations or inaccuracies, making deep integration potentially problematic. However, most interaction scenarios don't require perfect accuracy. Compared to earlier assistants like Siri, large models provide significantly better intent understanding and companionship, fueling rapid voice assistant advancements. The next real challenge involves highly deterministic, safety-critical operations where caution remains necessary.

Echo Tang: Indeed, device-side AI discussions primarily revolve around phones and cars. On phones, concerns focus on data privacy and security, while car integrations involve physical safety. What changes might occur from integrating large models and Agent technology in vehicles, and how should safety be approached?

Baosong Yang: I should clarify I'm not directly involved in automotive AI, so this is purely my perspective. Many car functions could benefit from AI, but not necessarily all through large models. Voice assistants controlling windows or air conditioning could significantly enhance user experiences compared to traditional fixed commands. Similarly, large models could enrich car navigation and travel planning. For instance, you could instruct the AI, "I'm in Paris now and want lunch followed by sightseeing—recommend restaurants and sights and plan the route." Unlike traditional systems that require manual input, large-model agents can seamlessly handle multiple tasks, coordinating reviews, selections, and navigation automatically. However, safety-critical tasks like autonomous driving decisions probably won't rely directly on large models soon.

Established sensor and control systems already handle these areas effectively. Large models excel at human-machine interaction and decision support, not precise vehicle controls. Interaction-level safety can be improved through continuous model enhancements and defensive strategies, but control-level safety will likely remain cautiously separate.

Echo Tang: Indeed, people generally accept large models for entertainment and interaction, but driving decisions remain cautious. We'll closely follow industry developments.

Baosong Yang: Absolutely, safety remains paramount.


4. Replacement or Empowerment: New Roles and the Multilingual Road Ahead

Echo Tang: Lastly, from your personal journey, transitioning from machine translation to multilingual capabilities in large models, you've witnessed several major advances in language-related AI. Many debates center around whether large models will "replace" various professions. Can you discuss the pros and cons of technological replacement? For instance, AI already provides decent machine translation and simultaneous interpretation, which poses threats to translators, opportunities for industry growth, and benefits for the general public. What's your perspective?

Baosong Yang: I'll address this from two angles. First, regarding machine translation itself, large models haven't replaced it but have become a new technological pathway within the field, similar to neural machine translation replacing statistical translation previously. The introduction of large models expands possibilities, such as supporting more minor languages, multimodal translations including images and videos, and real-time interpretation. These capabilities broaden the field and enhance industry value, benefiting more people overall.

Regarding human translators and interpreters, large models currently function primarily as productivity tools, analogous to code autocomplete tools for programmers. Why should we toil ten hours daily when productivity tools could enable more leisure? If AI can handle repetitive tasks, professionals should leverage this efficiency gain for higher-value activities, assuming societal attitudes and employment standards evolve accordingly. Forcing workers to use AI to increase productivity without fair compensation will undoubtedly cause anxiety and injustice.

Long-term, truly skilled linguists, translators, and creative professionals are less likely to be fully replaced because their output is judged qualitatively rather than simply right or wrong. Unlike coding, where correctness is binary, translations and writings can vary greatly in style and nuance. Human-produced content often possesses unique creative flair or subtlety that AI can't easily replicate, ensuring continued demand for top-tier human talent.

Echo Tang: While your intention was to ease public anxiety, I suspect your comments may have inadvertently made programmers on platforms like CSDN even more anxious.

Baosong Yang: Haha, there's really no need for anxiety. I'm confident that new roles and opportunities will emerge in the future. For instance, we might see positions like AI programmers or Prompt engineers, specialists dedicated to interacting with various AI models and helping them perform tasks better—similar to current Agent orchestration. The recently popular MCP protocol, for instance, coordinates multiple models and tools. People skilled at combining different AI tools to solve complex problems will become highly sought after. This represents a new form of development work, shifting focus from traditional coding logic to orchestrating multiple AIs.

Echo Tang: Speaking of this, you're right—some AI programming and AI agent platforms are already performing similar tasks. Development tools like Cursor can directly invoke various models and plugins to help developers orchestrate complex tasks. Many domestic teams are also exploring this area. So, the "new programmer" role you're mentioning is already taking shape.

Baosong Yang: Exactly, fundamentally it's the same exploration happening now. I want to emphasize that as AI capabilities expand across industries and devices, the demand for talent in this field will only grow. Once underlying AI interfaces and protocols mature across smartphones, cars, glasses, etc., developing various AI applications will become straightforward and highly popular. At that point, "AI application developers" may become even more common than today's mobile or web developers, making it an area with tremendous potential.

Echo Tang: Well said. As we conclude, let's refocus on your specialty—multilingual AI. For you and others dedicated to multilingual large models, what milestones have been achieved so far? And what's the next key challenge you hope to overcome? Please also share your vision for future directions.

Baosong Yang: Overall, multilingual progress in large models is slower compared to trending areas like mathematics or programming. Though not the hottest topic, multilingual capabilities are crucial. With Qwen, we've achieved a basic level of multilingual proficiency, where models can understand instructions and respond fluently in languages with abundant data. But two major unresolved issues persist. First is cultural alignment—users from different cultures expect varied responses even to simple questions like "What to eat tonight?" For example, Chinese users might prefer local cuisine recommendations, while Arabic users might require considerations like Halal restrictions. Future AI assistants must meet diverse cultural needs in role-playing, emotional companionship, and various scenarios, which is challenging. Our recent evaluations indicate that many large models currently reflect either predominantly Eastern or Anglo-American cultural biases, highlighting significant room for improvement.

Echo Tang: Do you think this issue is solvable?

Baosong Yang: Relatively speaking, yes. Reinforcement learning and value alignment methods can help adjust model behaviors across cultures. However, the deeper difficulty is comprehensive knowledge coverage across cultures, requiring expanded data, increased model parameters, or incorporating RAG (Retrieval Augmented Generation). Still, cultural alignment isn't the toughest problem. The more challenging issue is generating natural, human-like language outputs across languages. Even in widely-used languages like Chinese and English, model outputs occasionally feel unnatural. With thousands of languages globally, current large models fluently support only dozens, less than 1%. For most languages, models understand reasonably well but struggle to generate quality, natural text.

This issue is challenging because language generation heavily relies on Scaling—massive data and large models—which is problematic for many languages due to data scarcity. Synthetic data generation or cross-modal transfers (converting speech or visual data to text) may become necessary solutions. We're actively exploring these strategies.

Echo Tang: Are these immediate issues for you or more strategic, long-term problems?

Baosong Yang: They are both immediate tasks and long-term strategic challenges. As researchers, we must balance immediate problem-solving with long-term research commitment. Our team is investing in these areas despite resource constraints. However, these challenges also make research exciting and motivating. Besides the two issues I mentioned, a third significant area involves integrating multilingual and multimodal capabilities, which has limited current research. Recognizing multilingual text in images, generating multilingual speech and subtitles, and understanding various dialects are just starting points. As text, images, and audio increasingly merge, there will undoubtedly be substantial research opportunities.

Echo Tang: It sounds like you're calling for more researchers to join forces!

Baosong Yang: Yes, we definitely hope more people will join the effort. By open-sourcing the Qwen project and providing a high-performance foundational model, we aim to collaborate broadly with the open-source community. We've benefited immensely from community contributions—high-quality multilingual datasets like FineWeb and CulturaX, various benchmarks, and data filtering techniques are community achievements. These multilingual challenges can only be effectively tackled collaboratively. Achieving AGI is beyond the reach of a single team or company; it requires the joint effort of thousands of scientists and developers. For example, FineWeb 2.0 uses native language experts or crowdsourced personnel to clean internet data meticulously—efforts beyond our team's capability but achievable through community cooperation. This highlights the true value of open-source collaboration.

Echo Tang: That's the charm of open source! Let's all enjoy open source and enjoy AI together! Thank you, Mr. Yang, for today's insightful and engaging discussion. Thanks also to our audience for watching. Please follow GOSIM on X, YouTube, Bilibili, and other platforms. We will continue bringing more insights from the frontiers of open source and AI. That's all for today—see you next time!