Dialogue with Jenia Jitsev, LAION Research Lead | Open AGI Forum

First Post

A Scientific Renaissance Born out of "Despair"

2025-06-24

gosim | open-source | grassroots

Author & Interviewer | Eric Wang
Produced by | GOSIM

In January 2021, when OpenAI's DALL-E model generated "stunning images" from a string of text, the entire tech world was shaken.

However, amidst the praise, a more complex emotion was quietly brewing in the hearts of a group of scientists, engineers, and enthusiasts – a "despair" mixed with awe and profound frustration.

This despair stemmed from an increasingly sharp contradiction: the most cutting-edge and revolutionary achievements in the field of artificial intelligence were being increasingly locked away in the "black boxes" of a few tech giants. Models were not open-sourced, datasets were not public, and research processes could not be reproduced. For the scientific community, which views "reproducibility" as its cornerstone, this was tantamount to having their path blocked by a huge rock. How can you study a phenomenon that you cannot even independently verify?

The spark was ignited outside an ordinary high school classroom in Germany. Christoph Schuhmann, a physics and computer science teacher, was deeply moved after reading the DALL-E paper. A strong idea sprouted in his mind: we must be able to reproduce such a model! He immediately realized that the primary and greatest obstacle was the massive training dataset that was unknown to the outside world.

On the EleutherAI Discord server, an online community of AI enthusiasts, Schuhmann put forward a seemingly simple but highly innovative idea: perhaps we can crawl image links with descriptive text from Common Crawl (a huge public web archive) to build our own dataset. At first, there were few respondents, and some people quickly drifted away for various reasons. But he did not give up.
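
The idea is easy to picture in code. Below is a minimal, hypothetical sketch of the extraction step such a pipeline performs: pulling img tags with non-trivial alt text out of crawled HTML. It is illustrative only; the actual Crawling at Home effort worked over Common Crawl's pre-parsed metadata and applied far heavier filtering, and every threshold and name here is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageAltExtractor(HTMLParser):
    """Collect (image URL, alt text) pairs from a single HTML document."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.pairs: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src, alt = attrs.get("src"), (attrs.get("alt") or "").strip()
        # Keep only images with a non-trivial caption; a real pipeline would add
        # language detection, length limits and similarity filtering on top.
        if src and len(alt.split()) >= 3:
            self.pairs.append((urljoin(self.page_url, src), alt))


html_doc = '<img src="/cat.jpg" alt="a grey cat sleeping on a sofa">'
parser = ImageAltExtractor("https://example.com/page")
parser.feed(html_doc)
print(parser.pairs)  # [('https://example.com/cat.jpg', 'a grey cat sleeping on a sofa')]
```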

Soon, a programmer named Theo Coombes became the first respondent, and the two jointly launched the "Crawling at Home" project. This marked LAION's transformation from one person's persistence to a collaborative effort. In March 2021, an independent Discord server came into being and quickly became the "nerve center" of this nascent organization.

This server was like a magnet dropped into water, quickly attracting "iron filings" from all over the world – a diverse network of scientists, developers, engineers, and ordinary enthusiasts, a "dream team" in the making:

  • Dr. Jenia Jitsev, the protagonist of our dialogue, a senior researcher at the Jülich Supercomputing Centre (JSC), holds the key to pushing this grassroots movement to an industrial scale – supercomputer computing power.
  • Richard Vencu, a senior engineer with nearly 30 years of industry experience, is responsible for building and maintaining the technical infrastructure on which LAION relies. We have also had a dialogue with him at the Open AGI Forum before.
  • Robert Kaczmarczyk, a doctor and epidemiologist, brought valuable ethical perspectives and social responsibility to the project.
  • There are also deep learning experts like Mehdi Cherti and Jan Ebert, and passionate programmers like Aarush Katta…

These individuals from different backgrounds gathered together for a common belief. They named this organization LAION (Large-scale Artificial Intelligence Open Network) and established its core tenets: 100% non-profit, 100% free. Its mission was clearly defined as "liberating machine learning research," aiming to combat the increasingly centralized and opaque trend in AI development by providing open datasets, tools, and models.

When Christoph Schuhmann's "Crawling at Home" project met Jenia Jitsev's supercomputer in Jülich, a real chemical reaction occurred. The enthusiasm of volunteers combined with world-class computing resources made it possible to reproduce and even surpass the models in the "black box."

Soon, Professor Ludwig Schmidt from Stanford University also joined in, bringing his research on OpenCLIP and the academic rigor of a top university, providing key endorsement for the quality of LAION's output.

The birth of LAION was not just out of technical curiosity, but a direct, conscious ideological response to the industry trends at the time. It represented a deliberate choice to forge a completely different path.

At the GOSIM AI Paris 2025 conference in Paris, France, we had an in-depth conversation with Dr. Jenia Jitsev, the scientific architect of this "scientific renaissance movement." His narration not only revealed the origins of LAION, its challenges, and how they continue to conduct cool "stress tests" for this booming industry through research like "Alice in Wonderland," but also demonstrated how an open network of distributed talent is changing the rules of scientific exploration.


LAION's Birth Stemmed from Despair Over "Black Boxes"

Eric Wang: Welcome, Jenia. How do you feel about the atmosphere at the GOSIM Paris AI Conference this time?

Jenia Jitsev: It feels great. I'm glad to see many old friends, like the Hugging Face and LightArch communities. I also ran into friends from the Llama Factory project. Our own OpenThoughts project uses their tools for fine-tuning and studying reasoning traces. We're also working closely with friends from Stanford and Berkeley to launch the OpenThinker models, at 32 billion and 7 billion parameters.

So you can see that the collaborative atmosphere in the open-source community is particularly good. Everyone is doing things that benefit each other. I hope to find better ways to deepen cooperation. After all, it's good to help each other and avoid reinventing the wheel, but this requires efficient organization and management.

Eric Wang: I was going to ask you to introduce yourself first, but I found that your resume has many roles. To help the audience understand you better, can you explain how these different roles and organizations are connected in your work?

Jenia Jitsev: Of course. Mehdi Cherti, Marianna Nezhurina, and I are core researchers at the non-profit organization LAION, but our actual "employer" or funding institution is the Helmholtz Association, a large research organization in Germany.

You can think of the Helmholtz Association and the Max Planck Society as the two pillars of German research. The difference is that Helmholtz focuses more on operating large, expensive research equipment that requires a lot of funding and manpower to maintain, which includes supercomputers.

Our institute is part of Helmholtz, and it's called the Jülich Supercomputing Centre, near Cologne. This center has a long tradition of hosting and operating supercomputers since the 1980s. My research laboratory was born there and has been operating there ever since. Our research funding comes from the Helmholtz Association, which ultimately comes from the German government, so we are a public research institution.

And then there's the other line, which is our non-profit research organization, LAION. It was nominally founded in Hamburg in 2021, but I say "nominally" because we don't have a physical office in Hamburg. LAION is a decentralized organization made up of a group of powerful laboratories that have come together for a common goal. These goals basically revolve around open foundational models and the datasets necessary to create them.

So, you can think of LAION as a "consortium" that connects laboratories in various independent research institutions. For example, our laboratory in Jülich, Ludwig Schmidt and his powerful laboratory at Stanford, and Rio Yokota, a senior researcher at LAION, and his student Taishi Nakamura's laboratory at the Tokyo Institute of Technology. In this sense, LAION is a network of powerful laboratories, and our own institute provides important "hosting" and support for this network. That's roughly the relationship.

Eric Wang: We previously spoke with Richard Vencu, another founder of LAION, and heard his story about LAION's origins. So, from your perspective, what's that story like?

Jenia Jitsev: Haha, that's interesting. Let's see if our stories match up. If they don't, that would be even more interesting.

But seriously, this story is indeed a bit like a movie. I'm not boasting, but the whole process was full of adventure, and looking back now, it's still fascinating. The core members of LAION joined for different reasons, but what drove us was a similar feeling of "despair"—we watched powerful models being created, models that clearly had immense research value, but we couldn't study them because they were locked in "black boxes," completely irreproducible to the outside world.

But everyone's specific motivations were slightly different. From our lab's perspective, the story truly began with an internal literature seminar. At that time, my colleague Mehdi Cherti was presenting OpenAI's paper on DALL·E 1.

That was 2021. At that time, two papers had a huge impact, and I remember both were related to Alec Radford, our hero forever: one was the CLIP paper, and the other was DALL·E. CLIP was somewhat more open; at least they released the model weights. But overall, the research was still irreproducible, and the dataset was not public. As for DALL·E, it was completely closed; we couldn't even get the model.

In that seminar, we all realized that we had to study these kinds of results as core machine learning problems because they were the source of "transfer learning," which is the holy grail of machine learning. We had to figure out why it worked.

So we immediately took action and searched online for anyone trying to reproduce it in an open-source way. As luck would have it, while the seminar was going on, we searched on Google and found some clues that someone had already built a Discord server, which seemed to be called "DALL·E Reproduction Server."

We also found LucidRains, or Phil Wang, who has a remarkable habit of personally reproducing all important research findings. We found his DALL·E reproduction project's GitHub repository and opened an issue (which can be understood as initiating a public discussion), roughly saying: "Hey, we want to do this. We see you already have some code snippets, but we want to make it truly impactful and reproduce a stronger model. We have supercomputers here and can provide enough computing power."

This post was public, and anyone could see it. I remember Mehdi Cherti writing the first comment, saying something like: "We are a group of researchers who can provide computing power, and everyone can work together."

This message immediately resonated on Discord. I'm ashamed to not remember everyone's usernames, but I remember one of the core figures at the time was Clay Mullis, who later also participated in the LAION 5B paper. They said, "Okay, we'll set up a Discord server and invite everyone who wants to reproduce DALL·E to see who can come and who can help."

That's how we, as researchers, found a group of like-minded people. Then, people like Richard, and Romain Beaumont, who works at Google in Paris, also joined. Richard seemed to be running his own small company at the time. There was also Christoph Schuhmann, a German high school teacher, who was also deeply impressed by these technological advancements and hoped that ordinary people could freely use these models. He focused more on data, believing that without data, everything was empty talk, so he tried to organize volunteers to collect data together.

So you see, the world is so wonderful: different forces converged. We were strong in model training, knowing how to operate supercomputers and how to do distributed training with hundreds or thousands of GPUs. On the other hand, there was a group of data-centric people who began to collect data.

Then, a very fortunate turning point occurred. Ludwig Schmidt, a researcher who was already well-known in academia at the time, noticed our efforts. He joined with his powerful team from the University of Washington and the Allen Institute for Artificial Intelligence. Coincidentally, they had already implemented a version of OpenCLIP and were experimenting on a small scale, and the results looked interesting. Of course, everyone understood that to study truly interesting phenomena, the scale had to be large.

So, our goal shifted from reproducing DALL·E to OpenCLIP. Because OpenCLIP's codebase was already available, and those who wanted to study it were all there. More importantly, the evaluation system for CLIP-like models was more mature. If you train a model, you can easily compare it with OpenAI's original CLIP and run benchmarks on public datasets to see how well it performs. But for generative models like DALL·E, the evaluation metrics (like FID) are much more complex, making it difficult to intuitively compare the quality of your results.

We believed CLIP was a better path because it's a representation learning model that can be used for various downstream tasks like image classification and retrieval, whereas DALL·E only generates pretty pictures and can't be used as a "model backbone" elsewhere. So we ultimately decided to focus on CLIP. Ludwig Schmidt joined, volunteers were collecting data, and our supercomputer was ready – all these factors came together, and the project truly took off.

This directly led to the release of LAION-400M and LAION-5B, two large-scale, completely open image-text datasets. We also received support from the Gauss Centre for Supercomputing, which provided us with valuable computing time. Although the computing power was not entirely sufficient, it was enough for us to complete this experiment. Since then, OpenCLIP has become a model heavily reused by the community.

Looking back, the research results themselves are important, but I think the significance of that "moment of hope" might be even greater. Suddenly, everyone understood: yes, as long as capable members of the community can unite and gather the necessary resources, it is entirely possible for us to catch up with top laboratories like OpenAI.

Because the models we trained with these open resources performed on par with OpenAI's results on many downstream task benchmarks, and some were even better. This surprised us greatly. Our initial goal might have been just to reproduce the trend of "more computing power, better models," but we actually did as well as them, or even better, and also published scaling laws that were not provided in their original research.

You see, there's an interesting detail here: even the OpenCLIP ViT-B/32 model we released a few years ago (trained on 34 billion samples from the LAION-2B dataset) still has over a million downloads every month. This shows that if you do things the right way, your results can have a lasting impact and continue to be useful to the community, rather than being a fleeting phenomenon.
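
For readers who want to see what that reuse looks like in practice, here is a short zero-shot classification sketch using the open_clip library. The model name and pretrained tag below are assumptions chosen to match the LAION-2B ViT-B/32 checkpoint Jitsev mentions; any pair returned by open_clip.list_pretrained() can be substituted.

```python
import torch
import open_clip
from PIL import Image

# Assumed checkpoint: ViT-B/32 trained on LAION-2B (~34B samples seen).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # any local test image
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot class probabilities, no task-specific training required
```

Because the model is a general-purpose image-text encoder, the same few lines cover retrieval, deduplication, or feature extraction for downstream training, which is part of why the checkpoint keeps being downloaded.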


From Neuroscience to "The Bitter Lesson"

Eric Wang: Let's turn back the clock a bit. Before the term "foundational models" became popular, what initially drew you into the field of neuroscience, particularly unsupervised learning in the visual cortex?

Jenia Jitsev: I was fascinated by the abstract process of "learning" itself. I kept thinking, how can a system capture external information, which is like floating in the air, and "engrave" it within itself? How does this process actually work? This puzzled me greatly.

The brain is undoubtedly a master in this regard. The reason I was interested in neuroscience was because the brain is a living system that has successfully achieved learning. My research direction at the time was to understand how learning mechanisms and plasticity cooperate in brain neural circuits.

But at some point, I realized that if I could abstract this problem one level higher, into the core field of machine learning, and strip away the intricate details of biology, I might make faster progress. Biological systems are too complex; some of the complexity may be related to information processing, but a large part of it may just be to maintain vital signs and has little to do with learning. It would take a very long time to fully clarify these things, and I felt it would be difficult for me to do it alone. So, I chose a simpler path, trying to understand learning at a purer, more abstract level.

Later, the entire field changed. Some of the relatively simple methods we studied at the time, such as so-called "winner-take-all-like circuits," were found to produce powerful learning systems by simply stacking them and continuously scaling them up. I also naturally turned to deep learning, which emerged around the time I was about to graduate with my Ph.D. and thinking about my future.

During that time, I still did some more biologically oriented research, such as dopamine-modulated reward-based learning mechanisms in the basal ganglia of the brain. But at the same time, my focus increasingly shifted towards classical computer vision and machine learning, eventually leading me to the Jülich Supercomputing Centre. This place, with its large computers, provided me with an excellent "soil" to "feed" those simpler but far larger networks with larger datasets than ever before. This naturally led me to the emerging field of "foundational models," which I see as the ultimate product of "transferable learning."

My core interest has never changed: what conditions are needed to achieve true "learning"—the kind of learning that can be transferred to various different scenarios and tasks? Foundational models were the first products that showed us how to go down this path. So, I started to dive in, trying to figure out why they succeeded, what significant open problems and weaknesses they still had, and to what extent their generalization capabilities could reach—because we've already seen that their generalization capabilities are far from as powerful as we imagined.

Eric Wang: In your GOSIM talk, you mentioned a watershed moment: before 2012, the entire industry relied on labeled data, and after that, huge breakthroughs were made in scalable and transferable learning. In your opinion, what was the fundamental idea that drove this shift?

Jenia Jitsev: I think the most important conceptual shift was that people finally realized that the way datasets are constructed plays a decisive role in the success or failure of learning. Previously, everyone focused more on the algorithms themselves, thinking that good algorithms were the key to everything, and the idea of "data-centric" was largely overlooked.

A clear change was that people began to understand that to build a good dataset, its data distribution must be as general as possible, and when collecting it, as little human prior knowledge and bias as possible should be added. Thus, "web-scale" datasets came into being. You just need to collect as much data from the internet as possible, and the diversity of the data itself is sufficient; you don't need to over-presume what "good" data is. Because it has been proven afterward that humans are actually very bad at judging what kind of data is beneficial or harmful for algorithms (and even for humans themselves) to learn.

At the algorithmic level, similar changes have occurred. People began to understand that the "scalability" and "generality" of algorithms are king. This is very similar to the spirit of "The Bitter Lesson" proposed by Richard Sutton, the father of reinforcement learning: if you can make a program continuously scale in the dimension of "the more computational resources invested, the better the effect," then you are on the right track.

You don't need to meticulously design elaborate rules to solve a specific problem, but rather you should think about the most general learning paradigm that can process any data you feed it.

People also understood the importance of "simplicity"—but achieving this simplicity is difficult in itself. Ultimately, what won was not the simplest thing, but the simplest thing that was most scalable. You must pursue scalability in both datasets and algorithms. You will find that even simple structures like Multi-Layer Perceptrons (MLPs) are scalable. But if you derive their scaling laws, you will find that their scaling efficiency is far less than that of the Transformer architecture.

The Transformer itself is still a relatively simple architecture, but it is simply more scalable than MLPs. You must find this delicate balance: maintaining the simplicity of the core mechanism while making it extremely scalable and general. It is these two key shifts at the dataset and algorithmic levels that have enabled us to create the models we have today that are so successful in transfer learning.
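
To make "derive their scaling laws" concrete: in its simplest form it means fitting a power law to (compute, error) measurements and comparing the fitted exponents across architectures trained on the same data. A minimal sketch with made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical measurements (illustrative values, not real results):
# compute budget in FLOPs vs. zero-shot error rate for one model family.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
error = np.array([0.52, 0.41, 0.33, 0.27, 0.22])

# A power law error(C) = a * C^(-b) is linear in log-log space,
# so ordinary least squares recovers the scaling exponent b.
slope, intercept = np.polyfit(np.log10(compute), np.log10(error), deg=1)
b, a = -slope, 10 ** intercept
print(f"scaling exponent b = {b:.3f}")

# Extrapolate to a 10x larger budget. A family with a larger b (e.g. a
# Transformer vs. an MLP) buys more improvement per unit of extra compute.
pred = a * (1e23) ** (-b)
print(f"predicted error at 1e23 FLOPs: {pred:.3f}")
```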


Big Company Sickness

Eric Wang: Now, it seems that research on foundational models can only be replicated by a few large industrial laboratories. Was this status quo the main catalyst for LAION's establishment? Or did you also hope to solve some other scientific dilemmas?

Jenia Jitsev: Even if you temporarily set aside the "black box" research problem and simply ask yourself: as a machine learning researcher, what do you need to explore the most important directions in this field? You will naturally conclude: you need open data so that others can continue to build on your work or replicate your experiments; you also need open-source training code.

This is essentially a return to the standard scientific method. If we want to study the most important phenomena in a field—and "transferable learning" and "foundational models" as its products are undoubtedly among the most important phenomena at present—then the relevant tools and materials must be open to everyone. Only in this way can we make progress faster, and in a way that is repeatedly verified, instead of the current situation, where a company releases a model, magical claims emerge, no one can verify them, and everyone spends a year or two guessing at the truth from rumors and hearsay.

So, for those of us from the research community, the main catalyst was to create an environment where real scientific research could be conducted. I think other friends from the industry are more concerned about another issue: if only a few companies monopolize these key "building blocks," then all other participants in the industry can only passively wait for one or two giants to release something next, and then eagerly use what others give them, without even knowing how it came about.

For them, eliminating this concentration of power in the industrial landscape is a more urgent issue. I think LAION has indeed played a role in this regard. But for us researchers, the motivation is very pure: we must study things in a reproducible way. Since it is currently irreproducible, we will make it reproducible with our own hands.

Of course, we are also very grateful to those closed laboratories – OpenAI, DeepMind, Meta, Anthropic – because they have undoubtedly pointed the way for the entire field. Research is like this: you see some clues, and you should follow them. But at some point, we must ensure that these studies are reproducible for the open academic community, and this is where we put a lot of effort.

Eric Wang: LAION, EleutherAI, and BigScience are all very well-known grassroots research communities. Have you ever wondered what your life trajectory would have been like if you had chosen to join a closed laboratory to research large-scale models? Also, what makes LAION unique in the broader open-source ecosystem? I see your website says "100% non-profit, 100% free."

Jenia Jitsev: First and foremost, it's great that there's diversity in this field, and closed labs have made huge contributions. For me personally, choosing the open-source path was a very pragmatic, even "selfish," decision. I'm not at all opposed to working in a closed lab, earning a high salary, and having ample funding. But my long-term feeling is that if you do research in a more open, collaborative way, you can more easily connect with top talent from around the world and leverage their wisdom and skills.

Of course, companies like OpenAI and DeepMind also gather some of the world's top researchers, right? But the problem is that once you're in a commercial entity, you have to be accountable to investors, and they'll ask you, "What's your next profit growth point?"

So, research directions will inevitably tilt towards those that are easier to report to investors. This can be a big problem for fundamental research. You might be forced to ignore some very interesting research directions that don't have short-term commercial value because you have to meet some urgent commercial needs. OpenAI's situation might be special; as a non-profit organization, they might be able to escape some of this pressure, and they have indeed had periods of long-term losses in pursuit of important research. But even so, when you start collaborating with large companies like Microsoft that have a huge influence, you'll still feel pressure, preventing you from exploring the directions you believe are most valuable at the moment.

So, I'll emphasize again that what I mean by "selfish" is: I personally believe that some of the most worthwhile research directions are not so compatible with commercialization paths.

If I knew how to make them perfectly compatible, I wouldn't mind working in a closed lab at all. But what I see in reality is that the open-source model can promote those truly important research directions in a more efficient way. Maybe in the short term, in a closed lab, you can push things forward more intensely with concentrated firepower. But I believe that in the medium to long term, a broader community with a massive number of top talents will gain greater momentum in research efficiency. This is essentially an efficiency issue. Of course, I might be wrong.

Eric Wang: You spent about 10 pages in your talk discussing scaling laws and how to achieve reproducible scaling laws with LAION-400M, LAION-2B, and OpenCLIP. In practice, what are the biggest challenges to achieving "reproducibility"? And how does this demonstrate the power of open data and models?

Jenia Jitsev: One of the clearest challenges is the construction of the dataset itself. There are so many problems here that I can't cover them all in a short time. Just going through the entire process from scratch – accessing Common Crawl, extracting useful metadata from it, and then going through layers of filtering to ensure quality – is already very difficult. Then you have to actually download the images, which is equally maddening. At the very start, it even takes courage to pour so much effort into collecting something that could easily, for any number of reasons, end up as a failed dataset.
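
As a small illustration of why the download step alone is maddening, here is a hedged sketch of what fetching a single candidate image robustly involves. At LAION scale this runs massively in parallel (the community's img2dataset tool automates it); all names and thresholds below are illustrative assumptions.

```python
import io

import requests
from PIL import Image


def fetch_image(url: str, timeout: float = 5.0, min_side: int = 64):
    """Download one candidate image; return a PIL image or None on any failure."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content))
        img.load()  # force full decoding so corrupt files fail here, not later
        if min(img.size) < min_side:
            return None  # too small to be useful training data
        return img.convert("RGB")
    except Exception:
        return None  # dead link, timeout, not an image, truncated file, ...


img = fetch_image("https://example.com/cat.jpg")
print("ok" if img is not None else "skipped")
```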

I think we were lucky. We initially expected that the models trained with the data we collected would be very mediocre and far from top-notch. But the results were better than expected, and that was a lucky moment. Of course, troubles followed. Because our data is open, some illegal samples were quickly discovered in it.

This precisely highlights a major advantage of openness: transparency. Once a problem is discovered, you can say, "Okay, let's fix it." We later collaborated with the Internet Watch Foundation (IWF) and a Canadian organization, who provided a large list of hashes of bad data. This was a big shock to me: it turns out that there is still so much bad content on the public internet that has not been removed by service providers, which is strange because they are supposed to do so. Using these hash values, we were able to release a repaired version of the LAION dataset.
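
Mechanically, that repair is a filtering pass: hash every image and drop any sample whose hash appears in the supplied blocklist. A hypothetical sketch follows; the hash function and file layout are assumptions for illustration, since the real lists define their own hashing scheme.

```python
import hashlib
from pathlib import Path

# Assumed inputs: a text file with one hex digest per line (the blocklist)
# and a local folder of candidate images.
blocklist = set(Path("bad_hashes.txt").read_text().split())


def is_blocked(image_path: Path) -> bool:
    digest = hashlib.md5(image_path.read_bytes()).hexdigest()
    return digest in blocklist


kept = [p for p in Path("images").glob("*.jpg") if not is_blocked(p)]
print(f"kept {len(kept)} images after blocklist filtering")
```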

All in all, to make scaling laws reproducible, a huge amount of work must be invested in the dataset alone. I think this requires more people to unite, because it is unhealthy for a single organization to bear all this work alone. Of course, there are other challenges. You need to be very familiar with supercomputer operations, and you need to know how to design scientific experiments to obtain good enough measurement data to plot graphs, fit meaningful curves, and accurately predict model performance at a larger scale. This is purely scientific expertise, which we also had to learn from scratch. There is also the computing time itself, and the effort involved in applying for these resources. You need to regularly write project applications to secure computing power, which is also a fairly tedious process. So, the cost of reproducibility is quite high.

I think it's important that the open-source community strengthen communication and discuss how to make these processes easier. For example, everyone can jointly apply for funding, and experienced people can share computing time and help each other. In this way, these tedious processes will not "crush" small non-profit organizations like ours. To achieve reproducibility, you must deal with challenges at three levels: technical, organizational, and scientific. I hope we are continuously learning and improving in these three aspects, so that the process is less painful for all participants doing this kind of research.


"Alice in Wonderland": When Top Models Fall Down the Rabbit Hole of Logic

Eric Wang: The "Alice in Wonderland" (AIW) research was a very fascinating part of your presentation. You are one of the co-authors, and the main author, Mariana, and another co-author, Mehdi, are also here at the GOSIM conference today.

Before we delve into this brilliant paper, I'm personally curious about one thing – who came up with such an imaginative name for it?

Jenia Jitsev: Oh, yes, actually, I did. I was so surprised when I saw the experimental results. To me, the behavior of these top models when dealing with such simple problems really created a bizarre, very strange environment. The name "Alice" was actually a bit accidental, because in computer science, when we discuss communication problems, we often use "Alice" and "Bob" as placeholders. But once you use "Alice," the association comes naturally. We observed that these models, which claim to have powerful generalization capabilities, behaved so strangely on some extremely simple tasks, and that struck me as utterly bizarre.

At that time, an illustration from the original book even appeared in my mind. I think it should be a scene from Through the Looking-Glass, the egg-shaped man named Humpty Dumpty. He sat on the wall, very authoritative and self-important, and the little girl Alice reached out to him, but he swayed and finally shattered. This image became an excellent metaphor in my mind.

Although I usually don't like those pretentious titles in our field, I couldn't resist this time, so I used this name.

Eric Wang: For the unfamiliar audience to quickly grasp it, the core finding of the AIW research is that even the top large language models exhibit extreme instability when faced with simple problems where the core logic remains completely unchanged, but only a few numbers are adjusted. When your team realized that these models had such fundamental difficulties, what was that "aha moment"?

Jenia Jitsev: To be honest, we didn't initially realize that simply changing the variables in the problem template could have such a big impact. At that time, some problems that were said to be able to "break" models were already circulating in the community, such as the logic problem about "Sally and her brothers" that circulated online in 2023.

Our problem was somewhat related to that, but also different. We initially tested only a fixed, static problem. But we found that even in this simplest case, powerful frontier models like GPT-4 sometimes yielded strangely low accuracy. I think it was when we kept exploring to find out which way of asking would make the model "fail" that we accidentally discovered this, and then realized: you actually don't need to change the problem much at all; you just need to change a few numbers, and it collapses. That moment was indeed our "aha moment."

We found that what caused the model to collapse was not a specific, tricky feature in the problem, but something more general. This indicates that there was a big problem with the model's generalization ability. For a fixed, simple logical template, it could solve the problem when filled with one set of numbers; fill in a slightly different set, and it would completely fail. We are still trying to understand why its performance is so drastically affected.
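
The AIW setup itself is tiny, which is part of the point. The published template reads roughly "Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?" (correct answer: M + 1), and the experiment is simply to sweep N and M and watch accuracy swing. A sketch of that loop is below; the model call is a placeholder, not any particular API.

```python
import random
import re

# AIW-style template: only the numbers change, the logic does not.
TEMPLATE = ("Alice has {n} brothers and she also has {m} sisters. "
            "How many sisters does Alice's brother have?")


def make_variants(k: int = 20, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(k):
        n, m = rng.randint(1, 6), rng.randint(1, 6)
        yield TEMPLATE.format(n=n, m=m), m + 1  # the brother's sisters: Alice plus her sisters


def query_model(prompt: str) -> str:
    """Placeholder for a real model call (an API or a local checkpoint)."""
    raise NotImplementedError


def accuracy(k: int = 20) -> float:
    correct = 0
    for prompt, answer in make_variants(k):
        numbers = re.findall(r"\d+", query_model(prompt))
        correct += bool(numbers) and int(numbers[-1]) == answer
    return correct / k  # in the paper, this rate swings wildly across variants
```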

Since then, our research path has been very clear. We generated a large number of variants, tried different problem templates, and constantly replaced the variables in them, always seeing the same consistent phenomenon. It wasn't until the emergence of the new generation of reasoning models (such as O1, O3) that we first saw models show some stability in handling these types of problems. But if you make the problem structure slightly more complex – still far from the difficulty of the Olympiad or graduate-level tasks they claim to be able to solve – they will also quickly collapse.

Before the advent of reasoning models, you could use any of the strongest models on the market, such as Anthropic's Claude, the GPT series, Mistral, Databricks' DBRX, and Cohere's Command R+, and without exception, they would all "fail" completely on these simple problems.

And the real problem behind this is that almost all these companies write on their official websites: "Our models are designed to help you solve serious, real-world business problems." If you see that such a simple, reasonable logical variation can cause the model to crash severely, you should understand that this is not just a flaw in the structure of the "Alice" problem. You can imagine any problem you want to solve with AI. You test it with a model and find that it works well, so you start to trust it. Then, you might make a tiny, what you consider perfectly logical, modification to the input, and it gives you a completely wrong answer without you even realizing it.

This situation is especially dangerous, especially when you don't know the correct answer yourself and are purely relying on the model to help you solve the problem. For the "Alice" problem, we can easily see that the model is wrong because we know the answer ourselves. But for a complex problem where you don't know the answer, the model will not only give a wrong response, but also confidently defend its wrong answer in a way we call "confabulations," telling you: "Yes, I've double-checked my solution, everything is fine."

Models like DBRX and Cohere's Command R+, in particular, always give users a sense of "everything is under control" with their responses, but in reality, they might be wildly wrong. In this way, users won't even realize they're being misled.

For traditional large language models, this is definitely a huge problem. For new reasoning models, we still need to observe. We still see strong performance fluctuations, but the situation is indeed much better than traditional LLMs. A good sign is that we don't often see them giving those strange, overly confident wrong answers. In fact, when you see a reasoning model start to "struggle," its performance will start to fluctuate, but at the same time, it will tell you: "Oh, wait. Oh no, I'm not sure. Hmm... this is difficult."

This indicates that it has better self-reflection capabilities, or rather, it is better at calibrating its "not knowing." This gives us a glimmer of hope for the "reasoning model" path.


From Confabulations to Reflection: Is Reasoning the Way Out, or Another Beginning?

Eric Wang: When the AIW paper was first published, reasoning models weren't as popular as they are now. So your research actually spanned two stages: one before the advent of reasoning models, testing foundational models like GPT-4 and Llama 3; and another after the advent of reasoning models, where you tested them with the same problems and found that while they improved, they still performed poorly.

This is truly puzzling. What do these persistent flaws tell us about which path we should take in the future?

Jenia Jitsev: This is a very good question. A direct, simple explanation for this might be that we simply don't have enough data for training reasoning capabilities yet.

We are currently experimenting with our own OpenThoughts dataset for this purpose. This dataset initially had just over 110,000 "reasoning traces," and we have now expanded it to one million. Perhaps, if we continue to scale up this type of data, we can truly solve this problem. Maybe at some critical point, the model will suddenly exhibit robust generalization capabilities, at least for problems of a specific complexity. Take, for example, the DeepSeek-V2-Coder model, which is fine-tuned on top of a base model using hundreds of thousands of additional reasoning samples; that amount may not be enough for the model to fully and robustly establish core reasoning capabilities.

But you do see clear improvements; the situation is getting better. So, the most intuitive approach, like all previously proven effective learning processes, is to simply scale up. Look, there was a Stanford study called LIMA, where they only used about 800 to 1000 high-quality reasoning traces to fine-tune a traditional large language model (Qwen-32B), and the performance saw a huge leap.

This naturally makes you wonder: since a small amount of high-quality data can bring about significant improvements, what if I expand this data volume to two million, five million traces? Will there be a qualitative change at some point, completely solving this problem?

Of course, we also need to be careful here. Fortunately, with open datasets like OpenThoughts now available, and efforts like Open-Mistral-v0.2 from the Hugging Face community, it's possible to verify the above conjecture. We are in a very favorable position: we can clearly see what the post-training data looks like. For example, we know that the Qwen-32B base model cannot solve the "Alice" problem; it performs very poorly on it, and we can be sure that its pre-training data does not contain similar problems. Then, you get completely open reasoning data for post-training and apply it. You can also check if this batch of data contains "Alice" problems – no, there is no data leakage. Therefore, all the model's performance improvements are due to exposure to these general reasoning traces.
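
The leakage check described here can be as simple as scanning the post-training corpus for verbatim overlap with the evaluation prompts. A minimal sketch, where the n-gram length and matching rule are arbitrary illustrative choices:

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(eval_prompt: str, training_texts, n: int = 8) -> bool:
    """Flag an eval item if any long word n-gram from it appears verbatim
    in the fine-tuning corpus; a crude but common first-pass check."""
    probe = ngrams(eval_prompt, n)
    return any(probe & ngrams(doc, n) for doc in training_texts)


eval_prompt = ("Alice has 3 brothers and she also has 6 sisters. "
               "How many sisters does Alice's brother have?")
corpus = ["Solve step by step: a train leaves the station at 9 am ..."]
print(is_contaminated(eval_prompt, corpus))  # False -> no verbatim overlap found
```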

So, I believe that before we explore more complex solutions, we must first verify this most direct path. Of course, you can think of many ways to fix this problem. However, scaling up reasoning data – this is completely feasible now, because current reasoning models are good enough to generate massive amounts of synthetic reasoning trace data, and you don't even need to painstakingly dig for real data.

Let's try to push this to the extreme first and see if simply scaling up the data is enough to make those strange performance fluctuations disappear. If it can, then we will have once again proven that this is ultimately a data problem. That would be really cool.


Outlook: The Future and Challenges of Open Communities

Eric Wang: As the GOSIM Paris AI Conference progresses, what specific expectations do you have for this conference and the connections you've made here? Also, where would you like to see next year's GOSIM Europe conference held?

Jenia Jitsev: These are all good questions, let me think. At this conference, I met friends from the Eclipse Foundation and the Linux Foundation, and we've always had a good connection with the Linux Foundation.

I hope that in the future, we can establish a more organized framework to better leverage the potential of the excellent talent that keeps emerging in the open-source community. I hope to work more closely with these large open-source foundations and help each other. This GOSIM conference provided a great opportunity for us to communicate with them face-to-face. As for the venue for GOSIM Europe... I have to say, Paris is actually the best choice.

Eric Wang: How about Germany? Your homeland.

Jenia Jitsev: Germany is in a difficult position. We have to be fair about that. Germany has a gap in machine learning research and industry. We have to fix it. France is doing much better. But LAION is there. Black Forest Labs is a big hope for Germany; they have very strong and capable people. But we need to get the policies right in Germany so that these capable people can have good conditions for development.

If GOSIM were to come to Germany, I would of course have a preference. Cologne would be a nice big city, a city full of joy, offering people plenty of opportunities to walk around. On the other hand, our supercomputing center is nearby. It's also easy for people from other parts of Germany to get to Cologne because it's a very well-connected hub. Another obvious choice is Berlin. But honestly, the machine learning scene in Berlin is somewhat in decline. I don't know why; it's a mystery to me too. But I guess a gathering hub like Cologne would be a good choice. I think many people could go there, as it's easy to reach and an interesting place to visit.

Eric Wang: Thank you so much, Jenia, for your insightful and inspiring contributions. We hope our audience found it valuable. For those who want to learn more about your work or connect with you, what are the best channels to find you?

Jenia Jitsev: We, of course, use Discord a lot. Many open-source communities are active on Discord; for whatever reason, it's a bit of a historical accident. You can come to LAION's Discord server; my Discord ID is the same as my name. You can always ping me there. You can also reach me by email; my official work email at Forschungszentrum Jülich is a good address to contact me. I don't actually use LinkedIn much; I might only see LinkedIn messages a month later – or come to top conferences like NeurIPS, ICLR, ICML, CVPR. We'll be there, and we'll communicate there. That's also a good opportunity.

Eric Wang: Thank you, Jenia. Please continue to follow GOSIM's official X and YouTube channels, as well as the Open AGI Forum, for more exciting content and updates. Thank you, Jenia. Much appreciated.

Jenia Jitsev: Thank you.

Everyone is welcome to continue following GOSIM's official accounts on platforms such as X, YouTube, and Bilibili to get cutting-edge AI research updates and exciting conversations.