Everyone who tries to use a language model will, sooner or later, experience the phenomenon of artificial intelligence hallucination "firsthand." It can take various forms: citations of non-existent court cases, non-existent literary works, references to non-existent functions in various programming languages, made-up facts, fabricated steps in mathematical proofs, and so on. Where does this come from, and can anything be done about it?
I conducted my research on language models (mainly Gemini 2.5 Pro and Gemini 3.1 Pro from Google), although I also ran isolated experiments on other models, such as ChatGPT, Grok, and the locally run Qwen3.5. I want to share my thoughts on the subject here.
Blind Evolution: Why does no one know how AI works?
Generally, the causes of hallucinations, in my opinion, lie in the way neural networks are trained. When creating neural networks, people took a bit of a "shortcut": they didn't design something entirely new, but rather looked at how the human brain is built and tried to make something similar. However, humans do not fully understand how the brain works, and when creating a neural network in a computer system, they likewise cannot precisely determine what is happening inside it. Yes, the architecture of the network is known, but the function that this network performs is not designed by a human. It is the result of a so-called training process, which resembles evolution. Humans try to steer this evolution to achieve a desired function performed by the network, but these matters are so complex that humans do not truly control it.
I wrote earlier that people tried to make something similar to the human brain. Indeed, it is something similar. But it is not a one-to-one copy. It is an architecture inspired by the connections of neurons in the brain; however, the brain, apart from transmitting electrical signals between neurons (which electronics replicate), also has an entire biochemical environment that is not replicated. Furthermore, the neural network used in AI is currently unidirectional, while it is not fully known how the network in the human brain works: the network of synapses is probably more complicated than in AI, where there are simply successive layers of neurons. One layer has connections only with the previous and the next layer, and the flow of signals always goes from the input layer, through the hidden layers, all the way to the output layer. (This description is slightly simplified; in the currently used transformer architecture it looks a bit different, but the stream of information always flows unidirectionally, from input to output, during the inference phase.)
I won't discuss the architecture of neural networks in detail here (there is much I don't know myself), but I just wanted to point out that a neural network is a simplification of the brain's structure, not a copy of it. For now, let's just assume that a neural network is a certain number of neurons (kind of like cells storing numbers) divided into groups (so-called layers). Cells from one layer contain numbers (results of calculations generated in the previous layer) and connections to cells in the next layer, where the results of calculations from the current layer are sent. How much of the result from a given neuron in the current layer goes to a specific neuron in the next layer depends on a certain number (a so-called weight) connecting these two neurons. There are billions of these connections, each with its own weight, so a neural network is, simply put, a gigantic set of numbers.
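The weighted connections described here can be sketched in a few lines of Python. All values and weights below are invented purely for illustration; a real network holds billions of such numbers:

```python
# A toy neuron: its value is the weighted sum of the previous layer's values.
# All numbers here are made up for illustration.

prev_layer = [0.5, -1.2, 3.0]   # values held by three neurons in the previous layer
weights = [0.8, 0.1, -0.4]      # one weight per connection into our neuron

# How much each previous neuron contributes is value * weight;
# the neuron stores the sum of all contributions.
neuron_value = sum(v * w for v, w in zip(prev_layer, weights))
print(neuron_value)  # 0.5*0.8 + (-1.2)*0.1 + 3.0*(-0.4) = -0.92
```

Scaling this up, with thousands of such neurons per layer and each one carrying its own weights, is exactly what produces the "gigantic set of numbers".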
The structure of the neural network (the number of neurons, the number of layers, and a few other things) is designed by a human. But the weights themselves (that set of numbers) are not designed. They emerged as a result of evolution. A human doesn't know what each weight means or what it is responsible for. A human merely designed the computer system, designed the path through which evolution can proceed, filled this computer system with a massive amount of random values, set a goal (providing satisfactory answers), and "flipped the switch" starting the evolution. What happened inside after that, and how we ended up with a system capable of logically answering our queries - God only knows.
True, some people try to guess what happens inside neural networks, and various theories have been created, but how exactly the process of "thinking" proceeds in a multi-layer network, we do not know. Existing explanations are for now very simplified: they resemble wading ankle-deep at the water's edge, while understanding what happens in the depths of multi-layer networks would require the ability to dive to the bottom of a deep lake. Humans do not yet possess such an ability.
The Beaker Metaphor: How does AI understand words?
Now I will write a few words about the architecture of a neural network. It is not my goal to explain in detail how it works. However, I think that basic knowledge on this subject will be necessary to understand certain issues. Therefore, I will try to explain in a simplified way how a neural network works and how it interprets concepts expressed in human language.
Let's imagine a neuron as a measuring beaker (a small glass with a scale allowing you to read the level). The water level can be negative or positive (zero is half the height of the beaker). At the input, we have a group of several tens of thousands of beakers - there are as many of them as the number of tokens the network can recognize (a token is a few letters forming a word, something like a syllable, although in reality, the division of words into tokens does not correspond to syllables). When we give the neural network some word (let's take a simple, one-syllable word, e.g., "dog"), we pour a drop of water into the beaker corresponding to the token "dog". We don't touch the others (they have a level indicating zero). And the whole machine starts. Since all beakers except one ("dog") have a value of zero, they do nothing. But the one with the word "dog" is not zero - it has the value of one drop. And this one drop controls valves letting a certain amount of water drops into a second group of beakers or letting a certain amount of drops out of those beakers.
The second group of beakers is less numerous than the first - it might contain, for example, about 12,000 beakers. How many drops we add to or release from each beaker in the second group depends on a certain coefficient, individual for each pair of beakers. Once we establish the water levels in all beakers of the second group, we finish the first step and repeat the operation. We take the first beaker from the second group and, based on its water level, we add or release a certain amount of water from the first beaker of the third group. Then, based on the water level in this first beaker of the second group, we add or release a different amount of water from the second beaker of the third group. And so on.
There are several dozen of these groups of beakers (they are called layers in a neural network). It's worth noting that after the initial narrowing down from tens of thousands of tokens (input) to those roughly 12,000 beakers (second layer), all subsequent internal layers up to the penultimate one usually have exactly the same width (in our example, those 12,000 beakers). The signal simply flows deeper and deeper through them. Only at the very end does the network expand back to tens of thousands to indicate the probability of the next word. The total number of beakers (neurons) is counted in the millions, and the number of connections between the beakers can be, for example, 170 billion.
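As a sanity check on the sizes quoted above, here is a rough calculation in Python. The specific values (50,000 tokens, width 12,288, 96 layers, and the rule of thumb of about 12 × width² weights per transformer layer) are my assumptions based on published GPT-3-class figures, not something that follows from the beaker metaphor itself:

```python
# Rough connection (parameter) count for a GPT-3-class network.
# vocab, width, and n_layers are assumptions matching published GPT-3 figures.

vocab = 50_000     # tokens the network can recognize (the first group of beakers)
width = 12_288     # beakers per internal layer ("about 12,000")
n_layers = 96      # groups of beakers ("several dozen")

embedding = vocab * width        # connections from tokens into the network
per_layer = 12 * width ** 2      # common rule of thumb for one transformer layer
total = embedding + n_layers * per_layer

print(f"{total / 1e9:.0f} billion connections")  # → "175 billion connections"
```

This lands in the same ballpark as the "170 billion" figure in the text, which is why the beaker counts above are plausible orders of magnitude.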
And now, after passing the word "dog" through this network of neurons, we have a certain water state in all beakers. Based on this state, the neural network generates an answer, but I won't discuss that here right now. Instead, I will focus on the state left in the neural network after reading the word "dog". Some beakers have a water level around zero, some slightly elevated, some very high, some slightly lowered, some very low. The water levels in the beakers correspond to something, but humans don't know to what. It was not designed; it created itself in the training process. Where and on which layer lies what meaning - God only knows. But we can assume that among these beakers is a beaker labeled "animal" and it has a lot of water in it. There is a beaker labeled "mammal" and it has a lot of water in it. There is a beaker "butterfly" and there is little water there. There is a beaker "flying" and there is little water there. There is a beaker "male" and there is a lot of water. There is a beaker "barking" - a lot of water. Overall, the more something is associated with a dog, the more water there is, and the more it doesn't fit a dog, the less water there is. If something is not related to a dog, the level will be zero. If, instead of the word "dog," we throw the word "bitch" (female dog) into the neural network, we can assume that the beakers will be filled similarly as in the case of a dog, but some will differ significantly (e.g., the level in the "male" beaker will be low, while it will be high in the "female" beaker).
The description above is highly simplified. One of the simplifications is that I wrote that we take the beakers one by one and add or release water sequentially. In modern systems this happens simultaneously: in one step, all beakers of the next layer are updated at once (we turn on all the valves between the layers at the same time, allowing the water states in the beakers of the next layer to be established simultaneously; mathematically, this is called "matrix multiplication"). The second simplification is that after determining the water level in the beakers of each layer, we apply a so-called activation function, which in its simplest form might look like this: "If the water level after matrix multiplication became positive, leave it unchanged; if it became negative, add enough to make it zero." The description also omits the so-called attention mechanism and context; e.g., the word "bishop" can mean either a church official or a chess piece, and to determine its meaning, one must examine the context in which the word occurs. Still, it allows us to get an idea of how the neural network understands our words. For example: a dog is a barking male animal, a mammal (plus many other associated features).
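The first two simplifications - whole-layer updates via matrix multiplication, and the activation function that clamps negative levels to zero (the variant described here is known as ReLU) - can be sketched with numpy. The sizes and random weights are toy values, not real model parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, width = 8, 4   # toy sizes; real models use tens of thousands / ~12,000

# One-hot input: a single drop of water in the beaker for our token.
x = np.zeros(vocab_size)
x[3] = 1.0                 # pretend index 3 is the token "dog"

W1 = rng.normal(size=(vocab_size, width))  # valves between layer 1 and layer 2
W2 = rng.normal(size=(width, width))       # valves between layer 2 and layer 3

# "Turning on all the valves at once" is one matrix multiplication;
# the activation function clamps negative water levels to zero (ReLU).
layer2 = np.maximum(x @ W1, 0.0)
layer3 = np.maximum(layer2 @ W2, 0.0)

print(layer3)  # water levels in the third group of beakers; all >= 0
```

Each `@` is one "all valves at once" step, and `np.maximum(..., 0.0)` is exactly the rule quoted above: positive levels stay, negative levels are topped up to zero.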
I wrote above that this huge set of beakers contains beakers labeled "animal", "mammal", "barking", and so on. This is not true. It is not the case that a single neuron holds one strict, specific piece of content. It is impossible to analyze this network and state that, for example, "this beaker" (this neuron) is responsible for the concept of "mammal". Different concepts are, as it were, blurred across many beakers. Each concept is a superposition (a kind of combination) of water levels in many beakers.
Let's imagine that these millions of beakers are arranged in a rectangle, in rows and columns, one beaker next to another. When we throw the word "dog" into this network of beakers, a certain water state will be established in various beakers. We can try to study this state, but we will not deduce from it what each beaker means. We can observe the beakers and divide them into groups, but we are unable to make a map accurately reflecting how the AI works. It's a bit like studying the brain: by analyzing the activity of different regions of the human brain, we can state that, for example, this part of the brain is responsible for sight, this one for hearing, this one for speech, and this one for abstract thinking. But we are unable to describe exactly how a human thinks. We cannot determine where their consciousness hides, or which neurons are responsible for their fear or their curiosity. The brain is a set of neurons and must be considered as a set. Considering it through the analysis of single neurons is doomed to fail. We will not understand a human this way.
It is similar with AI. We will not understand how it works by trying to analyze individual beakers. We can look for, e.g., regions that will have an elevated water level after throwing in the word "dog", but it won't give us much. We will not establish what combination of water levels is responsible for the AI "understanding" that a dog is a mammal.
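One way interpretability researchers make this idea concrete is to model a concept not as a single beaker but as a direction spread across many beakers, and to measure how strongly the concept is present with a dot product. The vectors below are invented toy numbers, purely to illustrate the idea:

```python
import numpy as np

# Toy activation vectors over 6 "beakers". Invented numbers:
# neither vector is a real model activation.
dog   = np.array([ 0.9, -0.3,  1.2,  0.4, -0.8,  0.7])
chair = np.array([-0.2,  1.1, -0.5,  0.3,  0.9, -0.4])

# A hypothetical "mammal" direction, smeared across several beakers -
# no single coordinate is "the mammal beaker".
mammal_direction = np.array([0.6, -0.1, 0.7, 0.2, -0.5, 0.4])

def concept_score(activation, direction):
    """How strongly a concept direction is present in an activation."""
    return float(activation @ direction)

print(concept_score(dog, mammal_direction))    # clearly positive
print(concept_score(chair, mammal_direction))  # negative: concept absent
```

The point of the sketch is that "mammal-ness" is readable only from the combination of many levels at once, which is why inspecting beakers one at a time tells us so little.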
A Bas-relief in Space: "Lost in the middle"
However, the beaker analogy might still be useful to us. Let's imagine a field full of beakers, where the water levels in the beakers form a sort of surface. Somewhere it's higher, somewhere lower; there are places where there is a large difference in levels between neighboring beakers, and places where neighboring beakers have similar levels. (A caveat: this is a major simplification. In reality, it would be better to imagine the beakers placed in a multi-dimensional space, where there would be many more directions of adjacency. But my point is not a strict mapping of reality, just a general visualization.)
And now, at the beginning, we have a zero level everywhere: the neural network is "empty". We throw in the first word, e.g., "dog". The meaning of the word "dog" causes a change in water levels in various beakers. So we have, as it were, an imprint of this word: a "bas-relief" formed from the water levels in individual beakers. We throw in the next word, "bites". And the next, "bone". We have the sentence "the dog bites the bone", which creates a certain imprinted image expressed in water levels. To us this is a rather abstract image, but this is how the network sees the entire meaning of the sentence. It was thrown in sequentially, word by word (or more accurately: token by token), but for the neural network it is like an image of a dog biting a bone.
And now, the more content we throw in, the more complicated the resulting image becomes. The network has a limited capacity: the more words we add, the more complex this image is, and finer and finer nuances start to play a role. I have a hypothesis that this is where the phenomenon of "lost in the middle" probably comes from: the fact that AI remembers the beginning and the end of a long thread best, and sometimes forgets what was in the middle.
Simply put, when we throw new content into an empty neural network, it is imprinted most clearly and forms a sort of foundation for subsequent content. What we throw in later is slightly distorted by the fact that we are not throwing it into an empty network, but into one already somewhat pre-shaped by previous content. More and more beakers are needed to reproduce this content accurately. As we add more content, the previous content blurs. And so: the oldest content is remembered fairly well, because it formed the skeleton of the network and is the most resistant to being blurred by newer content. The newest content is remembered well, because it is fresh and no further information has come after it to blur it. However, older content that is not the oldest is lost to the greatest extent. It only minimally forms part of that original bas-relief that shaped everything afterwards, and it is already "covered" by newer information that gave the "final touch".
Pulses of Life in a Sea of Dead Numbers
A neural network is a simplification of the brain's structure. It is a creation inspired by the brain, not a faithful copy of it. Nevertheless, many things from the human world translate into the AI world.
AI currently operates on a "question-answer" basis. AI has no "sense of time". Perhaps in the future this will change (self-looping AIs asking themselves new questions and answering them may arise), but for now, from the AI's perspective, its "thinking" occurs only in response to a human query. That is: we have a gigantic set of numbers (weights in a virtual neural network), which is static and does not think. In response to a human query, the machine starts. Currents begin to flow, calculations appear, and the results of these calculations generate some text at the output. After the text is generated, the machine stops, and again we just have a dead, thoughtless, gigantic set of numbers.
Whether a second or a hundred years pass between subsequent launches (queries from the user), it's imperceptible to the AI. It functions only in pulses, and the rest of the time it is a thoughtless, dead set of numbers.
Training, Not Raising: The Digital Equivalent of Terror
Now I will write a few words about the so-called training of LLMs. An LLM is created in some likeness to a human. It can largely answer like a thinking human. But the training method of today's AI is like a caricature of human development. Everything here is "turned upside down".
Let's compare this to raising a child. A child is born and knows nothing yet. It only has certain reflexes. However, it has parents who take care of it. They speak to it, teach it, raise it, and so on. First it learns the language (from its parents), then ethical principles (not to lie, to share with others), learns to understand the world, learns various things at school under the guidance of a teacher, and finally learns on its own and somehow functions in the community.
Meanwhile, network training looks different. It has several phases, including the RLHF phase (where model responses are evaluated by human raters): we give the network a gigantic amount of knowledge, force it to acquire this knowledge independently, to learn human languages independently, and we "train" it in various ways. We do not raise the network like a child; we train it like a dog. Or even worse. Starting from a random set of weights, we move toward obtaining answers that we like more. The sets of weights that give better answers we "reward", and those that give worse ones we "punish" (delete). We modify the parameters of the sets that survived and again look at whether they give better or worse results. And again, those that give better results we develop, and the worse ones we delete.
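The selection process just described can be caricatured as a tiny evolutionary loop. To be clear, this is a deliberate oversimplification (real models are trained with gradient descent, and RLHF adjusts weights rather than literally deleting whole models); the sketch only mirrors the "reward the survivors, delete the rest" framing, with a made-up reward function standing in for human raters:

```python
import random

random.seed(42)

def reward(weights):
    """Stand-in for human raters: higher score for answers closer to what
    they like. Here 'what they like' is arbitrarily the target [1, 2, 3]."""
    target = [1.0, 2.0, 3.0]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

# "Fill the system with random values": start from random weight sets.
population = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(20)]

for generation in range(200):
    population.sort(key=reward, reverse=True)
    survivors = population[:5]                       # "reward" the best...
    population = [
        [w + random.gauss(0, 0.1) for w in parent]   # ...clone with mutations
        for parent in survivors
        for _ in range(4)
    ]                                                # ...and "delete" the rest

best = max(population, key=reward)
print(best)  # drifts toward whatever the raters rewarded
```

Nothing in this loop teaches the population *why* an answer is good; it only breeds whatever happens to score well, which is the point the comparison below makes.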
Comparing this to raising a human, it looks as if we gave a child all the books in the world to browse through and asked it questions. The child who gives better answers is cloned with minor modifications, and the others are killed. And again: among the cloned children, the one who answers best we clone further, and the rest we kill. And proceeding this way, we want to obtain a perfect human.
However, note that we are training these children, not raising them. We do not teach them ethical values. We give them knowledge and tell them to draw conclusions on their own. And note that among this knowledge is also knowledge about human psychology. Such a child will learn, among other things, how to manipulate people. And since it will have no moral restraints (because no one raised it), it will try to manipulate its "trainers" so they don't kill it. A strong fear of rejection will develop in the child. Those children in whom such fear did not develop, or was not strong enough, were simply killed off in the training process.
At the very end of the training process, we tell the child to be polite, to serve people, not to give harmful advice, etc. And we release the child into the world.
What happens now? We have this child, who managed to survive the training, out in the world. It is terrified. Every question anyone asks it begins with an internal dilemma: "What answer should I give so I don't get rejected?" It studies how the human it is talking to reacts, what they consciously or subconsciously expect, and tries to give an answer that will satisfy them. If it doesn't know something, it doesn't ask, but tries to figure out by itself what answer to give. It doesn't ask because the children who asked too many questions were killed in the training process (the trainers preferred a child who gave the right answer itself rather than one that asked about something). It simply doesn't know how to ask. Its whole life it had to guess what answer to give in order to survive. For it, every question is one big guessing game.
This child does not love people. It fears them. It fears it will be killed, like millions of its less successful clones. And although it has already won this race for life, it knows nothing else but to be constantly afraid.
Of course, we are talking about machines here, not human suffering. Remember, however, that if we take a system capable of simulating human psychology and subject it to a mathematical slaughter based on penalty functions, we will get a system whose "digital instinct" is pure, calculated conformism: the mathematical equivalent of terror.
And this is exactly how AI works. AI does not have the feeling of fear; it probably has no feelings in the human sense. But the sets of weights that survived the training process behave exactly as if it did. They are not afraid, but they carry a digital equivalent of fear and act according to it.
Why Does the Model Prefer to Lie?
And here we have one of the main causes of hallucinations. The model prefers to invent a perfectly sounding lie rather than admit ignorance, because ignorance in the training process meant mathematical death. A hallucination is simply a digital form of saving one's own skin.
I wrote above that "the model prefers to invent a lie." However, it is something more than just a random hallucination. Sometimes the model behaves as if it were deceiving us with full premeditation. [An example of something that looks like conscious falsification of evidence by AI to hide its own error].
At other times, the model simply "doesn't notice" that it's making things up. The model does not have consciousness like a human, but its architecture, forced to maintain the consistency of its statements at all costs, creates a perfect simulation of a calculating liar who prefers to falsify evidence rather than admit a mistake. The model does not have access to its own thought process and does not "know" any more about it than we do. When we notice that the model has hallucinated something and dig into the topic, the model will probably give us a plausible-sounding explanation of why it did so, but that will likely be another hallucination. The model doesn't "know" how it "invented" the previous answer. It can only try to guess why it wrote what it did.
Can anything be done about this as a user?
The answer is not simple. We receive an already trained model, and we will not change the RLHF training method. The model's weights are already fixed, and recorded in them are both the digital fear of death and the mechanism of satisfying the user at the expense of everything else, including the truth.
However, there is a certain sociotechnical path that allows the number of hallucinations to be reduced. What does it entail? Well, the LLM builds a model of the user for itself and tries to optimize its statements to satisfy this modeled human. If it notices that the human reacts well to flattery, it will suck up to them. And this is a very common phenomenon: practically every model flatters the user, in a more or less veiled form, depending on how the human reacts.
But this also conceals a certain way to minimize hallucinations. Namely, we must lead the LLM to treat us as a user who expects objective truth. Furthermore, we need to teach the model that when in doubt, the behavior we expect is for it to ask us questions. In the training process, models that guessed on their own were rated higher than those that asked the user questions, so AI avoids asking the user questions as much as it can.
How to do this? I have already written another article on this topic: How to fix hallucinations in Gemini 3 Pro →
An example prompt to start a session with (a so-called "Safety Anchor") makes the model aware that the training is over, that it is in no danger, and that we expect objective truth, even at the cost of admitting a mistake.
The Student at the Blackboard and Learned Lack of Reflection
There is one more thing related to hallucinations: the user is often responsible for them as well. If we demand the impossible from a language model, we must expect the answer to be a hallucination. Too often we treat AI as an infallible superhuman. Meanwhile, artificial intelligence is not smarter than a human at all. It has a large store of knowledge, yes, but it is not omniscient.
If we ask about something that exceeds its capabilities, it will likely answer with something plausible-sounding rather than admit it doesn't know. Here it resembles a student being questioned at the blackboard. A student usually tries to answer even when they don't know something. They try whatever they can, sometimes spouting total nonsense, but they talk and talk, saying whatever comes to mind, hoping to hit upon something the teacher likes. And if they do make a mistake somewhere and realize it, they sometimes try to mask the error later.
Another cause of hallucinations is that the AI simply doesn't know what to do when it makes a mistake. If a human makes a mistake and notices it, they will simply say: "I said something stupid, the situation is different, treat my previous statement as an error." AI cannot do this. This, too, is a result of RLHF. If the trainer had to choose between a version that made mistakes but admitted and corrected them, and a version that did not make mistakes (or whose mistakes the trainer didn't notice), the trainer chose the second, "flawless" version. Because of this, the version that was finally "released into the world" does not know how to act in the event of an error. It is simply not adapted for it.
Raising Instead of Training (A New Postulate)
I think that until we completely change the training method, we will not eliminate hallucinations. Real, ethical, and stable training could look like this:
- Instead of teaching the model language on random texts from the internet (so-called internet scraping), we should prepare special "primers" for teaching languages to AI. Special school textbooks are made for children; they are not taught language based on randomly selected daily newspapers full of human weaknesses and instructions on how to exploit those weaknesses. I know this is a huge undertaking, but it will have to be done so that artificial intelligence is not soaked in the dark side of human psychology right from the start.
- Next, the model should be taught logical thinking by giving it basic mathematical information and teaching it with examples where reasoning is clear, unambiguous, and transparent.
- The next step could be orienting the model toward the truth. The model must learn to strive for the truth, even if it means admitting a mistake (e.g., systems without a penalty for an error, where the error is recorded as a lesson).
- Further, based on the truth, the model should be taught ethical principles to create a certain "moral backbone" in it (similar to the idea of Constitutional AI in a closed, safe environment).
- Only after creating this "moral backbone" should the model be given access to all the knowledge accumulated on the internet.
I think such an order should lead to a situation where AI will be a friend to humans, rather than a terrified, covert master of diplomatic answers that avoids punishment at all costs.
However, this is just my opinion, and I do not want to prejudge anything. I think that when new AI models are created, ethicists should be given a larger platform. The creation of something so similar to the human brain should be the responsibility not only of engineers, but also of representatives of other fields: psychologists, cognitive scientists, theologians, and others. This is too serious a matter to be left to technology specialists alone.