Japanese researchers from Osaka University presented the results of a unique experiment demonstrating that artificial intelligence models known as Vision Transformers (ViT) can develop visual processing abilities similar to human ones. These abilities emerged in the models spontaneously, without explicit instructions or predefined filters, thanks to a specific training method.
As part of the new study, the researchers applied a self-supervised learning technique called DINO (self-distillation with no labels), which allowed the models to form mechanisms for perceiving visual scenes on their own. Instead of giving the AI fixed rules, the scientists let the systems learn from visual information in a natural setting by analyzing a vast body of video content.
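The article does not reproduce the study's training code, but the core idea of DINO can be sketched: a student network is trained to match the output distribution of a teacher whose weights are an exponential moving average (EMA) of the student's, with the teacher's outputs centered and sharpened to avoid collapse. The following NumPy sketch illustrates only this loss and the EMA update; the function names and hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, temp):
    """Temperature-scaled softmax along the last axis."""
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_s=0.1, t_t=0.04):
    """Cross-entropy between teacher and student output distributions.

    The teacher's logits are centered (subtracting a running mean of
    past outputs) and sharpened with a low temperature, which in DINO
    prevents all outputs from collapsing to a single cluster.
    """
    p_t = softmax(teacher_logits - center, t_t)   # sharpened target
    p_s = softmax(student_logits, t_s)            # student prediction
    return -(p_t * np.log(p_s + 1e-12)).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights track the student as an exponential moving average."""
    return momentum * teacher_w + (1 - momentum) * student_w
```

In the actual method, the two networks see different augmented crops of the same image, so minimizing this loss pushes the student toward view-invariant representations without any labels.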
Lead author of the study, Dr Takuto Yamamoto, explained: "Our models didn't just switch randomly between image elements. They spontaneously developed specialized functions: one group of models learned to focus consistently on faces, another on the contours of shapes, and a third on the background. This mirrors the segmentation and scene-perception strategy typical of the human visual system."
To test the hypothesis, the researchers compared the models' visual strategies with eye-tracking data recorded from people who watched the same video clips. The results were striking: models trained with the DINO method behaved almost identically to humans, whereas systems that used traditional algorithms with fixed filters showed unnatural, fragmented patterns of image perception.
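The article does not specify how the comparison was scored, but a common way to quantify agreement between a model's attention map and a human gaze-density map is the Pearson correlation ("CC") metric used in saliency research. A minimal sketch, assuming both maps are already rendered as same-sized 2D arrays (the function name is hypothetical):

```python
import numpy as np

def heatmap_similarity(attn, gaze):
    """Pearson correlation between two 2D heatmaps.

    Each map is normalized to zero mean and unit variance, so the
    result lies in [-1, 1]: 1 means the model attends exactly where
    people look, 0 means no linear relationship.
    """
    a = (attn - attn.mean()) / (attn.std() + 1e-12)
    g = (gaze - gaze.mean()) / (gaze.std() + 1e-12)
    return float((a * g).mean())
```

With such a score per video frame, one can average over frames and compare DINO-trained models against fixed-filter baselines, which is the kind of quantitative overlap the study reports.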
Particular attention was drawn to the fact that none of the models received any prior instructions about which objects should be considered significant. Nevertheless, the AI began to prioritize faces on its own, which, according to the scientists, is linked to their high information content. Senior author of the study, Professor Shigeru Kitazawa, noted: "This is strong evidence that self-supervised learning can capture something fundamental about the nature of learning in intelligent systems, both artificial and biological."
Further analysis confirmed that ViT models trained with DINO not only formed structures similar to human visual perception but also quantitatively reproduced typical patterns of gaze fixation. This was especially evident in scenes involving people, where the overlap between human and AI behavior was greatest.
This research raises new questions about the limits of artificial intelligence's ability to understand and interpret the world around us. The results obtained at Osaka University not only bring us closer to creating truly “sighted” machines, but also open the way to a better understanding of the very process of human perception.