Israelis have taken to the use of artificial intelligence like a duck to water. This country is among the global leaders in per capita AI usage, along with Singapore, Australia, the United Arab Emirates, and South Korea. In some measurements, Israel’s usage appeared far above what its population size would predict, indicating unusually intense adoption – but those studies usually measure usage of a specific platform or dataset, not all AI worldwide.

About a third of Israeli adults and nearly 38% of businesses regularly use AI tools. Experts explain that this is due to the very large tech sector relative to population (20% of the economy), a significant share of knowledge workers, the “Start-up Nation” effect, strong STEM (science, technology, engineering, and mathematics) education, and military tech training.

Now, researchers from the computer science department of Bar-Ilan University (BIU) in Ramat Gan and from Israel’s NVIDIA AI research center have developed a new method that significantly improves how artificial intelligence models understand spatial instructions when generating images – without retraining or modifying the models themselves.

“Modern image-generation models can create stunning visuals, but they still struggle with basic spatial understanding,” said BIU computer science expert Prof. Gal Chechik, who also works at NVIDIA. “We study learning in brains and machines. To teach machines to generalize from examples, we develop algorithms to represent complex signals in a meaningful way,” he told The Jerusalem Post in an interview.

“Our method helps models follow spatial instructions more accurately while preserving their general performance.”

THE NEW method that significantly improves how artificial intelligence models understand spatial instructions when generating images. (credit: BAR-ILAN UNIVERSITY)

The lightweight classifiers at the heart of the method pose a surprising challenge: they can take shortcuts by detecting linguistic traces in cross-attention maps rather than learning true spatial patterns. “We solve this by augmenting our training data with samples generated using prompts with incorrect relation words, which encourages the classifier to avoid linguistic shortcuts and learn spatial patterns from the attention,” Chechik explained.
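The augmentation Chechik describes can be sketched in a few lines. The relation vocabulary and function names below are illustrative assumptions, not taken from the paper: the idea is simply to pair each training prompt with copies whose relation word is wrong, so the classifier cannot succeed by reading the relation out of the text.

```python
import random

# Hypothetical relation vocabulary; the paper's exact word list is not specified.
RELATIONS = ["to the left of", "to the right of", "above", "below"]

def make_negative_prompts(prompt: str, relation: str, k: int = 2, seed: int = 0):
    """Create 'relation-swapped' prompts: same objects, but an incorrect
    relation word. Training on such pairs discourages the classifier from
    reading the relation word out of linguistic traces in the attention
    maps instead of learning genuine spatial patterns."""
    rng = random.Random(seed)
    wrong = [r for r in RELATIONS if r != relation]
    picks = rng.sample(wrong, k)
    return [prompt.replace(relation, r) for r in picks]

negatives = make_negative_prompts("a dog to the left of a cat", "to the left of")
```

In the real pipeline these swapped prompts would be rendered through the diffusion model so the classifier sees attention maps whose text no longer matches the spatial layout.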

Details and implications of the study

The study was due to be presented by the developers during the Winter Conference on Applications of Computer Vision (WACV) 2026 that begins on March 6 in Tucson, Arizona, with the title “Data-driven loss functions for inference-time optimization in text-to-image.” However, because of the closure of Israeli airports, this is in doubt, he said. “Maybe another Israeli who is there can present it for us.”

New discoveries in AI are rarely published in peer-reviewed journals; instead, they are reported at conferences devoted to the field. This is because the developments occur so fast, he said. Competition is very strong, and publication in a journal takes a long time.

Although developed for a specific use, Chechik said that the technique developed in his lab could also be used in other fields, such as working with molecules to create new drugs and materials.

Before joining NVIDIA, Chechik was a staff research scientist at Google and a postdoctoral research associate at Stanford University in California. He received his PhD from the Hebrew University of Jerusalem and has published 160 papers, including in Nature Biotechnology, Cell, and PNAS, and holds 50 issued patents.

Spatial instructions use directional, locative, and relational language – such as “on,” “under,” and “between” – to specify where objects should appear relative to one another in a scene.
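The four basic orientations can be checked mechanically from object positions. The toy evaluator below – an illustration, not the paper’s method – classifies the relation between two bounding boxes in standard image coordinates, where the y axis points down:

```python
def spatial_relation(box_a, box_b):
    """Classify the dominant relation of box_a relative to box_b from
    their centers. Boxes are (x_min, y_min, x_max, y_max); the y axis
    points down, as is conventional for images."""
    ax = (box_a[0] + box_a[2]) / 2
    ay = (box_a[1] + box_a[3]) / 2
    bx = (box_b[0] + box_b[2]) / 2
    by = (box_b[1] + box_b[3]) / 2
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):           # horizontal offset dominates
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"

rel = spatial_relation((10, 40, 30, 60), (60, 40, 80, 60))  # first box sits left
```

Benchmarks for spatial generation typically apply a check like this to detected object boxes in the generated image and score whether the requested relation holds.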

Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial, such as placing a dog to the right of a teddy bear rather than to the left – frequently placing objects incorrectly or ignoring spatial relationships altogether. When combinations get more unusual – a giraffe above an airplane – these failures become even more pronounced.

Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal. Rather than imposing human assumptions about how spatial information is encoded, the 10-person team proposed learning these objectives directly from the model’s internal representations.

The research team has introduced a creative solution that allows AI models to follow such instructions more accurately in real time. The “Learn-to-Steer” method learns how spatial relationships are encoded in attention maps to guide generation, correctly renders all four orientations (above/below/left/right), and handles complex scenes with multiple spatial relationships among up to five objects and three relations.

It works by analyzing the internal attention patterns of an image-generation model, effectively offering insight into how the model organizes objects in space. A lightweight classifier then subtly guides the model’s internal processes during image creation, helping it place objects more precisely according to user instructions. The approach can be applied to any existing trained model, eliminating the need for costly retraining.
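The steering loop described above can be sketched in miniature. In the sketch below, a simple quadratic loss stands in for the trained classifier’s score, and a plain gradient step stands in for the nudge applied to the model’s latents during generation; the names, values, and loss are assumptions for illustration, not the paper’s implementation. The essential point it demonstrates is that only the latent is updated – the model’s weights stay frozen.

```python
import numpy as np

def classifier_loss(latent, target):
    """Toy stand-in for the lightweight classifier's loss. The real method
    scores cross-attention maps for the desired spatial relation."""
    return float(np.sum((latent - target) ** 2))

def steer_latent(latent, target, lr=0.1, steps=50):
    """Inference-time steering: repeatedly nudge the latent so the frozen
    classifier's loss decreases. No model weights are modified."""
    latent = latent.copy()
    for _ in range(steps):
        grad = 2 * (latent - target)   # analytic gradient of the toy loss
        latent -= lr * grad            # gradient step on the latent only
    return latent

rng = np.random.default_rng(0)
z = rng.normal(size=4)                 # stand-in for a diffusion latent
z_steered = steer_latent(z, target=np.zeros(4))
```

In the actual method, the gradient would come from backpropagating the classifier’s loss on the model’s cross-attention maps at each denoising step.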

Their findings open new opportunities for improving controllability and reliability in AI-generated visual content, with potential applications in design, education, entertainment, and human-computer interaction.

The company is speedily expanding its footprint in Yokne’am, with plans to make it a core global R&D hub for AI and networking, following its acquisition of Mellanox. Within the next two years, it will add a massive new 29,000-square-meter office tower, expanding its space to 68,000 square meters, alongside a major data center for advanced, next-generation AI chips. NVIDIA is investing over $500 million in the new computing facility.

The existing Yokne’am facility, with 3,000 staffers, already leads in the development of three of the company’s four major product lines: AI processors, CPUs, and networking chips. It is also developing a separate, larger campus in nearby Kiryat Tivon, due to open in 2031.

Chechik explained that the team’s main contributions are showing that the loss function for test-time steering can be learned from data rather than handcrafted; identifying and solving the “relation leakage” problem, enabling effective training on cross-attention maps; demonstrating significant improvements over handcrafted losses on standard benchmarks across four different diffusion models, without any model fine-tuning; handling multiple simultaneous spatial relations in a single image; and introducing an extended evaluation scheme for quantifying multiple-relation generation.

Sapir Yiflach, the study’s lead researcher and co-author alongside Chechik and Dr. Yuval Atzmon, of NVIDIA, summed it up. “Instead of assuming we know how the model should think, we allowed it to teach us. This enabled us to guide its reasoning in real time, essentially reading and steering the model’s thought patterns to produce more accurate results.”