The Israeli start-up Baz ranked first for accuracy on Code Review Bench, a recently launched benchmark for reviewing code written by AI. The ranking places Baz ahead of the world's leading AI labs, including OpenAI, Anthropic, Google and Cursor. The company also ranked second on the weighted index, which combines accuracy and coverage.
Code Review Bench is the first benchmark of its kind, focusing on the quality of reviews of code written by AI. Similar benchmarks, such as the popular SWE-bench, were developed to measure the progress of the latest models on coding tasks, but became less reliable once models were trained to beat them. Companies in the category have run their own internal comparisons, but the market naturally received those results with skepticism. This is the first time an objective comparison has been conducted by an independent body.
Baz was founded at the end of 2023 by entrepreneur Guy Eisenkot (son of former IDF Chief of Staff Gadi Eisenkot) and Nimrod Kor, who served together in Unit 8200 and share a background in the cyber field. Guy was among the founders of Bridgecrew, which was sold to Palo Alto in 2021, two years after its founding, for $200 million. After the sale he served as Vice President of Product Management at Palo Alto, responsible for application security. Nimrod was the third employee at Bridgecrew and later a group manager at Palo Alto. The company's investors include Battery and Boldstart as well as the funds Vermillion, Secret Chord and Fusion.
The new Code Review Bench was developed by researchers who previously worked on advanced models at Google DeepMind, Anthropic and Meta, as part of a research lab based in San Francisco. The lab examines how well machine intelligence is truly understood, operating on the premise that building models through trial and error is not equivalent to understanding them scientifically. For this reason it is now developing benchmarks that measure the real intelligence behind AI-assisted code-writing technologies.
The new ranking will be updated monthly and is based on a combination of controlled measurement and behavioral measurement. In the controlled measurement, each company's review tool is run on the same code changes and compared against a verified problem set. In the behavioral measurement, the researchers analyze how developers actually respond to comments from review tools in open code repositories.
Combining the two approaches is intended to close the gap between theoretical measurement of the agents and their real value on coding tasks. The methodology is continuously updated: it includes a monthly data refresh, controls for the biases of automatic judge models, and constant expansion of the problem set to prevent "locking in" of results or artificial alignment with the benchmark. By anchoring to behavioral metrics and keeping the methodology fully open, it addresses the well-known problem of tools learning to "beat the benchmark" instead of improving in reality.
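The blending of the two measurements into a single weighted score can be sketched as follows; the weights, names and the [0, 1] scaling here are illustrative assumptions, not the published Code Review Bench formula:

```python
# Hypothetical sketch: blending a controlled (benchmark) score with a
# behavioral (developer-response) score into one weighted index.
# The 50/50 default weight is an assumption for illustration only.

def weighted_index(controlled_score: float, behavioral_score: float,
                   w_controlled: float = 0.5) -> float:
    """Blend the two measurements; each score is expected in [0, 1]."""
    if not (0.0 <= controlled_score <= 1.0 and 0.0 <= behavioral_score <= 1.0):
        raise ValueError("scores must be in [0, 1]")
    return w_controlled * controlled_score + (1 - w_controlled) * behavioral_score

# A tool that scores 0.8 on the controlled set but only 0.6 behaviorally
# lands between the two, reflecting both lab performance and field value.
print(weighted_index(0.8, 0.6))
```

The point of the blend is that neither measurement alone suffices: a tool can be tuned to the curated problem set, but behavioral data from real repositories is much harder to game.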
Baz develops artificial intelligence tools for automatic code review that help development teams identify problems in code and suggest fixes according to rules and conventions the team defines. The product addresses the frustration of repetitive manual code reviews, improves code quality and streamlines collaboration within development teams.
“The accuracy index, in which we ranked first, is calculated from the rate of reviews developers actually act on, so it reflects the ratio between definite findings and unnecessary ‘noise’ from alerts in the real world,” says Guy Eisenkot. “In reviewing code written with the assistance of AI, accuracy is a condition for adoption: if the tool produces too much noise, developers stop listening, but if it is consistent and accurate, it becomes a natural part of the workflow. Leading this index strengthens our central assumption that software developers need a tool that prefers quality and a high signal over the quantity of comments.”
“This is the launch of an evolving benchmark. Baz currently has a smaller sample of measured requests than some of the veteran players, so the rankings may change as the volume of data grows. In addition, the accuracy index is based on actual developer actions, which is a strong but not perfect indicator of technical quality. The judging mechanisms and the definition of ‘what counts as a problem’ also improve over time, so the results may be updated as the methodology matures. We see this as a significant indication that we are headed in the right direction, but not an end point, and we will continue to monitor performance as the benchmark develops.”
Beyond the product itself, Baz invests in independent research on measuring the quality of code produced by artificial intelligence, breaking complex changes down into clear topics, and identifying logical failures and interface changes that may break compatibility. Its customers include leading technology companies in Israel and worldwide, among them Israel's leading cyber companies, which work with it on the responsible adoption of artificial intelligence in secure development organizations.