Date: 2024-10-22

The paper also highlights the need for better benchmarks in the AI industry. According to Minhas, current benchmark problems are flawed because models can solve them through pattern matching rather than actual reasoning. "If the benchmarks were based on actual reasoning, or if the reasoning problems were more complex, then all the models would perform terribly," he says.

Minhas says the Apple researchers created this synthetic dataset, a collection of data used to train and test AI models and algorithms, by mixing up the symbols.

"They've proven that these models' performance degrades when you start tweaking and changing things in the input sequence, whether through the symbols themselves or extra context like superfluous tokens," he says.

The Apple study's methodology involved introducing various "fluffs" and clauses to the training set to observe how model performance changed. However, Jess Bozorg, IBM Data Scientist, points out a potential limitation: "They didn't specify how many categories of fluffs they considered in their additions, or what types of fluffs they used from which categories," she says.
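To make the idea of such perturbations concrete, here is a minimal Python sketch of how a word problem might be turned into a template whose names and numbers vary and which optionally carries an irrelevant clause. It is an illustration only, not the Apple researchers' code: the template, the name list, the distractor sentences, and the `make_variant` and `accuracy` helpers are all hypothetical.

```python
import random

# Hypothetical template in the style of a grade-school word problem.
# The names, number ranges, and distractor sentences are illustrative only.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{distractor}How many apples does {name} have in total?"
)

NAMES = ["Sophie", "Liam", "Aisha", "Mateo"]
DISTRACTORS = [
    "",  # no extra clause
    "Five of the apples are slightly smaller than average. ",
    "The orchard is 3 miles from the house. ",
]


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one perturbed question and its ground-truth answer."""
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        a=a,
        b=b,
        distractor=rng.choice(DISTRACTORS),
    )
    return question, a + b  # the distractor never changes the answer


def accuracy(model, n: int = 100, seed: int = 0) -> float:
    """Fraction of variants that a callable `model(question) -> int` gets right."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n)]
    correct = sum(model(question) == answer for question, answer in variants)
    return correct / n
```

A system that actually reasons about the problem should score about the same whichever name, numbers, or distractor is drawn. A large drop in `accuracy` on the variants with a distractor would be the kind of degradation Minhas describes.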

One of the paper's critiques of current LLM benchmarks is the issue of data contamination. Bozorg explains that the Apple study used the GSM-8K dataset, a set that contains grade-school math word problems created by humans. "There's data leakage," she says. "This means that the model had already seen some of this data during the testing stage in their training."

Contamination is a widespread issue in the industry. Minhas says that the GSM-8K dataset "is such an industry benchmark that there are bits and pieces of it all over the training data that all models know about. This is a fundamental problem with all of these created benchmarks."
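One common heuristic for spotting this kind of leakage is to measure how many word n-grams a benchmark item shares with a training document. The Python sketch below shows that idea under simple assumptions (whitespace tokenization, exact matches). It is not a method described in the Apple paper, and the function names `ngrams` and `overlap_ratio` are hypothetical.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of `text` (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the benchmark item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # The item is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A ratio close to 1.0 suggests the benchmark item appears nearly verbatim in the document. Paraphrased or partially copied problems would slip past this exact-match check, which is part of why contamination is hard to rule out.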

Interestingly, the study revealed that GPT-4 performed notably better than other models when tested on the new symbolic dataset. Minhas speculates on the reason: "Is it possible that when training GPT-4, they thought about symbolic representations and generated test data like that? Maybe it's still just doing pattern matching, but it had this data type in its training dataset."

Minhas points out that researchers are trying to move beyond pattern matching by introducing memory into AI systems. "That's one way we're trying to make them more general, but it's still only pattern matching based on what you've given it," he says.

The Apple study has exposed significant limitations in current AI systems, revealing that the journey toward truly intelligent machines is still far from complete. Now, experts say, the AI community faces the challenge of bridging the gap between pattern matching and genuine reasoning.

“The transformer architecture alone isn’t enough for reasoning,” Minhas says. “Advancements in model architecture are needed for reasoning capabilities.”