I don’t know about Gemini specifically, but many LLMs build a response word-by-word by predicting the most likely next word (next-token prediction). They learn by improving the accuracy of those predictions. This can be a good way to generate human-sounding responses, but the training objective may not strongly reward mathematical accuracy unless the model has been trained on the exact question being asked, e.g. “convert 3’10” into cm.”
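As a toy illustration (not how Gemini or any real LLM is actually implemented), that “pick the most likely next word, over and over” loop can be sketched like this; the vocabulary and probabilities below are invented for the example, and a real model learns billions of parameters instead of a small lookup table:

```python
# Toy "language model": for each current word, a made-up probability
# distribution over possible next words. A real LLM learns these
# probabilities from huge amounts of text, but the generation loop
# is the same basic idea.
model = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "answer": 0.2},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.6, "sat": 0.4},
    "answer": {"is": 1.0},
    "is": {"42": 1.0},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
    "42": {"<end>": 1.0},
}

def generate(model, max_tokens=10):
    """Greedy decoding: always take the single most probable next word."""
    word = "<start>"
    out = []
    for _ in range(max_tokens):
        # Pick the continuation with the highest probability.
        nxt = max(model[word], key=model[word].get)
        if nxt == "<end>":
            break
        out.append(nxt)
        word = nxt
    return " ".join(out)

print(generate(model))  # prints "the cat sat"
```

Notice that nothing in the loop checks whether the output is *true*; it only follows the learned probabilities, which is why a fluent-sounding answer to “convert 3’10” into cm” can still be numerically wrong.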
There are other types of AI algorithms/models with better potential for certain kinds of problems, like the ones we are discussing. People are just using LLMs for everything now because they are widely available and already “trained.” I’m sure we will eventually learn when to use LLMs and when not to, but at the moment they’re the only option most of us have.