When is it worth using an LLM — and when is it not?
LLMs solve certain problems better than anything else. Others they solve worse than a three-line classifier. Here is the question to ask before opening the OpenAI API.

The question is not whether LLMs are powerful. They are. The question is whether they are the right tool for your specific problem. The honest answer: it depends on what you need, and often it is no.
When an LLM makes sense. LLMs shine on tasks where input and output are unstructured text and variability is high: summarising long documents without losing nuance, classifying text with ambiguous categories that change over time, extracting fields from heterogeneous documents, answering questions over your own corpus with RAG, generating text variants with controlled tone. In these cases, trying to build a rule-based system or a classical classifier becomes an endless maintenance problem.
When an LLM probably does not. If the problem is text classification with well-defined categories and many labelled examples, a fine-tuned model or even TF-IDF + logistic regression beats GPT-4 on latency, cost and consistency. If you need the response to be deterministic (always the same output for the same input), LLMs are a poor fit: sampling makes outputs vary between calls, and hosted APIs generally do not guarantee identical responses even with the temperature set to 0. If the problem is tabular (predicting churn, default, price) with structured data, gradient boosting wins. If you need full privacy and cannot send data to an external API, the cost of running a local model may not be worth it.
The post-demo evaluation problem. The classic trap with LLMs: you demo the system with five examples that work, show it to the client, everyone is impressed and you deploy it. Three weeks later, 15% of responses turn out to be plausible but wrong. The difference between a demo and a system is evaluation: before putting the system in production you need a test set with expected answers and automatic metrics (faithfulness, retrieval precision if there is RAG, ROUGE for summarisation).
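As a rough illustration of what "a test set with expected answers" can look like in code, here is a minimal sketch. The names `evaluate`, `normalise` and `fake_system` are hypothetical, and normalised exact match is only a stand-in metric: faithfulness or ROUGE would need their own scoring functions.

```python
from typing import Callable, Iterable


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are not counted as errors."""
    return " ".join(text.lower().split())


def evaluate(answer_fn: Callable[[str], str],
             test_set: Iterable[tuple[str, str]]) -> dict:
    """Run answer_fn over every (question, expected) pair and report the error rate."""
    failures = []
    total = 0
    for question, expected in test_set:
        total += 1
        got = answer_fn(question)
        if normalise(got) != normalise(expected):
            failures.append((question, expected, got))
    error_rate = len(failures) / total if total else 0.0
    return {"total": total, "error_rate": error_rate, "failures": failures}


# Hypothetical stand-in for the real system (an LLM call, a RAG pipeline, ...).
def fake_system(question: str) -> str:
    canned = {"capital of france?": "Paris", "2 + 2?": "5"}
    return canned.get(question.lower(), "")


report = evaluate(fake_system, [("Capital of France?", "paris"), ("2 + 2?", "4")])
print(f"{report['error_rate']:.0%} wrong out of {report['total']}")
for question, expected, got in report["failures"]:
    print(f"  {question!r}: expected {expected!r}, got {got!r}")
```

The point of keeping the failures, not just the rate, is that plausible-but-wrong answers only become visible when someone reads them next to the expected answer.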
The real question before choosing an LLM. Can you measure whether the response is correct automatically? Do you have enough examples to evaluate? Is the per-query API cost sustainable at your volume? Can you accept that 2-5% of responses will be wrong? If the answer to any of these is "no" or "I do not know", that is the problem to solve before choosing the architecture.
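The cost and error-rate questions are plain arithmetic, and it helps to run the numbers before committing. The sketch below is only that; the function names are hypothetical, and the volumes, token counts and per-token prices are placeholders to be replaced with your own traffic and your provider's current price list.

```python
def monthly_api_cost(queries_per_day: int,
                     input_tokens: int,
                     output_tokens: int,
                     usd_per_1m_input: float,
                     usd_per_1m_output: float,
                     days: int = 30) -> float:
    """Estimated monthly spend, given average tokens per query and prices per million tokens."""
    per_query = (input_tokens * usd_per_1m_input
                 + output_tokens * usd_per_1m_output) / 1_000_000
    return queries_per_day * days * per_query


def expected_wrong_per_day(queries_per_day: int, error_rate: float) -> float:
    """How many wrong responses per day a given error rate means at this volume."""
    return queries_per_day * error_rate


# Placeholder figures, NOT real prices or measured traffic.
cost = monthly_api_cost(
    queries_per_day=10_000,
    input_tokens=1_500,
    output_tokens=300,
    usd_per_1m_input=1.0,
    usd_per_1m_output=4.0,
)
print(f"Estimated monthly cost: ${cost:,.2f}")

# The 2-5% range from the text, expressed as wrong answers per day.
for rate in (0.02, 0.05):
    wrong = expected_wrong_per_day(10_000, rate)
    print(f"At {rate:.0%} error: about {wrong:.0f} wrong responses per day")
```

Seeing "2-5%" turned into a daily count of wrong answers usually makes the "can you accept it?" question much easier to answer.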
The rule of thumb I use: if you can solve it with a classical classifier and you have labelled examples, start there. Add an LLM only when the classifier falls measurably short, not just intuitively. The LLM you add will need to be evaluated anyway, so evaluation always comes before architecture.
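One way to make "measurably short" concrete is to score both candidates on the same labelled test set and look at where they disagree. This is a minimal sketch under that assumption; `compare_on_test_set` and the toy labels are hypothetical, and a real comparison would need far more examples than five.

```python
def compare_on_test_set(labels, baseline_preds, llm_preds):
    """Compare two systems on the same labelled examples, example by example."""
    if not (len(labels) == len(baseline_preds) == len(llm_preds)):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("the test set is empty")

    n = len(labels)
    rows = list(zip(labels, baseline_preds, llm_preds))
    return {
        "baseline_accuracy": sum(b == y for y, b, _ in rows) / n,
        "llm_accuracy": sum(m == y for y, _, m in rows) / n,
        # Examples only the LLM gets right, and examples only the baseline gets right.
        "llm_only_correct": sum(m == y and b != y for y, b, m in rows),
        "baseline_only_correct": sum(b == y and m != y for y, b, m in rows),
    }


# Toy data for illustration only.
labels   = ["spam", "ham", "spam", "ham", "spam"]
baseline = ["spam", "ham", "ham", "ham", "spam"]
llm      = ["spam", "ham", "spam", "spam", "spam"]

result = compare_on_test_set(labels, baseline, llm)
print(result)
```

In this toy case the two systems make different mistakes but reach the same accuracy, which is exactly the situation where intuition says "the LLM is better" and the measurement does not.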

