Why Prompt Engineering Is Reaching Diminishing Returns

We're hitting a wall with LLMs. The idea that we can just "chat" our way through complex workflows is starting to feel like a fantasy. As these interfaces become more common, I've noticed the friction of crafting the perfect prompt often outweighs the actual value of the output. It's a lot of cognitive overhead for a result that frequently misses the mark.

I saw this clearly last week when I stumbled upon several GitHub repositories that were actively spreading malware. I tried using an AI agent to help me figure out the best way to report and mitigate the spread, but the response was useless. It gave me generic, high-level advice that didn't help me take any real action. It was the same experience I had years ago working as a developer, asking a business owner a direct question about a task and getting a response that completely bypassed the technical reality of the problem.

The tech is getting better at mimicking conversation, but it's still failing at the hard part: actually being useful when the stakes are high. We need to figure out if we're building tools that solve problems, or if we're just building much more expensive ways to ask the same frustrating questions.

The Illusion of Conversation

The "chat" metaphor is a bad way to think about LLMs when you're trying to get actual work done. We've been conditioned to treat these models like people we're messaging on Slack, but that's a mismatch. In a real conversation, context is implicit and much of the meaning lives in the subtext. With an LLM, if you don't explicitly define the constraints, the model will hallucinate a context that doesn't exist. This creates a massive cognitive load for developers because you're not just writing code; you're managing a persona and a sliding window of memory that's constantly shifting.

This gap between human intent and machine instruction is where most prompts fail. You might think you've asked for a "summary," but the model sees a request that lacks a defined format, tone, and length. To bridge this, you have to stop "chatting" and start structuring. The most reliable way to handle this is to treat your prompt like a configuration file rather than a text message.

prompt_template = """
Task: Summarize the provided text.
Constraints:
- Use exactly 3 bullet points.
- Tone: Technical and dry.
- Format: Markdown.

Text: {input_text}
"""

formatted_prompt = prompt_template.format(input_text="The API returns a 404 error when the resource is missing.")
print(formatted_prompt)

Managing this context is genuinely confusing because the model's "memory" is just a fixed number of tokens. As your conversation grows, the earliest instructions—the ones that actually define the rules of the task—often get pushed out of the window. If you're building an agent, you can't rely on the user to "remember" to stay on track. You have to programmatically inject the core instructions into every single turn of the loop.

Moving Beyond the Chatbox

Chat interfaces are great for prototyping, but they're unreliable for production systems. When you rely on a natural language prompt to trigger a specific action, you're essentially gambling on the model's ability to follow instructions perfectly every time. LLMs are probabilistic, meaning they predict the next likely token, not the next correct instruction. This makes them prone to "hallucinating" parameters or ignoring constraints when the context window gets crowded.

Programmatic integration via structured inputs removes this uncertainty. Instead of asking a model to "extract the date and price from this text," you use tools like Pydantic to enforce a schema. This forces the model to output valid JSON that fits your application's requirements. It's a shift from hoping the model understands your intent to providing a deterministic interface that your code can actually parse without crashing.

from pydantic import BaseModel, Field

class InvoiceExtraction(BaseModel):
    vendor_name: str
    total_amount: float = Field(description="The total amount in USD")
    due_date: str = Field(description="ISO 8601 formatted date")

def validate_output(raw_json: dict):
    return InvoiceExtraction(**raw_json)

The real work happens when you move from a single prompt to an API-driven workflow. In this setup, the LLM isn't the entire application; it's just one component in a pipeline. You use the model to transform unstructured text into structured data, then pass that data to traditional, deterministic functions. This approach is much harder to build because you have to manage state and error handling between the model and your backend, but it's the only way to build software that doesn't break every time the model updates.

The Prompting Tax

I think the term "prompting tax" accurately captures the hidden cost of current LLM workflows, but it's often applied too broadly. We tend to talk about it as a latency issue or a token cost issue, but the real friction is cognitive. It’s the mental overhead of having to learn a new, imprecise language just to get a predictable output. When you spend more time engineering the instructions than you would have spent writing the logic yourself, the utility of the model starts to evaporate.

This matters for developers building automated pipelines, but it probably won't affect casual users much yet. If you're just asking a chatbot to summarize an email, an extra five seconds of prompt tweaking is negligible. But if you're trying to build a reliable agentic workflow where the prompt needs to handle edge cases without breaking the downstream parser, that "tax" becomes a massive technical debt. You're essentially building a brittle layer of natural language that is difficult to version control and even harder to test.

I'm not convinced that better models will solve this by simply "understanding" us better. Even with much higher reasoning capabilities, the need for precise constraints remains. We might move away from long, rambling instructions toward more structured schemas, but the underlying problem—the lack of a deterministic interface—isn't going away. I wonder if we'll eventually see a shift where we stop trying to "prompt" altogether and instead focus on building much smaller, highly specialized models that don't require any linguistic hand-holding to stay on task.

Conclusion

We’ve spent the last year treating a text box like a magic oracle, but the friction of constant prompt engineering is starting to outweigh the utility. If we're stuck manually explaining context and formatting requirements every single time we hit enter, we haven't actually automated anything; we've just traded one type of syntax for another.

I'm still not convinced that the chat interface is the right way to build this. We need to stop asking how much better the model is and start asking why the interface requires so much hand-holding. Check your current workflow: if you're spending more time refining the prompt than you are reviewing the output, the tool is failing you.

Search This Blog

Tech Radar