Tuesday, May 20, 2025

CAN LARGE LANGUAGE MODELS HELP DESIGN EXCELLENT USER EXPERIENCES?

INTRODUCTION


Exceptional user experience lies at the very heart of any software product that users choose to adopt and return to. When an application feels natural and responsive, people waste less time wrestling with controls and more time accomplishing their goals. Over the years, engineers and designers have refined methods for eliciting user needs, sketching wireframes, and validating prototypes. Yet these steps often require substantial effort: user interviews must be transcribed and analyzed, textual copy must be reviewed for clarity and tone, and front-end code must be written and iterated.


In parallel with these traditional practices, large language models have emerged as a new source of assistance. They excel at understanding and generating natural language, which suggests they could help with a variety of UX tasks. For example, a model might read a rough user story and propose interface labels that are concise and consistent. It might summarize hundreds of survey responses into coherent themes. It might even generate a code snippet for a search-filter component or draft example JSON for a design token configuration.


This article investigates whether such language models can genuinely advance the art and science of user experience design. It begins by establishing the foundational principles that underpin good UX, then explores the specific capabilities of modern transformer-based models. After examining their promise, it turns to the practical requirements—both software and infrastructure—that must be met to integrate these models into an engineering workflow. Alongside this discussion, annotated code examples will illustrate how a team might call a model to generate UI text or to parse user feedback. Finally, the article addresses current limitations and outlines directions for future exploration. By the end, a software engineer will have a clear picture of both the potential benefits and the essential prerequisites for employing large language models in pursuit of outstanding UX.


FOUNDATIONS OF USER EXPERIENCE DESIGN


User experience design begins with a deep understanding of the people who will interact with a product and the situations in which they will use it. Every decision about screens, controls, labels, and feedback must serve the user’s underlying objectives. A feature that seems clever in isolation may feel cumbersome if it forces the user to learn an unfamiliar pattern or to navigate unexpected workflows. Good design anticipates users’ mental models and reflects familiar conventions from other tools in the same domain, so that the cognitive load of learning a new application is minimized.


Designers organize functionality through an information architecture that groups related tasks under coherent headings and navigational structures. This organization emerges from research into the user’s goals and the language they use to describe those goals. When users think of “orders” and “invoices” as part of the same workflow, it is confusing to scatter those sections across disconnected tabs. Conversely, grouping tasks under labels that align with user expectations allows users to move fluidly from one screen to another without pausing to reinterpret terminology or to hunt for buried settings.


Interaction patterns convey how users perform common actions, such as selecting items, filtering lists, or entering text. Consistency in these patterns reduces errors and speeds up task completion. When a text field validates input as the user types, they receive immediate feedback and can correct mistakes before proceeding. When a drop-down menu appears in the same place across different screens, the user builds muscle memory and can navigate without looking closely. Each pattern reflects a trade-off between flexibility, efficiency, and simplicity, and the designer must choose patterns that best serve the user group’s needs and context of use.


Visual design provides the cues that guide attention and communicate hierarchy. Typography, spacing, color, and iconography work together to show what is most important on a screen. A clear visual contrast between primary and secondary actions draws the user’s eye to the button they most often need. Meaningful use of whitespace helps the user distinguish separate sections without adding distracting borders or labels. Accessible color choices ensure that users with visual impairments can still perceive key information, and text size and line height must adapt to different screen sizes and resolutions.
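

One of these checks can even be automated outright. The snippet below is a minimal sketch implementing the WCAG 2.x relative-luminance and contrast-ratio formulas for sRGB hex colors; normal-size text needs a ratio of at least 4.5:1 to meet level AA.

def relative_luminance(hex_color):
    # WCAG 2.x: linearize each sRGB channel, then weight by perceived brightness.
    channels = [int(hex_color.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4 for c in channels]
    return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2]

def contrast_ratio(foreground, background):
    lighter, darker = sorted([relative_luminance(foreground), relative_luminance(background)], reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

print(round(contrast_ratio("#767676", "#ffffff"), 2))  # ~4.54, just passes AA for normal text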


Underlying all of these concerns is a process of validation and iteration. Designers employ usability testing, either in person or remotely, to observe real users as they try to complete tasks. These sessions reveal unexpected points of friction and misconceptions in the mental model. The findings then feed back into wireframes, mockups, or prototypes for further refinement. Without this cycle of testing, even the most rigorous upfront planning can lead to flawed assumptions that only surface when the product reaches users in real contexts.


CAPABILITIES OF LARGE LANGUAGE MODELS IN UX DESIGN


Large language models bring a rich set of capabilities that align surprisingly well with many of the tasks encountered in user experience design. Their primary strength lies in the ability to understand and generate natural language, and that ability translates into a variety of practical skills—from analyzing feedback, to generating copy, to recommending interface flows. To understand their role in UX, we must examine these skills across concrete tasks where they can act as effective design assistants.


The first and perhaps most immediate application is the generation and refinement of textual copy. Interface text—whether it be labels, tooltips, empty states, or onboarding instructions—plays a crucial role in guiding the user. However, writing good UX copy requires brevity, clarity, tone alignment, and contextual awareness. A language model, when prompted with the function of a feature and the voice of the product, can generate candidate text snippets that match the required tone and format. Moreover, the model can offer alternative phrasings or translations, providing quick iteration loops for multilingual products.


Let us now examine a short code example in Python using the OpenAI API (via the v1 Python client), where the model generates button labels for a search interface. First, we define the task and prompt the model with the intended function.


import openai

# The v1 client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()


def generate_ui_labels(prompt_description, style_hint):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert UX writer."},
            {"role": "user", "content": f"Write concise and user-friendly button labels for this function: {prompt_description}. Use the tone: {style_hint}."}
        ],
        temperature=0.5  # moderate temperature: varied but on-brief label candidates
    )
    return response.choices[0].message.content.strip()


print(generate_ui_labels("Search a product catalog and apply multiple filters", "professional but accessible"))



This small function demonstrates how the model can be integrated into a design tool or developer environment to support consistent UI labeling. The system prompt configures the model as a UX writer, while the user prompt defines the task and tone.


Beyond language generation, another major strength of LLMs is their capacity for summarization and clustering. When user feedback is collected in free-text form—through surveys, support tickets, or app store reviews—the volume often overwhelms manual analysis. Models can parse this input and produce structured insights, such as recurring complaints, suggested features, or misunderstood workflows. When such analysis is performed periodically, designers are able to track whether product changes are improving sentiment or exacerbating frustration.


Here is an example that summarizes user feedback snippets:


feedback_samples = [
    "It’s really hard to find the filter option on mobile.",
    "I couldn't figure out how to save my preferences.",
    "Why does the filter reset every time I switch tabs?",
    "Saving settings should be easier."
]


def summarize_feedback(samples):
    joined_text = "\n".join(samples)
    # Reuses the client defined in the previous snippet.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert UX researcher."},
            {"role": "user", "content": f"Summarize key UX issues from this feedback:\n{joined_text}"}
        ],
        temperature=0.3  # low temperature: favor faithful summarization over invention
    )
    return response.choices[0].message.content.strip()


print(summarize_feedback(feedback_samples))


This script sends a batch of feedback samples to the model and receives a distilled summary of usability problems. Designers can integrate this into dashboards or internal tools to keep a pulse on user pain points.


The third area where language models excel is in ideation and pattern matching. Given a description of a workflow or a business goal, the model can suggest common interface patterns used in similar contexts. For instance, if the goal is to let users book an appointment and receive a confirmation, the model might recommend a calendar picker, a progress bar for steps, and a summary screen before submission. While such suggestions must be evaluated by human designers, they offer a starting point and expose teams to ideas they might otherwise overlook.
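

As a rough sketch of how such ideation support might be wired up—the prompt wording and the JSON schema are illustrative assumptions, and the client object is the one defined in the earlier snippets—a team could ask the model for machine-readable pattern suggestions:

import json

def suggest_patterns(workflow_description):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert interaction designer."},
            {"role": "user", "content": (
                "Suggest common UI patterns for this workflow as a JSON array of "
                f"objects with 'pattern' and 'rationale' fields:\n{workflow_description}"
            )}
        ],
        temperature=0.7
    )
    try:
        return json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        # Models do not always emit valid JSON; fall back to the raw text.
        return response.choices[0].message.content

print(suggest_patterns("Let users book an appointment and receive a confirmation."))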


It is important to note that these capabilities are not flawless. Models may invent plausible but incorrect patterns or use outdated terminology. For this reason, they must be used as supportive assistants rather than authoritative designers. Their utility increases when their output is constrained, reviewed, and supplemented by domain-specific rules or style guides.


TECHNICAL SPECIFICATIONS FOR INTEGRATION


Integrating large language models into user experience design workflows requires more than just calling an API. To use these models reliably, safely, and productively, engineers must establish a technical foundation that addresses model access, performance, latency, context size, privacy, and cost control. The specifics will vary depending on whether a team uses a cloud-hosted API such as OpenAI, a private deployment on a platform like Azure or AWS, or a local model served on-premises using frameworks such as llama.cpp or vLLM.


The first design choice concerns the model itself. If the primary tasks involve short text generation, such as labels or feedback summaries, a medium-sized model such as GPT-3.5 or Mistral 7B may suffice. For more nuanced reasoning—especially when clustering, rephrasing, or interpreting ambiguous user input—a stronger model like GPT-4, Claude Opus, or Qwen-14B-Chat may be necessary. Local deployment gives full control over inference latency, but it requires careful memory planning. For example, running a 13B parameter model at 16-bit precision typically needs around 24 GB of RAM or GPU VRAM, while quantized models can reduce this to under 8 GB with minimal loss in performance.
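

The 24 GB figure is simple arithmetic: parameter count times bytes per parameter. A small helper makes the estimate explicit (weights only; activations, the KV cache, and runtime overhead add more in practice):

def estimate_weight_memory_gb(num_params_billion, bytes_per_param):
    # Weights only: parameters × bytes per parameter, converted to gibibytes.
    return num_params_billion * 1e9 * bytes_per_param / (1024 ** 3)

print(estimate_weight_memory_gb(13, 2))    # 16-bit precision: ~24.2 GB
print(estimate_weight_memory_gb(13, 0.5))  # 4-bit quantization: ~6.1 GB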


Engineers must also consider inference backends. When deploying locally, hardware support for CUDA or Apple’s MPS can dramatically accelerate generation. If those are unavailable, CPU inference is possible but will impose higher latencies. In collaborative design environments where real-time interaction is desired—for instance, a design assistant embedded in Figma or VS Code—these latencies become highly visible and must be optimized through caching, warm-start sessions, or parallel batch handling.
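

A first line of defense against visible latency is to avoid repeat calls entirely. The sketch below is a deliberately naive in-memory cache—no expiry, no persistence—wrapped around the generate_ui_labels function from earlier:

import hashlib

_label_cache = {}

def cached_ui_labels(prompt_description, style_hint):
    # Key on the full request so that different tones do not collide.
    key = hashlib.sha256(f"{prompt_description}|{style_hint}".encode()).hexdigest()
    if key not in _label_cache:
        _label_cache[key] = generate_ui_labels(prompt_description, style_hint)
    return _label_cache[key]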


Next comes prompt construction and template design. Effective prompting is a critical part of the system architecture because even the best models behave erratically when under-specified or inconsistently formatted. Engineers should abstract prompt templates as reusable functions, possibly with Jinja or other template engines, and pass metadata (such as tone, length constraints, and domain terminology) alongside the main task description. This helps enforce consistency and supports testing prompt variations without touching business logic.
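

As a brief illustration of this separation—the template text and metadata fields are examples, not a prescribed schema—a Jinja template can be rendered and unit-tested without ever calling a model:

from jinja2 import Template

UX_COPY_TEMPLATE = Template(
    "Write {{ max_words }}-word-maximum microcopy for: {{ task }}.\n"
    "Tone: {{ tone }}. Preferred domain terms: {{ glossary | join(', ') }}."
)

def build_prompt(task, tone="neutral", max_words=8, glossary=()):
    # Rendering is pure string work, so prompt variants can be tested
    # independently of any business logic or API call.
    return UX_COPY_TEMPLATE.render(task=task, tone=tone, max_words=max_words, glossary=glossary)

print(build_prompt("empty state for the saved-searches panel", tone="friendly", glossary=["saved search", "alert"]))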


Context size plays another important role. Summarizing dozens or hundreds of user comments, for example, requires models that support extended context windows. Modern models now offer windows of 8k to 128k tokens. However, using such models increases cost and memory requirements, so engineers must implement fallback strategies—such as batching feedback and using local TF-IDF filtering to extract representative samples before summarization.


Below is an example of such a fallback filter for large input documents:


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def select_representative_feedback(samples, k=3):
    # Score each comment by its average similarity to all the others;
    # the most central comments stand in for recurring themes.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(samples)
    similarity_matrix = cosine_similarity(tfidf_matrix)
    scores = similarity_matrix.mean(axis=1)
    top_indices = np.argsort(scores)[-k:]
    return [samples[i] for i in top_indices]


feedback = [
    "Too many steps to delete an account.",
    "Can't find the settings easily.",
    "It crashes on login when using VPN.",
    "Why do I have to enter my password twice?",
    "Unclear what the 'merge' button does."
]

selected = select_representative_feedback(feedback)
for entry in selected:
    print("- " + entry)



This example selects the most interconnected feedback samples using TF-IDF cosine similarity before passing them to a language model for summarization, thus reducing token usage while preserving key themes.


Finally, integration must ensure proper handling of data privacy and error recovery. When summarizing or analyzing user feedback, personally identifiable information may be involved. Developers must implement masking, anonymization, or client-side processing for sensitive content. They should also implement retry logic, model switching (e.g., fallback from GPT-4 to GPT-3.5), and cost-limiting circuits that protect both users and budgets in case of degraded API performance or misconfiguration.
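

A minimal sketch of such defensive wiring, again assuming the v1 OpenAI client from the earlier snippets, might chain retries with exponential backoff to a model fallback ladder:

import time

MODEL_LADDER = ["gpt-4", "gpt-3.5-turbo"]  # try the strong model first, then fall back

def robust_completion(messages, max_retries=2):
    for model in MODEL_LADDER:
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(model=model, messages=messages, temperature=0.3)
            except openai.RateLimitError:
                time.sleep(2 ** attempt)  # exponential backoff on throttling
            except openai.APIError:
                break  # this model is misbehaving; move down the ladder
    raise RuntimeError("All models failed; surface a graceful error to the user.")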


ILLUSTRATIVE CODE EXAMPLES


In this section, we turn from architectural theory to concrete implementation. We walk through a complete, working example that demonstrates how to gather user feedback, process it locally, use a large language model to summarize issues, and present the findings in a manner suitable for a product team. Each part is annotated in detail so the reader can follow the design decisions.


Let us begin with a basic Flask-based web service that collects user feedback through an HTML interface. The server will then preprocess this feedback, summarize it using an LLM, and expose the summary on a dashboard. For this example, we use the OpenAI API for simplicity, but it may be replaced by any model with a chat completion interface.


First, app.py sets up the full pipeline.


from flask import Flask, request, render_template, redirect, url_for

from feedback_processor import summarize_feedback

app = Flask(__name__)
feedback_storage = []  # in-memory only; a real system would persist feedback


@app.route("/", methods=["GET", "POST"])
def collect_feedback():
    if request.method == "POST":
        comment = request.form.get("comment", "").strip()
        if comment:
            feedback_storage.append(comment)
        return redirect(url_for("collect_feedback"))
    return render_template("form.html")


@app.route("/summary")
def show_summary():
    if not feedback_storage:
        return "No feedback available yet."
    summary = summarize_feedback(feedback_storage)
    return f"<h2>UX Summary</h2><p>{summary}</p>"


if __name__ == "__main__":
    app.run(debug=True)  # without this guard, `python app.py` would exit immediately



This script does two things. It collects user comments via a form (defined below) and stores them in a local list. Then it defines a second route, /summary, which runs the feedback through a summarization function and renders the result as HTML. In a real system, feedback would be persisted and authenticated, but this example focuses on the LLM flow.


Next, we define feedback_processor.py which contains the logic for summarization. This file contains prompt engineering, filtering, and model calling.


import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_feedback(samples):
    if len(samples) > 10:
        samples = samples[-10:]  # keep only the last 10 entries to stay within token limits
    joined = "\n".join(f"- {s}" for s in samples)
    prompt = (
        "You are an expert in UX research. Summarize the key usability issues, feature requests, "
        "and pain points from the following user feedback:\n\n" + joined
    )
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )
        return response.choices[0].message.content.strip()
    except openai.OpenAIError as e:
        return f"Error summarizing feedback: {e}"



This logic takes the last ten feedback entries, joins them into a readable format, constructs a focused prompt with domain-specific role instructions, and sends it to the model. The result is a distilled, human-readable summary suitable for informing UX planning.


Now, we define the HTML input form templates/form.html:


<!DOCTYPE html>
<html>
<head><title>Feedback Form</title></head>
<body>
  <h2>Submit Your Feedback</h2>
  <form method="POST">
    <textarea name="comment" rows="5" cols="60" placeholder="Tell us what confused you..."></textarea><br>
    <input type="submit" value="Submit Feedback">
  </form>
  <p><a href="/summary">View UX Summary</a></p>
</body>
</html>



This form lets users submit comments through a basic interface. The summary link at the bottom leads to the real-time analysis.


To run this system:

1. Place the files app.py, feedback_processor.py, and the templates/form.html in a directory.

2. Set your OpenAI API key in the OPENAI_API_KEY environment variable; the v1 client reads it automatically.

3. Start the server using python app.py.

4. Navigate to http://localhost:5000 in your browser to begin submitting and summarizing feedback.


This example illustrates how an LLM can function as a UX co-pilot: it synthesizes high-volume, low-structure input into actionable summaries. It complements quantitative telemetry and usability testing by providing human-like pattern recognition at scale.


LIMITATIONS AND FUTURE DIRECTIONS


Although large language models offer substantial value in the realm of UX design, their current capabilities remain bounded by well-known limitations. These limitations are not merely minor technical quirks; they pose serious challenges that must be actively mitigated by thoughtful system design, domain expertise, and human oversight. In this section, we examine the most critical failure modes, discuss their origins, and propose ways forward—both in practice and in research.


First, LLMs often produce confident but incorrect suggestions, especially when dealing with rare UI conventions, brand-specific tone, or domain-specific terminology. A model might propose a cheerful onboarding message for a tax audit tool, or recommend a carousel when accessibility guidelines warn against it. Such hallucinations arise because the model lacks grounding in real data or consequences—it simply infers from statistical patterns. Without constraints such as tone guides, visual design systems, or approval workflows, model-generated content may deviate from user expectations or even contradict inclusive design principles.


Second, current models do not possess persistent memory or user intent modeling across sessions. If you ask the model to rewrite tooltips for a settings page, and then ask it to generate the first-time user experience for that same page, it will not remember the context of the earlier interaction unless you provide it explicitly in the prompt. This statelessness limits its ability to function as a long-term design collaborator unless engineers build scaffolding around the model to maintain state, such as embedding memory graphs, caching embeddings, or storing conversation context in vector databases.
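

A simple version of such scaffolding replays prior exchanges as explicit conversation history. In the sketch below, design_memory is an illustrative in-process stand-in for a real store such as a vector database, and client is the v1 OpenAI client from the earlier examples:

design_memory = []  # illustrative: a real system might persist this per project

def ask_with_memory(question):
    # Replay earlier exchanges so the stateless model sees the project context.
    messages = design_memory + [{"role": "user", "content": question}]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = response.choices[0].message.content
    design_memory.append({"role": "user", "content": question})
    design_memory.append({"role": "assistant", "content": answer})
    return answer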


Third, models are insensitive to visual design and layout unless paired with vision capabilities. A purely text-based LLM cannot infer that a button is misaligned, a dropdown overlaps with a form field, or that there is too much whitespace between elements. Even multimodal models that support images—like GPT-4o, Gemini, or Claude with vision—struggle with precise pixel-based reasoning and currently lack design-system awareness. Thus, UX designers must rely on traditional Figma or Sketch feedback tools for such issues, at least for now.


Fourth, ethical and regulatory constraints are often underrepresented in LLM output. For instance, a model might suggest language that unintentionally violates accessibility standards, omits legally mandated disclosures, or uses biased gendered phrasing. These problems cannot be resolved through prompt tuning alone. Instead, teams must integrate bias audits, human review checkpoints, and possibly post-processing filters that catch problematic constructs before they reach production interfaces.
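

As one deliberately crude illustration of a post-processing filter—the pattern list is a placeholder, not a vetted lexicon—generated copy can be screened before it ever reaches a reviewer:

import re

# Placeholder patterns; a real deployment would use a vetted, locale-aware lexicon.
FLAGGED_PATTERNS = [r"\bguys\b", r"\bhe or she\b", r"\bcrazy\b"]

def flag_problematic_copy(text):
    # Return the patterns that matched so a human reviewer can inspect them.
    return [p for p in FLAGGED_PATTERNS if re.search(p, text, re.IGNORECASE)]

print(flag_problematic_copy("Hey guys, this crazy new feature is live!"))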


Fifth, the cost of using advanced LLMs in continuous design workflows can be non-trivial. Summarizing large batches of user feedback, generating localized text variants, or brainstorming copy in many styles all consume tokens. For cloud-based APIs, this incurs direct cost. For local deployments, it demands substantial compute and memory. Engineers must therefore implement caching, deduplication, and intelligent batching to keep usage sustainable. Hybrid models—where a smaller local model handles basic phrasing while a cloud model refines or reviews—may reduce costs without sacrificing quality.


Looking ahead, there are several promising directions. Integrating models with structured UX ontologies, design systems, or interaction grammars could constrain generation to contextually valid outputs. Memory-augmented prompting, where a model is supplied with past decisions, test results, and style guidance, could foster more coherent multi-step collaboration. Multimodal reasoning may improve as models become better at parsing visual UI sketches or interacting with design tools through APIs. Finally, tool-using agents that dynamically query UX guidelines, perform accessibility checks, and retrieve similar past designs may turn LLMs into powerful assistants with real design literacy.


However, none of these paths remove the need for human judgment. The job of a UX designer is fundamentally about empathy, context, and ethics—areas where machine intelligence still lacks grounding. LLMs may assist, accelerate, and even inspire, but they must not become oracles.



MULTIMODAL LANGUAGE MODELS AND USER EXPERIENCE


While text-based LLMs offer powerful tools for natural language processing, they are inherently limited when it comes to understanding or generating visual content. This is a significant shortcoming in the domain of user experience, where layout, visual hierarchy, iconography, and graphical balance play as critical a role as text. Multimodal language models—also referred to as Vision-Language Models (VLMs)—extend the capabilities of standard LLMs by allowing them to process images alongside text, enabling new classes of UX reasoning.


These models can perform tasks such as interpreting wireframes, providing visual accessibility feedback, generating captions or alt text for UI elements, or even evaluating the visual consistency of a prototype. For instance, a designer might upload a screenshot of a form and ask the model if all labels are aligned, or whether the contrast between the background and foreground elements meets accessibility guidelines. Although these models are not yet perfect, they represent a significant leap toward contextually aware UX assistance.


To use a VLM in practice, one typically sends both an image (such as a JPEG of a Figma mockup) and a corresponding text instruction to the model. The model then reasons jointly over the visual and linguistic inputs. Here is an abstracted Python example using a multimodal model such as GPT-4o:


import openai
import base64

client = openai.OpenAI()


def analyze_ui_image(image_path, prompt_instruction):
    with open(image_path, "rb") as img_file:
        image_bytes = img_file.read()

    # Embed the screenshot as a base64 data URL alongside the text instruction.
    encoded = base64.b64encode(image_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": prompt_instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}}
            ]}
        ]
    )
    return response.choices[0].message.content.strip()



This code assumes the model supports multimodal input. It prepares the image, encodes it, and submits it alongside the instruction. The prompt might be as simple as “Evaluate this screen for visual alignment issues and label clarity.” The result is a textual diagnostic that combines natural language understanding with visual pattern recognition.


Multimodal models can also help automate documentation. By reading a UI and suggesting associated documentation snippets, usage examples, or accessibility notes, they can streamline workflows that typically require both designer and technical writer collaboration.
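

For instance, the analyze_ui_image helper above could be pointed at a screenshot to draft alt text; the file name and prompt here are hypothetical:

# Hypothetical screenshot path; any PNG export of a mockup would do.
alt_text = analyze_ui_image(
    "checkout_form.png",
    "Write concise alt text and a one-paragraph accessibility note for this screen."
)
print(alt_text)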


That said, multimodal UX support has notable constraints. These models do not yet have deep design-system awareness, nor can they simulate pixel-accurate rendering under different screen sizes or browsers. They cannot directly manipulate design files or provide guaranteed WCAG compliance. Still, as they grow more capable, they offer a promising way to bring visual and linguistic reasoning into a single design-assistance tool.


In summary, multimodal LLMs extend the reach of AI-based UX support into domains where visuals are central. While not a substitute for interactive design tools, they can act as reviewers, copilots, or brainstorming partners when evaluating or generating user interfaces that combine text with layout.


CONCLUSION


The rise of large language models introduces a novel opportunity for user experience design—not as a replacement for human designers, but as a tool that augments their work with generative insight, fast iteration, and scalable analysis. These models shine in domains where language and structure intersect. They can draft interface copy, summarize user feedback, suggest design patterns, and respond in real-time to prompts about usability or tone. In this sense, they extend the expressive range of UX professionals and allow engineering teams to prototype and refine designs more efficiently.


Yet the deployment of LLMs in UX workflows demands careful engineering. Prompt templates must be crafted with precision. Context must be preserved across tasks. Results must be filtered, validated, and integrated into existing design systems. The infrastructure must scale, not just in compute, but in ethical oversight, user privacy protection, and cost control. Model suggestions must remain subject to human review, particularly where accessibility, clarity, and regulatory requirements are concerned.


The illustrative examples presented in this article demonstrate how LLMs can be integrated into a feedback summarization loop or a UX writing tool. These prototypes may seem modest, but they embody a broader shift in how design knowledge is externalized and operationalized. No longer limited to static documentation or personal expertise, design principles can be made dynamic and responsive—encoded into systems that learn, adapt, and collaborate.


Still, we must not conflate convenience with correctness. A model that suggests a neat copy variant is not necessarily making a better UX decision. Only humans can understand the emotions, uncertainties, and expectations of real users. Models can assist, but they do not empathize. They can simulate consensus, but not experience friction. They can repeat best practices, but not define new ones.


As LLMs evolve, their role in UX design will likely grow. Multimodal reasoning, memory persistence, and tool integration may transform them from passive responders into active collaborators. In time, the practice of design may become more like conversation—a dialogue not only between team members, but also with intelligent agents that understand design goals and help realize them.


Until then, the best results will come not from replacing designers with models, but from pairing them—so that one proposes, the other critiques, and both improve.
