How to build an LLM workflow with Promptflow and OpenAI (with evaluation and tracing)

This guide shows how to build a traceable, evaluable LLM workflow with Promptflow, Prompty, and OpenAI.

The tutorial builds a complete, production-style LLM workflow with Promptflow in a Colab environment. It starts by installing a file-based keyring backend, which avoids operating-system keyring dependency errors, and by registering an OpenAI connection.

Next, a clean workspace is set up and a Prompty file is defined as the core LLM component of the pipeline. On top of it, a class-based flow combines deterministic preprocessing with LLM reasoning, so that pre-calculated 'hints' can be injected into the model's prompt.

Tracing records the details of each execution step. The workflow is then tested with both single and batch queries, and the results are printed in a structured format. Finally, the pipeline is extended with an evaluation flow in which an LLM acts as a 'judge' and scores each response against the expected answer.

Setting up the OpenAI environment and connection.

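# Use the file-based keyring backend to avoid OS keyring dependency errors (e.g. on Colab).
# Note: PlaintextKeyring stores secrets unencrypted, so use it only in throwaway environments.
!pip install -q keyrings.alt
import keyring
from keyrings.alt.file import PlaintextKeyring
keyring.set_keyring(PlaintextKeyring())

# Requires promptflow to be installed and OPENAI_API_KEY to be set (see the full setup cell).
import os
from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection

pf = PFClient()
CONN = "open_ai_connection"
try:
    # Reuse the connection if it already exists.
    pf.connections.get(name=CONN)
    print(f"Using existing connection '{CONN}'")
except Exception:
    # Otherwise create it from the API key in the environment.
    pf.connections.create_or_update(
        OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"])
    )
    print(f"Created connection '{CONN}'")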

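The process begins by installing a fallback keyring backend to avoid environment dependency errors, especially in Colab. This backend keeps credentials in an unencrypted file, which is acceptable for a short-lived notebook but not for a shared machine. The Promptflow client is then initialized and checks whether a connection to OpenAI already exists.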

If the connection does not exist, it is created with the API key read from the OPENAI_API_KEY environment variable. Because the connection is stored under a fixed name, later runs reuse it instead of creating a new one.
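The connection code reads the key with os.environ["OPENAI_API_KEY"], which raises a bare KeyError when the variable is unset. The following is a minimal sketch of a friendlier lookup; the helper name require_env is hypothetical and is not part of Promptflow.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or raise a descriptive error.

    Hypothetical helper for this guide; not a Promptflow API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "export it or enter it interactively before creating the connection."
        )
    return value


# Demo with a placeholder variable so that no real secret is needed.
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print(require_env("DEMO_API_KEY"))
```

In the connection code, api_key=require_env("OPENAI_API_KEY") could then stand in for the direct os.environ lookup.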

Next, the required Promptflow packages are installed and the project's working directory is created. If the OpenAI API key is not already in the environment, it is requested through a hidden prompt (getpass). The client and the connection are then initialized again from the new working directory.

!pip install -q "promptflow>=1.13.0" "promptflow-tracing" "promptflow-tools" openai

import os, sys, json, getpass, textwrap, importlib
from pathlib import Path

# Ask for the key with a hidden prompt only if it is not already set.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key: ")

# Create the working directory and make the modules written into it importable.
WORK_DIR = Path("/content/pf_demo")
WORK_DIR.mkdir(exist_ok=True, parents=True)
os.chdir(WORK_DIR)
sys.path.insert(0, str(WORK_DIR))

from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection
from promptflow.tracing import start_trace

pf = PFClient()
CONN = "open_ai_connection"
try:
    pf.connections.get(name=CONN)
    print(f"Using existing connection '{CONN}'")
except Exception:
    pf.connections.create_or_update(
        OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"])
    )
    print(f"Created connection '{CONN}'")
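After the install step, it can help to confirm which package versions actually ended up in the environment. The sketch below uses only the standard library; the helper name installed_versions is an assumption made for illustration, and the package names are the ones from the pip command above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip command in the setup cell.
report = installed_versions(
    ["promptflow", "promptflow-tracing", "promptflow-tools", "openai"]
)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```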

Develop prompts and processing flows.

# Prompty file: model configuration, declared inputs, and the prompt template.
(WORK_DIR / "researcher.prompty").write_text("""---
name: Researcher
description: Concise research assistant.
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0.2
    max_tokens: 350
inputs:
  question: {type: string}
  hint: {type: string, default: ""}
sample:
  question: "What is the speed of light in vacuum?"
  hint: ""
---
system:
You are a precise research assistant. Answer in 1-3 sentences.
If a `hint` is given, weave it in.

user:
Q: {{question}}
{% if hint %}Hint: {{hint}}{% endif %}
""")

# Class-based flex flow: deterministic preprocessing plus the Prompty call.
(WORK_DIR / "flow.py").write_text(textwrap.dedent('''
    from pathlib import Path
    from promptflow.tracing import trace
    from promptflow.core import Prompty

    BASE = Path(__file__).parent

    @trace
    def safe_calc(expression: str) -> str:
        """A tiny deterministic 'tool' the assistant can lean on."""
        # Only digits, arithmetic operators, parentheses, dots and spaces are allowed.
        if not set(expression) <= set("0123456789+-*/(). "):
            return "unsafe"
        try:
            return str(eval(expression))
        except Exception as e:
            return f"error:{e}"

    class ResearchAssistant:
        """Class-based flex flow. __init__ args become flow init parameters."""
        def __init__(self, model: str = "gpt-4o-mini"):
            self.model = model
            self.llm = Prompty.load(source=BASE / "researcher.prompty")

        @trace
        def __call__(self, question: str) -> dict:
            hint = ""
            if "*" in question or "+" in question:
                # Keep numbers and standalone operators so "21 * 19" becomes "21*19".
                # (Keeping only the digit tokens would drop the operator and give "2119".)
                tokens = [t for t in question.replace("?", "").split()
                          if any(c.isdigit() for c in t) or t in ("+", "-", "*", "/")]
                expr = "".join(tokens)
                if expr:
                    hint = f"computed: {expr} = {safe_calc(expr)}"
            answer = self.llm(question=question, hint=hint)
            return {"question": question, "answer": str(answer).strip(), "hint_used": hint}
'''))

# Flow definition: the entry points at the class in flow.py.
(WORK_DIR / "flow.flex.yaml").write_text(
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
    "entry: flow:ResearchAssistant\n"
)

The Prompty file bundles the model configuration (gpt-4o-mini, temperature 0.2, at most 350 tokens), the declared inputs, and the prompt template that makes the LLM act as a concise research assistant.

The class-based flow then combines a deterministic calculation tool with the LLM call. When a question contains + or *, the flow evaluates the arithmetic in code and passes the result to the model as a hint. This gives a 'hybrid' form of reasoning in which part of the logic is handled by code and the rest by the LLM.
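The deterministic half of this hybrid design can be sketched without any LLM call. The version below swaps eval() for a small whitelist over Python's ast module, which is a different technique from the tutorial's safe_calc: the character filter used there still admits expressions such as 9**9**9**9, which eval() would try to compute. The names safe_calc_ast and build_hint are hypothetical. Operator tokens are kept alongside the numbers so that the extracted expression is 21*19, not 2119.

```python
import ast
import operator

# Only these operators are evaluated; calls, names and ** are rejected.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_OPERATOR_TOKENS = {"+", "-", "*", "/"}


def _evaluate(node):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_calc_ast(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic without eval(); return 'error:...' on failure."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(_evaluate(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError,
            OverflowError, RecursionError) as exc:
        return f"error:{exc}"


def build_hint(question: str) -> str:
    """Extract a simple arithmetic expression from the question and compute it."""
    if "*" not in question and "+" not in question:
        return ""
    tokens = [t for t in question.replace("?", "").split()
              if any(c.isdigit() for c in t) or t in _OPERATOR_TOKENS]
    expr = "".join(tokens)
    return f"computed: {expr} = {safe_calc_ast(expr)}" if expr else ""


print(build_hint("What is 21 * 19 ?"))  # computed: 21*19 = 399
```

The hint string produced here is what the flow would pass to the Prompty template's hint input; questions without + or * yield an empty hint and go to the model unchanged.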

The flow is registered through a YAML file, flow.flex.yaml, whose entry field points to flow:ResearchAssistant (the module name, then the class name). This lets Promptflow load and execute the class directly.
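An entry of the form module:Class can be read as "import this module, then take this attribute". The sketch below illustrates that idea with importlib; it is not Promptflow's actual loader, and resolve_entry is a hypothetical name. A standard-library module stands in for flow.py so the example runs on its own.

```python
import importlib


def resolve_entry(entry: str):
    """Split 'module:attribute' and return the imported object."""
    module_name, sep, attr = entry.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {entry!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Standard-library stand-in so the sketch runs without the flow files.
decoder_cls = resolve_entry("json:JSONDecoder")
print(decoder_cls.__name__)
```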

Run a test and monitor the workflow.

# Tracing: continue even if the trace UI cannot start in this environment.
try:
    start_trace()
except Exception as e:
    print("trace ui unavailable on Colab - traces still recorded:", e)

# Reload the module so edits to flow.py are picked up on re-run.
import flow as _flow
importlib.reload(_flow)
agent = _flow.ResearchAssistant(model="gpt-4o-mini")

print("\n=== Single call ===")
print(json.dumps(agent(question="In one sentence, what is photosynthesis?"), indent=2))
print(json.dumps(agent(question="What is 21 * 19 ?"), indent=2))

# Batch dataset: one JSON object per line.
data = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Chemical symbol for gold?", "expected": "Au"},
    {"question": "Who wrote the play Hamlet?", "expected": "Shakespeare"},
    {"question": "What is 12 * 11 ?", "expected": "132"},
    {"question": "Boiling point of water at sea level (C)?", "expected": "100"},
    {"question": "Largest planet in our solar system?", "expected": "Jupiter"},
]
data_path = WORK_DIR / "data.jsonl"
data_path.write_text("\n".join(json.dumps(r) for r in data))

print("\n=== Batch run ===")
base_run = pf.run(
    flow=str(WORK_DIR / "flow.flex.yaml"),
    data=str(data_path),
    column_mapping={"question": "${data.question}"},
    stream=True,
)
print(pf.get_details(base_run))

Tracing is enabled with start_trace() so that each execution step is recorded; if the trace UI cannot start on Colab, the code prints a message and carries on. The research assistant flow is then instantiated and called with two single queries, one in natural language and one arithmetic, to check that both paths work.
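To make concrete what a trace records, here is a minimal sketch of a decorator that captures the name, inputs, output, and duration of each call. It only illustrates the idea behind a tracing decorator and is not how promptflow.tracing is implemented; record_span and SPANS are hypothetical names.

```python
import functools
import time

SPANS = []  # in-memory span log; a hypothetical stand-in for a trace store


def record_span(func):
    """Record name, inputs, output (or error) and duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        span = {"name": func.__name__, "inputs": {"args": args, "kwargs": kwargs}}
        try:
            result = func(*args, **kwargs)
            span["output"] = result
            return result
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call leaves a span.
            span["duration_s"] = time.perf_counter() - started
            SPANS.append(span)
    return wrapper


@record_span
def add(a, b):
    return a + b


add(2, 3)
print(SPANS[-1]["name"], SPANS[-1]["output"])
```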

Once the single calls work, a six-row dataset is written to data.jsonl for batch testing. pf.run executes the flow once per row, and pf.get_details returns the inputs and outputs as a table, ready for the evaluation step.
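The batch input is a JSON Lines file: one JSON object per line, each with the keys that the column mapping refers to. The following sketch writes and reads such a file using only the standard library; write_jsonl and read_jsonl are hypothetical helper names, and a temporary directory is used in place of the working directory.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    text = "\n".join(json.dumps(row) for row in rows) + "\n"
    path.write_text(text, encoding="utf-8")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


rows = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 12 * 11 ?", "expected": "132"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    write_jsonl(path, rows)
    loaded = read_jsonl(path)

print(len(loaded), "rows round-tripped")
```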

# Judge prompt: replies as JSON with a 0/1 score and a reason.
(WORK_DIR / "judge.prompty").write_text("""---
name: Judge
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0
    max_tokens: 150
    response_format: {type: json_object}
inputs:
  question: {type: string}
  answer: {type: string}
  expected: {type: string}
---
system:
You are an exacting grader. Decide whether the assistant's answer contains the expected fact (case-insensitive, allowing reasonable phrasing/synonyms).
Reply ONLY as JSON: {"score": 0 or 1, "reason": "."}.

user:
Question: {{question}}
Expected: {{expected}}
Answer: {{answer}}
""")

# Evaluation flow: one judged row per call, plus run-level aggregation.
(WORK_DIR / "eval_flow.py").write_text(textwrap.dedent('''
    import json
    from pathlib import Path
    from promptflow.tracing import trace
    from promptflow.core import Prompty

    BASE = Path(__file__).parent

    class Evaluator:
        def __init__(self):
            self.judge = Prompty.load(source=BASE / "judge.prompty")

        @trace
        def __call__(self, question: str, answer: str, expected: str) -> dict:
            raw = self.judge(question=question, answer=answer, expected=expected)
            # The judge may return a JSON string; fall back to score 0 if it cannot be parsed.
            if isinstance(raw, str):
                try:
                    raw = json.loads(raw)
                except Exception:
                    raw = {"score": 0, "reason": f"unparseable:{raw[:80]}"}
            return {"score": int(raw.get("score", 0)), "reason": str(raw.get("reason", ""))}

        def __aggregate__(self, line_results):
            """Run-level aggregation.
            Whatever this returns shows up in pf.get_metrics()."""
            scores = [r["score"] for r in line_results if r]
            return {
                "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
                "passed": sum(scores),
                "total": len(scores),
            }
'''))

(WORK_DIR / "eval.flex.yaml").write_text(
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
    "entry: eval_flow:Evaluator\n"
)

print("\n=== Evaluation run ===")
eval_run = pf.run(
    flow=str(WORK_DIR / "eval.flex.yaml"),
    data=str(data_path),
    run=base_run,  # link to the batch run so its outputs can be mapped
    column_mapping={
        "question": "${data.question}",
        "expected": "${data.expected}",
        "answer": "${run.outputs.answer}",
    },
    stream=True,
)
eval_details = pf.get_details(eval_run)
print(eval_details)

print("\n=== Aggregated metrics (from __aggregate__) ===")
print(json.dumps(pf.get_metrics(eval_run), indent=2))

# Fallback: recompute accuracy from the per-row scores.
import pandas as pd
if "outputs.score" in eval_details.columns:
    s = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
    print(f"Manual accuracy: {s.mean():.2%} ({int(s.sum())}/{len(s)})")

A second Prompty file acts as a 'judge': it compares the model's answer with the expected answer and replies as structured JSON containing a score of 0 or 1 and a reason.
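Even with a JSON response format, it is safer to treat the judge's reply as untrusted input. The sketch below normalizes a reply to a 0/1 score and a reason, falling back to a score of 0 whenever the reply cannot be interpreted. The function name parse_verdict is hypothetical, and the stricter handling of non-numeric scores is an assumption that goes slightly beyond the evaluator code in this guide.

```python
import json


def parse_verdict(raw):
    """Normalize a judge reply (dict or JSON string) to {'score': 0|1, 'reason': str}."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return {"score": 0, "reason": f"unparseable:{raw[:80]}"}
    if not isinstance(raw, dict):
        return {"score": 0, "reason": "unexpected reply type"}
    try:
        # Anything other than an explicit 1 counts as a fail.
        score = 1 if int(raw.get("score", 0)) == 1 else 0
    except (TypeError, ValueError):
        score = 0
    return {"score": score, "reason": str(raw.get("reason", ""))}


print(parse_verdict('{"score": 1, "reason": "mentions Paris"}'))
print(parse_verdict("not json at all"))
```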

Next, the Evaluator class wraps the judge, parses its reply, and returns a score and a reason for each row. Its __aggregate__ method then combines the per-row results into run-level metrics: accuracy, passed, and total.
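The aggregation step is plain arithmetic over the per-row results, so it can be checked in isolation. The sketch below mirrors the logic of __aggregate__ as a standalone function (the name aggregate and the sample rows are made up for illustration). Note that empty results are skipped, so rows without a result do not count toward the total.

```python
def aggregate(line_results):
    """Combine per-row judge results into run-level metrics."""
    # Skip rows that produced no result (None or empty dict).
    scores = [r["score"] for r in line_results if r]
    return {
        "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
        "passed": sum(scores),
        "total": len(scores),
    }


# Made-up sample: three passes, one fail, one row with no result.
sample = [
    {"score": 1, "reason": "matches"},
    {"score": 0, "reason": "wrong city"},
    None,
    {"score": 1, "reason": "matches"},
    {"score": 1, "reason": "matches"},
]
print(aggregate(sample))  # {'accuracy': 0.75, 'passed': 3, 'total': 4}
```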

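The evaluation run does not execute in parallel with the main flow. It is linked to the completed batch run through run=base_run, so each row pairs the dataset's question and expected value with the answer that the batch run produced. Accuracy is read from pf.get_metrics, and a manual calculation over the outputs.score column serves as a fallback check.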
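The column mapping is what joins the two sources: ${data.*} refers to a row of the dataset and ${run.outputs.*} to the matching output of the linked run. The sketch below emulates that resolution for a single row. It illustrates the idea only and is not Promptflow's resolver; resolve_mapping and the sample answer are hypothetical.

```python
import re

# Matches "${data.<key>}" or "${run.outputs.<key>}".
_REF = re.compile(r"^\$\{(data|run\.outputs)\.(\w+)\}$")


def resolve_mapping(mapping, data_row, run_outputs):
    """Build the evaluator's inputs for one row from the two sources."""
    resolved = {}
    for name, expr in mapping.items():
        match = _REF.match(expr)
        if match is None:
            resolved[name] = expr  # treat anything else as a literal value
            continue
        source, key = match.groups()
        row = data_row if source == "data" else run_outputs
        resolved[name] = row[key]
    return resolved


mapping = {
    "question": "${data.question}",
    "expected": "${data.expected}",
    "answer": "${run.outputs.answer}",
}
data_row = {"question": "What is 12 * 11 ?", "expected": "132"}
run_outputs = {"answer": "12 * 11 = 132."}  # made-up output of the batch run

print(resolve_mapping(mapping, data_row, run_outputs))
```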

Conclusion

The workflow built in this guide goes beyond sending a prompt and reading the response. It is a structured pipeline whose parts can be inspected, rerun, and measured.

Combining a deterministic tool, a structured Prompty file, and a reusable class-based flow keeps the logic transparent and easy to change. Batch execution and the evaluation flow add a feedback loop: each change to the prompt or the code can be scored for accuracy on the same dataset and examined row by row.

Tracing and run-level aggregation, in turn, make debugging and monitoring more efficient. Together these pieces show one way to build an end-to-end LLM application on a foundation of structure, evaluation, and reproducibility.
