How to build an LLM workflow with Promptflow and OpenAI (with evaluation and tracing)

This guide walks through building a traceable, evaluable LLM workflow using Promptflow, Prompty, and OpenAI.

This tutorial builds a complete, production-style LLM workflow with Promptflow in a Colab environment. It starts by setting up a file-based keyring backend, which avoids operating-system keyring dependencies, and by configuring a connection to OpenAI.

Next, a clean workspace is set up and a Prompty file is defined to serve as the core LLM component of the pipeline. On this foundation, a class-based flow combines deterministic preprocessing with LLM reasoning, so that pre-computed 'hints' can be injected into the prompt the model sees.

Tracing records the details of each execution step. The workflow is then tested with single and batch queries, and the results are returned in a structured format. Finally, the pipeline is extended with an evaluation flow in which an LLM acts as a 'judge' and scores each response against the expected answer.

Setting up the OpenAI environment and connection.

!pip install -q keyrings.alt

# Use a file-based keyring so Promptflow can store connection secrets
# without relying on an OS keyring service.
import keyring
from keyrings.alt.file import PlaintextKeyring
keyring.set_keyring(PlaintextKeyring())

# Note: this cell needs promptflow installed and OPENAI_API_KEY set;
# the next cell handles both.
import os
from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection

pf = PFClient()
CONN = "open_ai_connection"
try:
    pf.connections.get(name=CONN)
    print(f"Using existing connection '{CONN}'")
except Exception:
    pf.connections.create_or_update(
        OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"])
    )
    print(f"Created connection '{CONN}'")

The process begins by installing a fallback keyring backend to avoid environment dependency errors, especially in Colab. The Promptflow client is then initialized, and it checks whether a connection to OpenAI already exists.

If it does not, a new connection is created using the API key from the OPENAI_API_KEY environment variable, so later steps can reuse it by name.

Next, the required Promptflow packages are installed and the project's working directory is created. If the OpenAI API key is not already set, it is read with a hidden prompt rather than hard-coded, and the connection is then looked up or created again so that every component is ready.

!pip install -q "promptflow>=1.13.0" "promptflow-tracing" "promptflow-tools" openai

import os, sys, json, getpass, textwrap, importlib
from pathlib import Path

# Prompt for the key only when it is not already in the environment.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key: ")

# Work in a dedicated directory and make it importable.
WORK_DIR = Path("/content/pf_demo")
WORK_DIR.mkdir(exist_ok=True, parents=True)
os.chdir(WORK_DIR)
sys.path.insert(0, str(WORK_DIR))

from promptflow.client import PFClient
from promptflow.connections import OpenAIConnection
from promptflow.tracing import start_trace

pf = PFClient()
CONN = "open_ai_connection"
try:
    pf.connections.get(name=CONN)
    print(f"Using existing connection '{CONN}'")
except Exception:
    pf.connections.create_or_update(
        OpenAIConnection(name=CONN, api_key=os.environ["OPENAI_API_KEY"])
    )
    print(f"Created connection '{CONN}'")
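The key-handling step can also be factored into a small helper, which makes it testable without typing a secret. The sketch below is an illustration only: resolve_api_key and its parameters are hypothetical names, not part of Promptflow or the OpenAI SDK.

```python
import getpass
import os
from typing import Callable, MutableMapping


def resolve_api_key(
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass.getpass,
    var: str = "OPENAI_API_KEY",
) -> str:
    """Return the key from `env`; if missing, prompt once and cache it in `env`."""
    key = env.get(var, "").strip()
    if not key:
        key = prompt(f"Paste your {var}: ").strip()
        if not key:
            raise ValueError(f"{var} is empty")
        env[var] = key
    return key
```

Passing the environment and the prompt function as arguments keeps the helper free of hidden global state, so a notebook can call resolve_api_key() with the defaults while a test supplies a plain dict.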

Develop prompts and processing flows.

(WORK_DIR / "researcher.prompty").write_text("""---
name: Researcher
description: Concise research assistant.
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0.2
    max_tokens: 350
inputs:
  question: {type: string}
  hint: {type: string, default: ""}
sample:
  question: "What is the speed of light in vacuum?"
  hint: ""
---
system:
You are a precise research assistant. Answer in 1-3 sentences.
If a `hint` is given, weave it in.

user:
Q: {{question}}
{% if hint %}Hint: {{hint}}{% endif %}
""")

(WORK_DIR / "flow.py").write_text(textwrap.dedent('''
from pathlib import Path
from promptflow.tracing import trace
from promptflow.core import Prompty

BASE = Path(__file__).parent

@trace
def safe_calc(expression: str) -> str:
    """A tiny deterministic 'tool' the assistant can lean on."""
    # Only digits, basic operators, parentheses, dots, and spaces are allowed.
    if not set(expression) <= set("0123456789+-*/(). "):
        return "unsafe"
    try:
        return str(eval(expression))
    except Exception as e:
        return f"error:{e}"

class ResearchAssistant:
    """Class-based flex flow. __init__ args become flow init parameters."""
    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model
        self.llm = Prompty.load(source=BASE / "researcher.prompty")

    @trace
    def __call__(self, question: str) -> dict:
        hint = ""
        if "*" in question or "+" in question:
            # Keep numbers and standalone operators so "21 * 19" becomes "21*19".
            tokens = [
                t for t in question.replace("?", "").split()
                if any(c.isdigit() for c in t) or t in ("+", "-", "*", "/")
            ]
            expr = "".join(tokens)
            if expr:
                hint = f"computed: {expr} = {safe_calc(expr)}"
        answer = self.llm(question=question, hint=hint)
        return {"question": question, "answer": str(answer).strip(), "hint_used": hint}
'''))

(WORK_DIR / "flow.flex.yaml").write_text(
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
    "entry: flow:ResearchAssistant\n"
)

A Prompty file defines how the LLM behaves: a concise research assistant with a fixed model configuration, typed inputs, and a short system prompt.

Subsequently, a class-based flow is created, combining deterministic computation tools and calls to the LLM. This design allows the system to perform 'hybrid' reasoning, where part of the logic is handled by code, and the rest by the LLM.
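The deterministic half of this hybrid can avoid eval altogether. The character whitelist in safe_calc still admits the power operator (two consecutive asterisks), so an expression such as 9**9**9 would pass the check and be very expensive to evaluate. A minimal sketch, assuming only the four basic operators and parentheses are needed; safe_arith is a hypothetical alternative, not the tutorial's implementation.

```python
import ast
import operator
from typing import Union

_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_arith(expression: str) -> Union[int, float]:
    """Evaluate +, -, *, / and parentheses by walking the AST; reject anything else."""

    def _eval(node: ast.AST) -> Union[int, float]:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain int/float literals only (no bools, strings, or complex).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_arith("21*19"))
```

Because only whitelisted node types are evaluated, function calls, attribute access, and exponentiation all raise ValueError instead of running.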

This flow is registered via a YAML configuration file, allowing it to be executed directly within the Promptflow ecosystem.
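The entry value flow:ResearchAssistant follows the common module:attribute convention. How Promptflow resolves it internally is not covered in this guide; the sketch below only illustrates the convention with the standard library, and resolve_entry is a hypothetical helper.

```python
import importlib
from typing import Any


def resolve_entry(entry: str) -> Any:
    """Import `module` and return `attribute` from a 'module:attribute' string."""
    module_name, sep, attr = entry.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {entry!r}")
    return getattr(importlib.import_module(module_name), attr)


# In the tutorial's working directory the analogous call would be
# resolve_entry("flow:ResearchAssistant"); a stdlib target is used here
# so the example runs anywhere.
path_cls = resolve_entry("pathlib:Path")
print(path_cls("pf_demo") / "flow.py")
```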

Run a test and monitor the workflow.

try:
    start_trace()
except Exception as e:
    print("trace UI unavailable on Colab; traces still recorded:", e)

# Reload so edits to flow.py are picked up when the cell is re-run.
import flow as _flow; importlib.reload(_flow)
agent = _flow.ResearchAssistant(model="gpt-4o-mini")

print("\n=== Single call ===")
print(json.dumps(agent(question="In one sentence, what is photosynthesis?"), indent=2))
print(json.dumps(agent(question="What is 21 * 19 ?"), indent=2))

data = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Chemical symbol for gold?", "expected": "Au"},
    {"question": "Who wrote the play Hamlet?", "expected": "Shakespeare"},
    {"question": "What is 12 * 11 ?", "expected": "132"},
    {"question": "Boiling point of water at sea level (C)?", "expected": "100"},
    {"question": "Largest planet in our solar system?", "expected": "Jupiter"},
]
data_path = WORK_DIR / "data.jsonl"
data_path.write_text("\n".join(json.dumps(r) for r in data))

print("\n=== Batch run ===")
base_run = pf.run(
    flow=str(WORK_DIR / "flow.flex.yaml"),
    data=str(data_path),
    column_mapping={"question": "${data.question}"},
    stream=True,
)
print(pf.get_details(base_run))

Tracing is enabled to record the entire execution process. The research assistant flow is then initialized and tested with individual queries to ensure the system handles both natural language and mathematical operations correctly.
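Conceptually, a tracing decorator wraps a function and records what went in, what came out, and how long it took. The sketch below is a toy stand-in written for illustration; it is not how promptflow.tracing.trace is implemented, and simple_trace and SPANS are hypothetical names.

```python
import functools
import time
from typing import Any, Callable, Dict, List

SPANS: List[Dict[str, Any]] = []  # hypothetical in-memory span store


def simple_trace(fn: Callable) -> Callable:
    """Record name, inputs, output or error, and duration for each call."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span: Dict[str, Any] = {"name": fn.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            span["output"] = fn(*args, **kwargs)
            return span["output"]
        except Exception as exc:
            span["error"] = repr(exc)
            raise
        finally:
            # Runs on both success and failure, so every call leaves a span.
            span["seconds"] = time.perf_counter() - start
            SPANS.append(span)

    return wrapper


@simple_trace
def add(a: int, b: int) -> int:
    return a + b


add(2, b=3)
print(SPANS[-1]["name"], SPANS[-1]["output"])  # prints: add 5
```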

After confirming stable operation, a dataset is prepared for batch testing. Promptflow processes this batch and returns the results as structured data, ready for the next evaluation step.
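A batch run pairs each JSONL row with the flow's inputs through column_mapping. The sketch below mimics that substitution for ${data.<column>} references only, to make the semantics concrete. apply_mapping is a hypothetical helper, passing other values through as literals is this sketch's own choice, and Promptflow's real resolver covers more cases (such as the ${run.outputs.answer} reference used in the evaluation run).

```python
import json
import re
from typing import Any, Dict

_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")


def apply_mapping(row: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Build flow inputs from one data row; non-reference values pass through."""
    inputs: Dict[str, Any] = {}
    for name, ref in mapping.items():
        match = _DATA_REF.match(ref)
        inputs[name] = row[match.group(1)] if match else ref
    return inputs


jsonl = "\n".join([
    '{"question": "What is 12 * 11 ?", "expected": "132"}',
    '{"question": "Chemical symbol for gold?", "expected": "Au"}',
])
rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
for row in rows:
    print(apply_mapping(row, {"question": "${data.question}"}))
```

Only the mapped columns reach the flow, which is why the expected column stays out of the assistant's inputs and is used only later by the evaluator.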

(WORK_DIR / "judge.prompty").write_text("""---
name: Judge
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0
    max_tokens: 150
    response_format: {type: json_object}
inputs:
  question: {type: string}
  answer: {type: string}
  expected: {type: string}
---
system:
You are an exacting grader. Decide whether the assistant's answer contains the expected fact
(case-insensitive, allowing reasonable phrasing/synonyms).
Reply ONLY as JSON: {"score": 0 or 1, "reason": "."}.

user:
Question: {{question}}
Expected: {{expected}}
Answer: {{answer}}
""")

(WORK_DIR / "eval_flow.py").write_text(textwrap.dedent('''
import json
from pathlib import Path
from promptflow.tracing import trace
from promptflow.core import Prompty

BASE = Path(__file__).parent

class Evaluator:
    def __init__(self):
        self.judge = Prompty.load(source=BASE / "judge.prompty")

    @trace
    def __call__(self, question: str, answer: str, expected: str) -> dict:
        raw = self.judge(question=question, answer=answer, expected=expected)
        if isinstance(raw, str):
            try:
                raw = json.loads(raw)
            except Exception:
                raw = {"score": 0, "reason": f"unparseable:{raw[:80]}"}
        return {"score": int(raw.get("score", 0)), "reason": str(raw.get("reason", ""))}

    def __aggregate__(self, line_results):
        """Run-level aggregation. Whatever this returns shows up in pf.get_metrics()."""
        scores = [r["score"] for r in line_results if r]
        return {
            "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
            "passed": sum(scores),
            "total": len(scores),
        }
'''))

(WORK_DIR / "eval.flex.yaml").write_text(
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
    "entry: eval_flow:Evaluator\n"
)

print("\n=== Evaluation run ===")
eval_run = pf.run(
    flow=str(WORK_DIR / "eval.flex.yaml"),
    data=str(data_path),
    run=base_run,
    column_mapping={
        "question": "${data.question}",
        "expected": "${data.expected}",
        "answer": "${run.outputs.answer}",
    },
    stream=True,
)
eval_details = pf.get_details(eval_run)
print(eval_details)

print("\n=== Aggregated metrics (from __aggregate__) ===")
print(json.dumps(pf.get_metrics(eval_run), indent=2))

# Fallback: recompute accuracy from the per-row detail table.
import pandas as pd
if "outputs.score" in eval_details.columns:
    s = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
    print(f"Manual accuracy: {s.mean():.2%} ({int(s.sum())}/{len(s)})")

Another prompt is created to act as a 'judge,' responsible for evaluating the model's output against the expected answer. The results are returned as structured JSON.
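Even when a JSON response format is requested, it is worth normalizing the judge's reply defensively. A minimal sketch, assuming the reply is either a dict or a JSON string with score and reason keys; parse_verdict is a hypothetical helper that mirrors the evaluator's fallback logic and adds a type check.

```python
import json
from typing import Any, Dict


def parse_verdict(raw: Any) -> Dict[str, Any]:
    """Coerce a judge reply into {'score': 0 or 1, 'reason': str}; bad input scores 0."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return {"score": 0, "reason": f"unparseable:{raw[:80]}"}
    if not isinstance(raw, dict):
        return {"score": 0, "reason": f"unexpected type:{type(raw).__name__}"}
    try:
        # Anything other than an exact 1 is treated as a failure.
        score = 1 if int(raw.get("score", 0)) == 1 else 0
    except (TypeError, ValueError):
        score = 0
    return {"score": score, "reason": str(raw.get("reason", ""))}


print(parse_verdict('{"score": 1, "reason": "matches"}'))
print(parse_verdict("not json at all"))
```

Scoring malformed replies as 0 keeps the aggregate conservative: a parsing problem lowers accuracy visibly instead of being silently dropped.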

Next, an evaluator class is implemented to parse the judge's verdict and return a score and a reason for each row. Its __aggregate__ method then combines the per-row results into run-level metrics: accuracy, passed, and total.

The evaluation run is linked to the previous run through run=base_run, so it scores the outputs of the main pipeline rather than calling the assistant again. Accuracy is read from Promptflow's metrics, with a manual calculation over the detail table as a fallback.
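A cheap, LLM-free cross-check is to test whether each answer contains the expected string. This is cruder than the judge, since it misses synonyms and paraphrases, so it is better treated as a sanity check than a replacement. The records below are invented for illustration and are not results from a real run; contains_expected and manual_accuracy are hypothetical helpers.

```python
from typing import Dict, Iterable


def contains_expected(answer: str, expected: str) -> int:
    """Return 1 if `expected` occurs in `answer`, ignoring case; otherwise 0."""
    return int(expected.casefold() in answer.casefold())


def manual_accuracy(records: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Aggregate substring matches into the same shape as the evaluator's metrics."""
    scores = [contains_expected(r["answer"], r["expected"]) for r in records]
    return {
        "accuracy": sum(scores) / len(scores) if scores else 0.0,
        "passed": sum(scores),
        "total": len(scores),
    }


# Illustrative records only (invented for this example).
records = [
    {"answer": "The capital of France is Paris.", "expected": "paris"},
    {"answer": "Gold's symbol is Au.", "expected": "Au"},
    {"answer": "It was written by Marlowe.", "expected": "Shakespeare"},
]
print(manual_accuracy(records))
```

Short expected values such as "Au" can also match inside unrelated words, which is one reason a large gap between this number and the judge's accuracy deserves a look at the individual rows.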

Conclusion

The workflow built in this guide goes beyond sending prompts and receiving responses. It is a complete, well-structured LLM system that can be scaled and controlled.

Combining deterministic processing tools, structured prompts, and reusable flows makes the system more transparent and flexible. Adding batch execution and evaluation pipelines creates a clear feedback loop, enabling performance measurement based on accuracy and detailed analysis.

Tracing and aggregation, in turn, make debugging, monitoring, and improvement more efficient. The result illustrates how to build end-to-end LLM applications on a foundation of structure, evaluation, and reproducibility.
