This post covers various sessions from the two-day virtual event “Packt Nexus 2025”.

Keynote Session: Evolution of LLMs from Foundations to Agents

Context

Setting the stage by exploring the evolution of AI, from large language models to agentic systems and beyond.

Speaker

The speaker heads AIOps for the EMEA region at IBM, where he leads the firm’s strategic deployment of AI-powered operations and IT infrastructure services across Europe, the Middle East, and Africa. With a background spanning major technology and consulting roles, Horn helps enterprise clients move from experimental AI pilots to scalable, production-grade systems, emphasizing strong data governance, hybrid cloud architectures, and operational resilience.

Learnings

  • 2025 is the year of AI agents?
  • There is much more than a commercial motivation for agents
  • From Perception to Physical Autonomy
  • Perception AI
    • Speech recognition
    • Deep recsys
    • Medical Imaging
  • Generative AI
    • Digital marketing
    • Content creation
  • Agentic AI: This is the real disruption. With agentic AI, LLMs get hands and feet
    • Coding assistant
    • Customer service
    • Patient care
  • Physical AI
    • Self-driving cars, general robotics
  • For the last three years, everyone has been using LLMs
  • From 2023, there are Compound AI Systems
    • RAG
  • From 2024, there are Agentic Systems
  • IBM: In the last two years, 4000 RAG pilots for companies. 80 percent of POCs are in RAG systems.
    • RAG is not satisfying for many use cases
    • With RAG, there are limitations.
  • Core limitations of LLMs
    • No direct access to external systems
    • No memory between sessions
    • Restricted context window
    • Stateless execution: lacks built-in flow control for multi-step tasks
    • Gemini offers a 1 million token context window
  • How Agents address LLM limitations
    • Tool Integration
    • External memory and RAG setups
    • Sub-context splitting
    • Agent frameworks like watsonx Orchestrate, LangChain, LangGraph, and AutoGen manage multi-step workflows and tool execution
  • With AI agents, LLMs have hands and feet: planning, decision-making, and execution
    • The real power lies in planning, decision-making, and execution
  • A Chinese startup created an agent that runs ~300 reasoning steps
  • Basic agentic use cases are coming up
  • A German bank has 15 agentic use cases within the company
    • Do I need an AI agent ?
  • It is not the year of agents; it is more the year of agentic exploration, where enterprises move beyond prototypes and START

Summary

Introduction

This keynote frames the current AI moment as a turning point: we are moving from systems that predict text to systems that can act, from models that respond to agents that reason, plan, and execute tasks.

The speaker structures the evolution of AI into a multi-year arc:

  • Perception AI → Generative AI → Agentic AI → Physical AI (highlighted visually on page 4 of the deck).

He argues that although 2025 is being marketed as “the year of AI agents,” the truth is more nuanced: 2025 is not the year of full adoption — it is the year enterprises START experimenting seriously with agents.

This summary explains:

  1. How AI evolved to this point
  2. What limitations of LLMs make agents necessary
  3. How agents overcome those limitations
  4. Why enterprises are experimenting, but not yet fully adopting
  5. How deterministic workflows and agentic autonomy will blend
  6. What a multi-agent enterprise architecture looks like
  7. Why 2025 is the “year of exploration,” not mass deployment

1. Industry Context: “2025 Is the Year of AI Agents?” (Slide p.3)

The keynote opens with the cultural narrative: Tech leaders like Jensen Huang, Sam Altman, Bill Gates, and Satya Nadella have loudly claimed that 2025 is the year of AI agents.

The slide on page 3 shows headlines and quotes stating that agents will:

  • Transform workflows
  • Become the new computing interface
  • Bring the next revolution after touch interfaces
  • Become the primary way humans interact with computers

The speaker’s view, however, is more grounded:

The industry wants agents now. Technically and operationally, we are just beginning.

This sets the tone: The hype is ahead of the actual deployment curve.

2. Four-Stage Evolution of AI: From Perception to Physical Autonomy

The speaker presents a simple but powerful four-stage evolutionary arc (page 4):

(1) Perception AI: Systems that see, hear, and classify. Includes:

  • Speech recognition
  • Recommender systems
  • Medical imaging
  • Traditional deep learning models

These systems don’t generate or act — they perceive.

(2) Generative AI: Models that create content in text, image, audio, code. Examples:

  • Content creation
  • Digital marketing
  • Creative tools

This is the phase we’ve lived through for the last 3 years.

(3) Agentic AI: The real disruption. Generative models get hands and feet:

  • Coding assistants
  • Customer service agents
  • Patient-care copilots

Agents can:

  • Understand goals
  • Plan multi-step tasks
  • Use tools
  • Execute decisions

This is where enterprises are currently experimenting.

(4) Physical AI: Robots and systems that act in the physical world:

  • Self-driving cars
  • Warehouse robotics
  • Collaborative factory robots

The keynote makes it clear: Agentic AI is the bridge between Generative and Physical AI. Agents will be the control layer that eventually directs physical systems.

3. The Enterprise Automation Curve (Slide p.5)

The deck includes a historical timeline of automation technologies (page 5):

  • Rule-based workflows (2010s)
  • RPA (2015)
  • ML + NLP (2012 onward)
  • Workflow automation (2019)
  • IPA (Intelligent Process Automation, 2022)
  • GenAI assistants (2022–2023)
  • Agentic AI orchestration (2024–2025)

Each step increases:

  • Flexibility
  • Autonomy
  • Business value

Agentic AI represents the peak of this curve: A system that blends RPA’s reliability, NLP’s understanding, and GenAI’s generative capabilities — now with planning and execution.

4. From LLMs → Compound AI → Agentic Systems (Slide p.6)

The speaker breaks down three “eras” (page 6):

Era 1: LLMs (2022)

  • Single model
  • Single turn
  • Predictive
  • Limited action

Era 2: Compound AI Systems (2023)

  • RAG
  • Tools
  • System prompts
  • Structured workflows
  • Still mostly deterministic

Era 3: Agentic Systems (2024 onward)

  • Planning
  • Execution
  • Tool orchestration
  • Reasoning loops
  • Memory + state
  • Multi-step goals

Your notes also capture the same thing: 2023 was the year of RAG pilots and compound systems. 2024 introduced the first real agentic patterns.

5. IBM’s View From the Field: 4000 RAG Pilots in 2 Years

A major insight from the keynote:

IBM has run over 4000 RAG pilots with clients in the last two years. 80% of all POCs are RAG-based.

But the conclusion from those pilots:

  • RAG is useful
  • RAG is not enough
  • RAG alone does not solve “real tasks”
  • RAG is not satisfying for many use cases

RAG answers questions. Enterprises need systems that complete tasks.

Hence: RAG → Agents is the natural transition.

6. Why LLMs Alone Are Not Enough (Slide p.7)

The keynote highlights four key limitations of LLMs (page 7):

LLM Limitation #1: No direct access to external systems. They can’t:

  • Query APIs
  • Fetch real-time weather
  • Pull enterprise DB entries
  • Use tools like calculators or CRM
  • Access files on SharePoint

LLM Limitation #2: No memory. Every chat starts from scratch. There is no continuity and no cumulative learning.

LLM Limitation #3: Restricted context window. Even 1 million tokens (Gemini’s claim) is limited when:

  • You have full codebases
  • Large wikis
  • Multi-year policy documents

LLM Limitation #4: Stateless execution. LLMs:

  • Don’t remember past steps
  • Have no flow control
  • Struggle with multi-step tasks
  • Cannot manage dependencies

This is the real architectural gap: LLMs are powerful brains, but terrible project managers.

Thus: LLMs can generate steps. Agents execute steps.

7. How Agents Solve These Limitations (Slide p.7)

The second half of the slide (page 7) shows how agents fill each gap:

Tool Integration: Agents can:

  • Call APIs
  • Run scripts
  • Execute workflows
  • Interact with enterprise systems

External Memory / RAG: Not just retrieval — agents maintain:

  • User profile
  • Context state
  • Long-running task memory
  • Task history

Sub-context Splitting: Agents chunk large documents, process them sequentially, and build summaries.
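To make the idea concrete, here is a minimal sketch of sub-context splitting, assuming a hypothetical call_llm() helper and a character-based chunk size standing in for a real token budget:

```python
# Minimal sketch of sub-context splitting: chunk a long document,
# summarize each chunk, then merge the partial summaries.
# call_llm() is a hypothetical stand-in for any chat-completion call.

def call_llm(prompt: str) -> str:
    # Placeholder: in a real system this would call an LLM API.
    return f"[summary of {len(prompt)} chars]"

def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    # Naive fixed-size chunking; production systems usually split on
    # paragraph or section boundaries and add overlap between chunks.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_large_document(text: str) -> str:
    partial_summaries = [
        call_llm(f"Summarize this section:\n\n{chunk}")
        for chunk in split_into_chunks(text)
    ]
    # Second pass: merge the partial summaries into one final summary.
    return call_llm("Combine these section summaries into one overview:\n\n"
                    + "\n".join(partial_summaries))

print(summarize_large_document("some very long policy document " * 2000))
```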

Agent Frameworks: Systems like:

  • Watsonx Orchestrate
  • LangChain
  • LangGraph
  • Autogen

give:

  • Multi-step flow control
  • Tool execution
  • Retries
  • Error handling
  • Observability

This is the hands and feet of agentic AI: LLMs generate thoughts. Agents interpret, monitor, execute, and coordinate.

8. The Evolution of AI Agents (Slide p.8)

This slide neatly visualizes three stages of evolution (page 8):

Stage 1: Zero-shot/few-shot prompting

  • You send a query.
  • Model returns a response.
  • No tools, no memory, no planning.

Stage 2: Fixed LLM flows (e.g., RAG)

  • Add vector search
  • Add context integration
  • Still linear
  • Still preset instructions

Stage 3: LLM Agent

  • LLM + Memory
  • Planning + decision-making
  • Execution
  • Tools, web access, applications
  • Multi-step workflows
  • Higher autonomy

This slide is one of the clearest representations of the keynote’s narrative: Agents = LLM + memory + tools + reasoning + execution loop
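As a rough illustration of that formula, here is a minimal sketch of such an execution loop wiring together a stubbed planner, two toy tools, and a simple external memory; llm_plan(), get_weather(), and search_crm() are hypothetical placeholders, not any particular framework's API:

```python
# Minimal sketch of the "LLM + memory + tools + execution loop" idea.
# llm_plan() and the tool functions are hypothetical placeholders.

def get_weather(city: str) -> str:
    return f"18°C and cloudy in {city}"            # stand-in for a real API call

def search_crm(name: str) -> str:
    return f"CRM record for {name}: premium tier"  # stand-in for a real lookup

TOOLS = {"get_weather": get_weather, "search_crm": search_crm}

def llm_plan(goal: str, memory: list[str]) -> dict:
    # Placeholder "planner": a real agent would ask the LLM which tool
    # to call next (or whether to finish), given the goal and memory.
    if not memory:
        return {"tool": "search_crm", "arg": "Acme Corp"}
    return {"tool": None, "answer": f"Done. Notes: {memory}"}

def run_agent(goal: str, max_steps: int = 5) -> str:
    memory: list[str] = []                         # external memory across steps
    for _ in range(max_steps):                     # execution loop with flow control
        step = llm_plan(goal, memory)
        if step["tool"] is None:                   # planner decides we are finished
            return step["answer"]
        result = TOOLS[step["tool"]](step["arg"])  # tool execution
        memory.append(f"{step['tool']} -> {result}")
    return "Stopped: step budget exhausted."

print(run_agent("Prepare a briefing on Acme Corp"))
```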

9. Degrees of Autonomy (Slide p.9)

One of the most important enterprise insights: Most organizations will use hybrid autonomy.

The slide (page 9) shows a spectrum from:

  • Fully deterministic tasks
  • Partially agentic tasks
  • Fully agentic processes

Deterministic

  • No agentic behavior
  • Predefined workflows
  • Good for compliance-heavy processes

Agentic Data Mapping

  • Agents handle mappings
  • Use templates and verified fields
  • Low risk

Agentic Tasks

  • Some autonomy
  • Agents route tasks
  • Do exception handling
  • Make optimized decisions

Full autonomy

  • Entire business processes run as agents
  • The agent chooses tools, sequences, and execution paths

The speaker emphasizes:

  • Enterprises will dial autonomy up/down like a volume knob
  • Very few will jump to “fully autonomous” quickly

10. The Complexity of Building Agents (Slide p.10)

The iceberg diagram on page 10 shows the hidden complexity behind agents:

  • Vector DBs
  • Tooling layers
  • Memory stores
  • Frameworks
  • Tracing and observability
  • Execution sandboxing
  • API gateways
  • Policy enforcement
  • Data governance

The visible part (the “cool agent”) is tiny. The invisible infrastructure is massive. This is why most organizations are still experimenting, not deploying at scale.

11. The Agent Market Explosion (Slide p.11)

The slide on page 11 shows dozens of agent startups across categories:

  • Coding agents
  • Sales agents
  • Customer support agents
  • Verticalized agents
  • Research agents
  • Marketing agents

The speaker’s message:

  • The market is saturated with prototypes
  • Very few are production-ready
  • Enterprises need standardization and reliability

12. Tool Fragmentation Problem (Slide p.12)

This slide (page 12) states:

Tools are small, purpose-built, and fragmented.

Problems:

  • Tightly coupled to specific frameworks
  • Inconsistent naming conventions
  • No shared schema
  • No standard prompting format
  • Tools run locally instead of as reusable cloud services
  • No shared discovery or marketplaces

Result:

  • Redundant implementations
  • High friction
  • Poor interoperability

Enterprises need:

  • Tool registries
  • Shared standards
  • Unified discovery layers

13. A Unified Layer for Enterprise Agents (Slide p.13)

This slide shows IBM’s conceptual architecture: A unified layer across:

  • Domain-specific orchestrators
  • Hyperscalers
  • Open source agents
  • RPA vendors
  • Custom enterprise agents
  • Horizontal integrations (chat, voice, events)

This is the “middleware layer” that allows:

  • Reuse
  • Security
  • Governance
  • Observability
  • Interoperability

In short: Enterprise agents cannot be built as isolated apps. They need an enterprise-wide orchestration layer.

14. Multi-Agent Enterprise Architecture (Slides p.14–15)

The example scenario (pages 14–15) shows a customer onboarding system involving:

Agents:

  • Onboarding agent
  • Company profiling agent
  • Document processing agent
  • Legal advisor agent
  • Customer verification agent

Inputs:

  • Chat
  • Voice
  • Slack
  • APIs
  • Documents
  • Policies

Tools:

  • Search
  • Data lookup
  • Document extraction
  • Risk scoring
  • Workflow automation

Orchestration:

  • Tools exposed as services
  • Agents exposed as tools
  • Multi-agent collaboration
  • Event triggers
  • Dialog flows

The architecture represents a realistic enterprise-grade multi-agent system:

  • Each agent is specialized
  • Agents communicate
  • Tools are shared
  • Policies and guidelines constrain behavior

This is not science fiction — it is the blueprint for how enterprises will deploy agents over the next 2–5 years.

15. Is 2025 the Year of AI Agents? (Slide p.16)

The final slide states clearly:

2025 is not the year of agents. 2025 is the year of agentic exploration.

Meaning:

  • Pilots begin
  • Architectures form
  • Enterprise patterns mature
  • Governance is defined
  • First production systems appear
  • Organizations build confidence

But:

  • Fully autonomous agents
  • Multi-agent ecosystems
  • Company-wide adoption

are still ahead.

16. Summary of Your Notes Integrated With the Slides

To blend your notes with the slides:

  • You noted that “2025 is the year of AI agents?” The keynote confirms the hype but argues it’s exploratory.

  • You captured the four eras: Perception → GenAI → Agentic → Physical. Matches slide p.4.

  • You noted compound AI (RAG) dominating 2023. Slide p.6 and IBM’s “4000 pilots” confirm this.

  • You listed LLM limitations (no memory, no external access). Slide p.7 covers the same in detail.

  • You noted how agents solve these limitations. Exactly matched by slides p.7 and p.8.

  • You noted a German bank using 15 agentic use cases. This reinforces the exploratory-phase narrative.

  • You captured the idea that agents give AI “hands and feet”. This is a core metaphor in the keynote.

  • You noted the “300 reasoning steps” Chinese startup. Part of the growing capability trend.

  • You noted that the year is really about STARTING. Slide p.16 makes the same point.

Your notes align extremely well with the storyline in the slides:

  • The hype
  • The constraints
  • The enterprise reality
  • The slow path toward autonomy

Final Takeaway

This keynote is not about selling the idea that agents are ready today. It is a sober explanation of where the technology truly stands — and why 2025 is the year enterprises begin serious exploration of agentic systems.

The core message:

LLMs gave enterprises intelligence. RAG gave them grounded knowledge. Agents will give them autonomy.

But autonomy comes slowly. Enterprises will:

  • Start with deterministic workflows
  • Add partial autonomy
  • Introduce tool-enabled reasoning
  • Adopt multi-agent architectures
  • And eventually converge on hybrid systems mixing predictability with intelligence

2025 is the year organizations begin the journey, not the year they fully arrive.

Tech Talk: Reasoning in LLMs

Context

This session will explore the core concepts of reasoning and decision-making in large language models. We’ll dive into practical techniques like ReAct-style feedback loops, where models are enabled to reason, seek external assistance, and then retry tasks. The discussion will cover how agents use reasoning engines to plan complex workflows and overcome the limitations of traditional, passive LLMs that struggled with multi-step tasks.

Speaker

Nicole Königstein, CEO and Co-Chief AI Officer, Quantmate, Zurich

Nicole Koenigstein is the Chief AI Officer & Head of Quantitative Research at quantmate, where she leads innovative projects at the intersection of artificial intelligence and quantitative research. She is passionate about making AI and machine learning accessible, guiding companies from initial concepts to real-world deployment, and sharing practical knowledge through consulting and workshops. As a guest lecturer, Nicole teaches Python, machine learning, and deep learning at various universities. She has spoken at more than 50 international events in AI and quantitative finance, where she shares insights and emerging industry trends. Nicole is also the author of Math for Machine Learning and Transformers in Action (Manning Publications). Her forthcoming book, Transformers: The Definitive Guide, will be published by O’Reilly Media. Through her books, lectures, and blog, she is dedicated to making complex topics approachable while building a vibrant community of learners and professionals.

Resources

Learnings

  • Practical reasoning in a modular mind: Peter Carruthers, a cognitive scientist, says that the human mind is modular. We are rediscovering the architecture of the mind.
  • Fodor, The Modularity of Mind
  • The mind contains specialized systems
  • Carruthers shows that even our most flexible and creative forms of thought arise from a perception-belief-desire-planning architecture in which numerous specialized modules collaborate, making the human mind massively modular
  • How did Chain of Thought become reasoning?
  • Models discovered latent reasoning without weight changes.
  • CoT for agents
    • Planning
    • Data analysis
    • Research task
    • Debugging
    • Decision-making
  • ReAct: We iteratively reason over the context
  • Reasoning Models: From patterns to processes - rise to thinking

Summary

Introduction

Nicole Königstein’s talk Reasoning in LLMs – Embracing the Era of Experience explores a new shift in the world of large language models (LLMs). We are moving from models that simply “answer questions” to models that can reason, reflect, plan, solve multi-step problems, and even organize themselves like teams.

Her central claim is powerful: LLMs are slowly rediscovering the architecture of the human mind — its modules, its planning mechanisms, and even early traces of metacognition.

This summary unpacks her ideas in simple words, referencing the materials and visuals from her slide deck (PDF: reasoning_LLMs_Packt_Nicole.pdf).


The Human Mind as a Model for AI Reasoning

    1. The Mind Is Modular (Peter Carruthers, Jerry Fodor)

    Nicole opens with cognitive science. On page 4, she references Peter Carruthers, who argues that the human mind is massively modular:

    • Not one big “thinking blob”
    • But many small specialized sub-systems (vision, memory, language, motor planning, etc.)

    On page 6, she shows Fodor’s diagram of the mind: a complex flowchart with boxes like language comprehension, belief systems, practical reasoning, motor output, etc.

  • ELI5 analogy

    Think of the human mind like a company:

    • Marketing department = perception
    • Finance department = belief formation
    • Strategy team = planning
    • Operations = motor output

    Each department is separate, but they all collaborate.

    Nicole’s point: modern AI agents are starting to look exactly like this.

    2. AI Is Converging to the Same Architecture

    On page 5, she says: “We are rediscovering the architecture of the mind.”

    Meaning: Even though AI researchers didn’t try to copy the brain, LLM research naturally evolved toward a modular, multi-component reasoning system.


From Chain of Thought to Actual Reasoning

    1. Chain of Thought (CoT) Began as a Hack

    On page 9, Nicole explains CoT:

    • Early LLMs didn’t know how to reason.
    • But prompting with “Let’s think step by step” suddenly made them explain their process.
    • However, this wasn’t true reasoning. It was style imitation.
  • ELI5 analogy

    It’s like telling a child to “show your work” in math class. They mimic the format before actually understanding the steps.
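For concreteness, this is roughly what the difference looks like at the prompt level; the question and wording are made up for illustration:

```python
# Sketch: the only difference between a plain prompt and a chain-of-thought
# prompt is the instruction to reason step by step before answering.

question = "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"

plain_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, then give the final answer on its own line."
)

# Either string would be sent to a chat/completion endpoint; the CoT version
# tends to produce intermediate reasoning before the final answer.
print(cot_prompt)
```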

    2. CoT for Agents

    On pages 11–12, she gives examples where CoT prompting becomes useful for agents:

    • Planning
    • Debugging
    • Decision making
    • Research tasks
    • Data analysis

    This is simply telling the agent: “Talk through the steps as you do them.”

    This moved LLMs from answer machines to thinking assistants.


ReAct: Reasoning + Acting

    1. Why ReAct Matters

    Starting page 13, Nicole explains ReAct (a famous 2022 paper):

    • Combine reasoning traces (“thoughts”)
    • With actions (tool use, code execution, searches)

    ReAct agents:

    • Read the problem
    • Think
    • Take an action
    • Observe the result
    • Think again
    • Repeat

    This loop is similar to how a human solves problems iteratively.

    2. Example (Weather Search)

    On page 16, she shows a ReAct trace for “Weather in Zurich”:

    • Thought: I need to use a search tool
    • Action: Internet search API
    • Observation: Weather data
    • Final answer: Summary of weather
  • ELI5 analogy

    It’s like a child solving a puzzle:

    1. Think
    2. Try something
    3. Look at what happened
    4. Adjust
    5. Try again

    This is true “interactive reasoning.”
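A minimal sketch of that loop, with both the model and the search tool stubbed out so only the thought → action → observation control flow is visible (the prompt format is illustrative, not the exact one from the ReAct paper):

```python
# Sketch of a ReAct-style loop for the "weather in Zurich" example.
# The model and the search tool are stubbed so the control flow is visible.

def search(query: str) -> str:
    return "Zurich: 6°C, light rain"               # stand-in for a web-search tool

def llm(transcript: str) -> str:
    # Placeholder model: emits a Thought/Action first, then a Final Answer
    # once an Observation is present in the transcript.
    if "Observation:" not in transcript:
        return "Thought: I need current data.\nAction: search[weather in Zurich]"
    return "Final Answer: It is 6°C with light rain in Zurich."

def react(question: str, max_turns: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        step = llm(transcript)
        transcript += "\n" + step
        if step.startswith("Final Answer:"):
            return step
        if "Action: search[" in step:              # parse the requested action
            query = step.split("search[", 1)[1].rstrip("]")
            transcript += f"\nObservation: {search(query)}"   # feed result back
    return "No answer within the turn budget."

print(react("What is the weather in Zurich?"))
```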


Moving From Pattern Matching to Learned Cognition

    1. Supervised Learning on Reasoning

    On page 18:

    • CoT was originally imitation.
    • But then researchers started fine-tuning models on actual reasoning traces.
    • Models began learning how to break down problems.
    2. Then RL Came In

    On page 19:

    • Reinforcement Learning (RL) rewarded correct reasoning steps.
    • Bad reasoning (shallow, incorrect) got penalized.
  • ELI5 analogy

    This is like teaching a child:

    • Don’t just copy the steps.
    • Understand why each step works.
    • Get reward for good reasoning, not good handwriting.

    This shifts LLMs from style imitation to a true reasoning process.


Introspective Models: Early Metacognition in LLMs

This section is perhaps the deepest part of Nicole’s talk.

    1. What Is Introspection?

    Starting page 21:

    • LLMs are developing activation-level awareness.
    • They can distinguish between:
      • Input text
      • Internal “thoughts”
      • Injected noise vectors

    This is early metacognition: thinking about their own thinking.

    2. Detecting Thoughts vs Text

    On pages 22–23:

    • A model can read a sentence
    • AND simultaneously detect an internal “thought vector”
    • AND report it separately

    This is astonishing because the model is recognizing internal representations.

  • ELI5 analogy

    Imagine reading a book and being able to say: “I see the printed words, but I also notice a thought bubbling in my mind that wasn’t in the book.”

    LLMs are showing a primitive version of this.

    3. Why This Matters

    On pages 24–26:

    Introspection helps with:

    • Planning
    • Debugging
    • Detecting accidental vs intended reasoning steps
    • Improved tool use
    • Reliable multi-step reasoning

    Claude Opus 4.x detects injected concepts ~20% of the time (page 25).

    This is still early, but it’s significant.


Test-Time Compute: Giving Models Time to Think

    1. Compute Isn’t Just About Model Size

    On page 28: Nicole explains that reasoning quality can increase if you simply give the model more time at inference:

    • Larger search
    • More samples
    • Tree search
    • Better exploration

    This is called test-time compute.

    2. Repeated vs Sequential Sampling

    On page 29:

    • Repeated sampling = explore many branches
    • Sequential sampling = refine one branch
  • ELI5 analogy

    Repeated sampling = ask 10 kids to solve a puzzle. Sequential sampling = one kid tries again and again.

    Both have value.
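A tiny sketch of the repeated-sampling side of this (often called self-consistency): sample several answers and take a majority vote. sample_answer() is a placeholder for one stochastic model call; sequential sampling would instead feed each draft back to the model for refinement.

```python
# Sketch of repeated sampling ("ask 10 kids") with a majority vote.
# sample_answer() is a placeholder for one stochastic LLM completion.

import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Placeholder: a real call would sample the model with temperature > 0.
    return random.choice(["42", "42", "42", "41", "43"])

def self_consistency(question: str, n_samples: int = 10) -> str:
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]        # keep the most frequent answer
    print(f"votes: {dict(votes)} -> picking '{answer}' ({count}/{n_samples})")
    return answer

self_consistency("What is 6 * 7?")
```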

    3. Monte Carlo Tree Search (MCTS)

    On pages 30–35:

    • MCTS is borrowed from game AI (e.g., AlphaGo)
    • Simulate multiple paths
    • Evaluate
    • Pick the most promising
    • Refine

    Adaptive Branching MCTS decides whether to:

    • Go wider (try more options)
    • Go deeper (invest more effort in one direction)

    Nicole shows an architecture (page 33) where:

    • A coder agent
    • A tester agent
    • A reviewer agent
    • Are orchestrated by MCTS
    • Running inside a sandbox

    This system learns to iteratively improve code.


When Should Models Think?

    1. Thinking vs Non-Thinking Mode

    On page 38–40: Nicole introduces the idea of:

    • Thinking mode (CoT)
    • NoThinking mode (direct answer)
  • Key idea

    You don’t need chain-of-thought for:

    • Easy tasks
    • Factual lookups

    You do need chain-of-thought for:

    • Logic puzzles
    • Multi-step problems
    • Planning
    2. Thinking Budget

    On page 40–41: You can set a “thinking budget”:

    • E.g., 1024 tokens of reasoning
    • Stop thinking after that

    This prevents runaway costs.
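Illustrative sketch only: provider APIs expose this differently, so max_thinking_tokens and call_model() below are hypothetical names, not a real SDK.

```python
# Illustrative sketch of a "thinking budget". Parameter names differ by
# provider; max_thinking_tokens and call_model() here are hypothetical.

def call_model(prompt: str, max_thinking_tokens: int = 0) -> str:
    # Placeholder: a real client would cap the hidden reasoning trace
    # at max_thinking_tokens before the final answer is produced.
    mode = "thinking" if max_thinking_tokens > 0 else "direct"
    return f"[{mode} answer, reasoning capped at {max_thinking_tokens} tokens]"

# Cheap factual lookup: no reasoning budget needed.
print(call_model("Capital of Switzerland?", max_thinking_tokens=0))

# Multi-step planning problem: allow up to 1024 reasoning tokens.
print(call_model("Plan a 3-leg trip under a 1200 CHF budget.",
                 max_thinking_tokens=1024))
```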


Adaptive Compute Allocation (AdaptThink)

    1. Teaching the Model When to Think

    On page 42–43: AdaptThink is an algorithm that:

    • Lets the model self-evaluate task difficulty
    • Choose thinking or non-thinking
    • Use more reasoning only on hard problems
  • ELI5 analogy

    It’s like a student who knows: “I don’t need a calculator for 5+5, but I do need one for 17 × 43.”

    Models learn to be smarter with compute.
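A toy sketch of the routing idea (not the actual AdaptThink algorithm, which learns this choice with RL): a crude heuristic difficulty estimate stands in for the learned signal that decides between thinking and non-thinking mode.

```python
# Sketch of the routing idea behind AdaptThink: decide per query whether
# the (hypothetical) model should run in thinking or non-thinking mode.
# The real method trains this choice with RL; here a crude heuristic
# stands in for the learned difficulty estimate.

def estimate_difficulty(question: str) -> float:
    # Placeholder: count "hard" markers. A trained model would predict this.
    markers = ["prove", "plan", "optimize", "step", "why", "how many"]
    return sum(m in question.lower() for m in markers) / len(markers)

def answer(question: str) -> str:
    if estimate_difficulty(question) < 0.15:
        return f"[direct answer to: {question}]"            # NoThinking mode
    return f"[chain-of-thought answer to: {question}]"      # Thinking mode

print(answer("What is the capital of France?"))
print(answer("Plan how many trucks we need to optimize the delivery route."))
```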


Structured and Agentic Reasoning

    1. Reasoning Models as a New Class

    On page 45: We now have specialized reasoning models:

    • OpenAI o1
    • DeepSeek R1
    • K2 Thinking
    • AsyncThink variants

    These models are trained not only for correct answers, but for process quality.

    2. AsyncThink: Turning One Model Into a Team

    This is one of the most advanced ideas.

    On page 46–48:

    • AsyncThink treats one LLM as an organization
    • With:
      • Organizer
      • Worker 1
      • Worker 2
      • Fork and Join operations

    This lets the model:

    • Break a problem into subproblems
    • Assign subproblems to internal workers
    • Combine results
  • ELI5 analogy

    This is like your brain splitting a problem:

    • One “mini-you” calculates
    • Another “mini-you” checks errors
    • A manager combines the results

    This is internal teamwork.
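A small sketch of the fork/join pattern, using threads and a stubbed llm() purely to show the organizer/worker structure; the real AsyncThink runs this inside a single model rather than across separate calls.

```python
# Sketch of the fork/join pattern behind AsyncThink: an "organizer" splits
# a problem, "workers" solve sub-problems concurrently, results are joined.
# llm() is a placeholder; the real method runs this inside a single model.

from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    return f"[answer to: {prompt[:40]}...]"        # stand-in for a model call

def organize(problem: str) -> list[str]:
    # Fork: in AsyncThink the organizer emits sub-queries; here we fake two.
    return [f"Sub-problem A of: {problem}", f"Sub-problem B of: {problem}"]

def solve(problem: str) -> str:
    sub_problems = organize(problem)
    with ThreadPoolExecutor() as pool:             # workers run concurrently
        partial_results = list(pool.map(llm, sub_problems))
    # Join: the organizer merges the workers' partial results.
    return llm("Combine these partial results: " + " | ".join(partial_results))

print(solve("Estimate next quarter's cash flow under two growth scenarios"))
```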


Memory: The Next Step in Reasoning Models

    1. ReasoningBank

    On page 49–51: Agents need memory to improve.

    ReasoningBank:

    • Stores distilled reasoning strategies
    • Not raw text
    • So models can re-use reasoning techniques
    • And learn from experience

    Memory-aware test-time scaling (MaTTS) further improves:

    • Memory quality
    • Multi-step refinement
    • Self-evolving ability

    This moves LLMs toward “experience-based intelligence.”
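A minimal sketch of the ReasoningBank idea under heavy simplification: strategies are stored as short distilled notes and retrieved by a crude word-overlap score instead of real embeddings; the function names are hypothetical.

```python
# Sketch of the ReasoningBank idea: store *distilled strategies*, not raw
# transcripts, and retrieve the most relevant one for a new task.
# The similarity measure here is a crude word overlap, purely illustrative.

strategy_bank: list[dict] = []

def distill(task: str, outcome: str) -> dict:
    # Placeholder distillation: a real system would ask an LLM to extract
    # the reusable strategy from a successful (or failed) trajectory.
    return {"task": task, "strategy": f"When doing '{task}', remember: {outcome}"}

def remember(task: str, outcome: str) -> None:
    strategy_bank.append(distill(task, outcome))

def recall(new_task: str) -> str | None:
    def overlap(entry: dict) -> int:
        return len(set(entry["task"].lower().split()) & set(new_task.lower().split()))
    best = max(strategy_bank, key=overlap, default=None)
    return best["strategy"] if best and overlap(best) > 0 else None

remember("fill a web form with customer data", "validate field types before submitting")
print(recall("fill an insurance form with client data"))
```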


The Bigger Picture: LLMs Becoming Cognitive Systems

On page 52, Nicole concludes: “Foundation models are not just tools. They are the beginning of cognitive systems with structure, roles, and collaboration.”

This is her main message:

  • LLMs are evolving from text predictors
  • Into multi-agent cognitive architectures
  • With modules, planning, memory, and introspection

Similar to how the mind evolved specialized systems that collaborate.


Final Summary (In Simple Words)

Nicole’s talk is about one central transformation:

LLMs are shifting from “chatbots” into “thinking systems” that have modules, plans, memory, introspection, and reasoning strategies — similar to how the human mind is organized.

She covers:

  1. Human mind as inspiration: Cognitive science says the mind is modular. LLM research is arriving at the same architecture independently.

  2. Chain-of-thought: Originally a trick. Now a foundation for teaching reasoning.

  3. ReAct: Combining reasoning (“thoughts”) with actions (tools) in a loop.

  4. Learned reasoning: Through supervised learning and reinforcement learning.

  5. Introspection: Models can detect internal activations separate from text.

  6. Test-time compute: Giving models more time and structure to think.

  7. Adaptive compute: Teaching models when to think and when not to.

  8. Agentic reasoning: Turning one LLM into an organized team (AsyncThink).

  9. Memory and self-evolution: Using ReasoningBank to accumulate transferable reasoning strategies.

Her ultimate point: We are entering an era where AI systems develop internal organization, memory, self-improvement, and modular processes — which are the beginnings of real cognitive structure.

This is why she calls it the era of experience — AI systems that learn not only from data, but from doing.

Tech Talk: Agentic AI Industry Use-cases / Agentic AI in Finance

Context

Agentic AI in Different Industries: This talk will explore how agentic AI is moving beyond a niche technology to become a transformative force across various industries. We’ll present real-world case studies demonstrating how autonomous AI systems are being deployed for tangible business value, from automating customer service and digital commerce to streamlining supply chains.

Speaker

Adrian is an AI professional with 15+ years of experience, grounded in an engineering background and shaped by multidisciplinary roles across development, data/AI, project and product management, sales, and marketing. In his most recent role at Microsoft, he has worked closely with startups and scaleups, public-sector organizations, and healthcare institutions—supporting them from early ideation through production-grade AI deployments using cloud-enabled technologies. His expertise lies in bridging new pilot initiatives with product requirements and roadmaps, product-marketing strategies, and training or enablement programs.

Beyond his industry work, he is also an active author and educator, teaching both technical and executive audiences. He has created and released multiple courses with DeepLearning.ai, LinkedIn Learning, and The Linux Foundation, sharing practical insights on modern AI systems and real-world implementation.

Learnings

  • How it started?
    • Prompt Engineering
    • Grounding / RAG
    • Fine-tuning
  • Agent evolution
    • Single AI models → AI agents → Poly-agents → Multi-agents
    • You have a goal and a task, and you give the agent a series of tools and resources: achieve the result
    • Autonomous decision-making - how to achieve a task
    • The entire industry is talking about it, but we have not reached a state where we are ready to implement multi-agents
    • Poly-agents: an orchestrator LLM to which we pass various tools. Most of the patterns in the industry are here
    • Multi-agents: interaction between various agents
  • Current Trends
    • Internal productivity bots
    • Multi-provider Generative AI
    • Generative AI for databases
    • Generative AI for dashboards
    • LLMOps
    • Provisioned units from cloud providers
      • Reserved instance of a specific kind of model for a specific region
    • Cloud and hybrid landing zones
    • LLM Framework development
    • New AI regulations
    • AI Standards and Frameworks
  • Impact on the user interfaces
    • Terminal: command line highly technical interfaces
    • GUI: window based OS and apps
    • Companion: side copilot and chat kind of applications
    • Transparent: automated back-end applications, transparent for users
    • Embedded: ai-enabled buttons and ui features for simple use
    • Multi-Modal: user interfaces that interact with the environment
    • Computer-using: automation of complex UI steps via computer use
    • Adaptive UI: user level UI adaptation based on role, context, accessibility etc
    • NUI: Natural interface with advanced AI avatars
  • Use cases
    • Content generation
    • Summarization
    • Code generation
    • Semantic Search
    • Multiple-model use cases: end-to-end call center analytics, customer 360, and business process automation
      • search through structured and unstructured documentation
  • Top use cases
    • Compliance data mapping
      • Map some data to a specific template that you want
      • You have some data and you want to map it to a specific regulation
      • It rarely goes wrong and is the most deterministic use case.
      • The EU AI Act requires companies to produce a huge amount of documentation about their AI systems, following a template they have to provide. For example, AISBOM (AI System Bill of Materials) is one such spec that can be used to create relevant documentation for a company’s internal AI systems. The use case mentioned here involves taking these specs and using an LLM to read tech specs and other internal guides to populate the relevant parameters
      • You can put the spec in the system prompt and then you can map the relevant data to various data fields
    • Real-time audio, 1-800-style call systems
      • Get the transcription, run it through an LLM, and then create various tasks from it
    • AI + certified web search
      • A combination of AI and web search where only relevant, certified web search results are returned rather than everything on the open web
  • Finance Operations
    • Explaining financial models
    • Excel file analysis
    • Dashboards on demand
    • Financial forecasting
  • HR Operations
    • Drafting job descriptions
    • Summarizing policies
    • Candidate analysis
    • Performance recommendations
    • Employee 360 overview
  • Product Development
    • Rapid prototyping
    • Summarizing customer feedback
  • Legal, Compliance, and Procurement
    • Reviewing documents faster
    • Flagging potential risks
    • Structured contract analysis
    • RFI/RFP evaluation
    • Template generation
  • Customer Operations
    • Dynamic support blueprints for agents
    • Customer 360
    • Automated L1 triage
    • Personalized customer service
    • Voice controls for accessibility
    • Customer loyalty improvements
  • IT Operations
    • Internal IT support
    • Automated upskilling paths
    • Auto documentation generation
  • Technical Car Inspection
    • Visual asset analysis
    • Live access to technical specs
    • Voice-to-data controls
  • Best use case for 2025: fluid UI that will change dynamically in real time

Evolution

Overall Message of the Slide

This slide explains the industry’s progression from traditional, isolated AI systems toward advanced, interacting agent ecosystems. Each stage increases autonomy, reasoning capability, coordination, and emergent behavior.


The Four Stages

1. Single AI Models (Pre-GenAI): Traditional machine learning models built for one narrowly defined task.

  • No reasoning
  • No tool use
  • No workflow capability
  • Humans manually integrate outputs

2. AI Agents (Unitary Copilots): One LLM-based agent that can reason, plan, and call tools.

  • Executes multi-step tasks
  • Functions like an intelligent analyst
  • Still a single monolithic agent

3. Poly-agents (Orchestrated Workflows): A supervising “manager agent” delegates tasks to multiple specialized sub-agents.

  • Structured, hierarchical
  • Sub-agents rarely interact with each other
  • Often better described as AI-driven workflows
  • Used widely in early agentic systems today

4. Multi-agents (Interactive Ecosystems): Multiple agents communicate, cooperate, debate, or negotiate.

  • Autonomous peer-to-peer interactions
  • Emergent behaviors
  • Resembles a team or committee
  • Capable of tackling complex, ambiguous tasks

Key Idea

The evolution moves from isolated models → single copilots → orchestrated workflows → interacting agent ecosystems.

Finance maps naturally to this progression due to its committee structures, specialist teams, and multi-step decision processes.

Summary

Big Picture

This talk explains how “agentic AI” is moving from buzzword to practical infrastructure for real businesses. Instead of a single LLM answering a prompt, we now design agents: systems that can understand a goal, plan steps, call tools and services, and return a result with minimal human babysitting.

The speaker walks through:

  • How we got from classic ML to agentic systems
  • What trends are shaping the ecosystem (cloud, regulations, LLMOps)
  • How user interfaces are changing
  • Concrete use cases across functions, with many that map naturally to finance and operations.

The underlying message: most organizations today are somewhere between “simple copilots” and “poly-agent workflows,” and multi-agent ecosystems are the next step—but we’re not fully there yet.

From Prompts to Agentic Systems

    1. How It Started

    The talk begins with the now-familiar GenAI stack (slide “How it started”):

    1. Prompt engineering – You feed instructions + examples into a general LLM.

      • In-context learning, few-shots, role prompts, templates.
      • Works surprisingly well, but is brittle and hard to scale.
    2. Grounding / RAG – Instead of relying on the model’s training data, you let it search your own knowledge base.

      • Vector search over documents, then ask the model to use those passages.
      • This makes answers more accurate and auditable.
    3. Fine-tuning – Only if 1 and 2 are not enough.

      • Teach the model new skills or formats.
      • More expensive and slower to iterate, so it’s the last resort.

    This is still “single-shot LLM use”—very powerful, but not yet agentic.
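A minimal grounding/RAG sketch, with word overlap standing in for vector search and a hypothetical call_llm() placeholder, just to show the retrieve-then-answer shape:

```python
# Minimal grounding/RAG sketch: retrieve relevant passages first, then ask
# the model to answer *using only those passages*. Word overlap stands in
# for vector search; call_llm() is a placeholder for any chat completion.

DOCS = [
    "Refunds are processed within 14 days of receiving the returned item.",
    "Premium customers get free shipping on orders above 50 EUR.",
    "Support is available Monday to Friday, 9:00 to 17:00 CET.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    score = lambda d: len(set(d.lower().split()) & set(question.lower().split()))
    return sorted(DOCS, key=score, reverse=True)[:k]

def call_llm(prompt: str) -> str:
    return f"[answer grounded in {prompt.count('PASSAGE')} passages]"

def answer(question: str) -> str:
    context = "\n".join(f"PASSAGE: {p}" for p in retrieve(question))
    return call_llm(
        f"Answer using only the passages below. If they are not enough, say so.\n"
        f"{context}\n\nQuestion: {question}"
    )

print(answer("How long do refunds take?"))
```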

    2. The Evolution to Agents

    The “Agents or Workflows?” slide shows four stages of maturity:

    1. Single AI Models (pre-GenAI)

      • A model for fraud scoring, another for credit risk, another for churn.
      • Narrow tasks, no tool use, humans glue everything together.
    2. AI Agents (unitary copilots)

      • A single LLM-based assistant that can reason and call tools.
      • For example, an assistant that can read a doc, query a database, and explain results all in one flow.
      • Still one “big brain,” but it can perform multi-step tasks.
    3. Poly-agents (orchestration)

      • A manager/orchestrator agent delegates to sub-agents (e.g., a retriever, summarizer, code generator).
      • The structure looks like a workflow: the orchestrator decides which tool or sub-agent runs next.
      • Most industry patterns today sit here: “agentic pipelines” more than true multi-agent societies.
    4. Multi-agents (interaction)

      • Multiple agents talk to each other, negotiate, critique, or collaborate.
      • Less like a fixed workflow, more like a committee or trading floor.
      • In theory, great for ambiguous, multi-stakeholder problems (like complex financial decisions). In practice, most companies are still experimenting.

    Your notes capture the key idea: industry is excited about multi-agents, but most production deployments are still at poly-agent stage.

The “Current trends” slides outline the ecosystem that makes agentic AI viable in real organizations:

  • Internal productivity bots – ChatGPT-like copilots tailored to internal docs.
  • Multi-provider GenAI – Using multiple cloud LLMs (e.g., Azure + AWS) for resilience, cost, or capabilities.
  • GenAI for databases & dashboards – Natural language to SQL, auto-generated BI reports, conversational analytics.
  • LLMOps – Monitoring, evaluation, automation, and testing for LLM-based apps.
  • Provisioned Units (PTUs, etc.) – Reserved, capacity-based model access in specific regions (relevant to regulated industries).
  • Cloud & hybrid landing zones – Network, security, API gateways, identity needed to safely expose tools to agents.
  • LLM frameworks – LangChain, Semantic Kernel, Ollama, etc., which standardize how you build agents.
  • Regulations & standards – EU AI Act, Canada AIDA, ISO 42001, NIST AI frameworks; these drive documentation-heavy, compliance-oriented use cases.

A second governance slide adds:

  • Data governance and DSPM (data security posture for AI)
  • Internal AI policies and shared-responsibility models
  • AI red teaming and regulatory compliance

In short: agentic AI is not just “cool demos,” it’s sitting inside a serious governance and infra stack.

Changing User Interfaces

One of the most useful slides is the “Impact on the user interfaces” spectrum ranging from terminal to natural UI (NUI):

  • Terminal – Classic command line (UNIX, MS-DOS).
  • GUI – Windows / macOS style windows and icons.
  • Companion – Side copilots and chat panes (like Microsoft Copilot).
  • Transparent – AI runs in the background (document processing, analytics).
  • Embedded – Little AI buttons inside apps (e.g., “summarize slide deck”).
  • Multi-modal – Interfaces that see, hear, and sense environment.
  • Computer-using – Agents that drive the UI like a human (Operator, Claude).
  • Adaptive UI – Layouts that change by role, context, and accessibility needs.
  • NUI – Natural user interface, often with avatars and richer presence.

Your note “best use case for 2025: fluid UI that will change dynamically in real time” fits here: agentic AI will sit under UIs that adapt on the fly depending on who you are, what you’re doing, and what the agent is trying to accomplish.

Core Use-Case Families

    1. Horizontal GenAI patterns

    From the “Use cases” slide and your notes:

    • Content generation – Responses in call centers, personalized website copy.
    • Summarization – Call transcripts, internal SME documents, social media trend summaries.
    • Code generation – Natural language to SQL, queries to proprietary data models, code documentation.
    • Semantic search – Searching reviews, internal knowledge, or regulatory docs.

    Multiple-model, end-to-end scenarios:

    • Call center analytics: classification, sentiment, entity extraction, summarization, and outbound email.
    • Customer 360: summarizing interactions, searching history, generating tailored offers.
    • Business process automation: combining search, code generation, and content generation over both structured and unstructured data.

    These are “building blocks” that can be reused in finance, HR, legal, etc.

    2. Compliance Data Mapping (Very Strong Fit for Finance & Regulation)

    One of the most deterministic and practical use cases you captured:

    • You have a template required by regulators (e.g., EU AI Act documentation, AISBOM).
    • You have a large pile of internal documentation: tech specs, architecture, policies.
    • The agent:
      • Reads the regulatory template placed in the system prompt.
      • Uses RAG over your internal docs.
      • Maps content into the correct template fields.

    Why this is powerful:

    • The structure is fixed (template).
    • The model is mostly doing extraction and mapping, not free-form invention.
    • Low error rate if prompts and retrieval are set up carefully.

    This pattern generalizes to many “map internal data → regulatory format” tasks.
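A sketch of how this mapping pattern might be wired up, assuming hypothetical field names and a call_llm() placeholder; the key points are that the template lives in the system prompt and the model is told never to invent values:

```python
# Sketch of compliance data mapping: the template's fields live in the
# system prompt, and the model only extracts values from retrieved internal
# documentation. Field names and call_llm() are hypothetical.

import json

TEMPLATE_FIELDS = ["system_name", "intended_purpose", "training_data_sources", "risk_level"]

INTERNAL_DOCS = (
    "The fraud-screening service FraudGuard is used to flag suspicious card "
    "transactions. It is trained on anonymized transaction logs from 2019-2023."
)

def call_llm(system_prompt: str, user_prompt: str) -> str:
    # Placeholder: a real call would return JSON produced by the model.
    return json.dumps({f: "<extracted or 'not found'>" for f in TEMPLATE_FIELDS})

system_prompt = (
    "You fill regulatory documentation templates. For each field, copy the "
    "value from the provided documents or write 'not found'. Never invent values.\n"
    f"Fields: {TEMPLATE_FIELDS}. Respond as JSON."
)

filled = json.loads(call_llm(system_prompt, f"Documents:\n{INTERNAL_DOCS}"))
print(json.dumps(filled, indent=2))
```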

    3. Real-Time Audio + 1-800 Systems
    • Live transcription of customer calls.
    • LLM runs on the transcript to:
      • Summarize the call.
      • Extract entities, reasons for contact, sentiment.
      • Trigger follow-up tasks (ticket updates, compliance checks, knowledge base improvements).

    This is where agentic chains (transcribe → summarize → classify → act) become very natural.
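A sketch of that chain with every stage stubbed out; in production each function would be a speech-to-text, LLM, or ticketing call:

```python
# Sketch of the call-center chain: transcribe -> summarize -> classify -> act.
# Every stage is a stub; in production each would be a model or API call.

def transcribe(audio_chunk: bytes) -> str:
    return "Customer says the card was charged twice and wants a refund."

def summarize(transcript: str) -> str:
    return "Duplicate charge reported; refund requested."

def classify(transcript: str) -> str:
    return "billing_dispute"                       # reason-for-contact label

def act(summary: str, label: str) -> None:
    # Downstream task: open a ticket, trigger a compliance check, etc.
    print(f"Ticket created [{label}]: {summary}")

def handle_call(audio_chunk: bytes) -> None:
    transcript = transcribe(audio_chunk)
    act(summarize(transcript), classify(transcript))

handle_call(b"...raw audio bytes...")
```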

    4. AI + Certified Web Search
    • Instead of free, open web search, the agent only uses certified sources: curated domains, regulatory sites, trusted internal portals.
    • Important for finance and healthcare where hallucinated or untrusted sources are unacceptable.

Functional Use Cases by Department

The talk lists many cross-functional patterns (your notes align with the slide “Other use cases”):

  • Finance operations – Explain financial models, analyze large Excel files, generate ad hoc dashboards, assist forecasting.
  • HR operations – Draft job descriptions, summarize policies, screen candidates, suggest performance review language, build employee 360 views.
  • Product development – Rapid prototyping, summarizing customer feedback into requirements.
  • Legal / compliance / procurement – Faster document review, risk flagging, structured clause analysis, RFI/RFP scoring, standardized templates.
  • Customer operations – Dynamic support scripts for human agents, customer 360, L1 triage, personalization, accessibility via voice.
  • IT operations – Internal helpdesk copilots, tailored upskilling paths, automatic documentation.
  • Technical car inspection (example vertical) – Visual inspection of assets, instant access to technical specs, voice-to-data entry.

All of these can be implemented as:

  • Single copilots,
  • Poly-agent workflows,
  • Or gradually as interactive multi-agent systems.

Key Takeaways

  1. Agentic AI is the next layer above LLMs, RAG, and fine-tuning: it’s about goal-driven systems that can plan, use tools, and coordinate.

  2. Most real-world systems today are poly-agents (orchestrated workflows), not yet fully autonomous multi-agent societies—but the direction is clear.

  3. Cloud, governance, and regulations are not side topics: they strongly shape which use cases are viable, especially in finance.

  4. User interfaces will become fluid and adaptive: copilots, background automations, embedded buttons, multi-modal and computer-using agents, and eventually natural avatar-based interfaces.

  5. The most robust early wins are structured, deterministic tasks: compliance mapping, call center analytics, and documentation-heavy workflows where the agent is more of a tireless junior analyst than a freewheeling “AI brain.”

Tech Talk: Agentic LLM Architecture by Hand

Context

A deep dive into the core components and architecture that give LLMs their intelligence, with a focus on how systems plan, reason, and manage context.

Speakers

Mohsena is a Doctoral Researcher at CU Boulder working with Dr. Tom Yeh at the Center for the Brain, A.I., and Child (BAIC). Her research focuses on chatbots, neuroimaging, and human-AI interaction, especially in regard to young children.

Dr. Tom Yeh is an Associate Professor of Computer Science at the University of Colorado Boulder, where he founded the AI by Hand initiative — a global educational movement with the mission of teaching AI architectures and algorithms to 1 million people worldwide.

Yeh’s research spans Generative AI, AI education, assistive technology, computer vision, brain imaging, 3D printing, and STEM education, with a strong focus on bridging human-centric design and frontier AI techniques. His group develops new ways to make complex AI math and algorithms accessible to learners, practitioners, and underserved communities globally.

He has published over 150 peer-reviewed articles with an h-index of 39, and received Best Paper Awards and Honorable Mentions from top venues including CHI, UIST, SIGCSE, ASSETS, and MobileHCI. His research has been funded by the National Science Foundation (NSF), National Institutes of Health (NIH), DARPA, the Knight Foundation, and the Piton Foundation.

Yeh earned his PhD in Computer Science at the Massachusetts Institute of Technology, focusing on vision-based user interfaces, and was a postdoctoral fellow at the University of Maryland Institute for Advanced Computer Studies (UMIACS). Today, in addition to his academic research, he is widely recognized as an educator and thought leader with a global following of more than 100,000 learners across platforms such as LinkedIn, Substack, and YouTube

Learnings

  • I liked the fact that the speaker did all the explanation using a plain, simple Excel sheet. I have never come across someone who has explained RAG via Excel; it is always some Jupyter notebook or Streamlit app that is used. In this case, I was happy to see someone explain most of the concepts quite well in Excel.

Round Table: AI Engineering

Context

A forward-looking discussion on emerging trends and the technical and logistical hurdles of moving from experimental agents to robust, scalable production systems.

Speakers

Chip Huyen, Researcher, Tep Studio, San Francisco

Chip works at the intersection of AI, education, and storytelling. Her career includes roles at Snorkel AI and NVIDIA, founding an AI infrastructure startup, and teaching Machine Learning Systems Design at Stanford.

Her first English-language book, Designing Machine Learning Systems (2022), became an Amazon bestseller in AI and has since been translated into more than ten languages. Her latest book, AI Engineering (2025), has quickly become the most-read title on the O’Reilly platform since its release

Paul Iusztin

Senior AI Engineer | Founder, Decoding AI Paul is a senior AI engineer and contractor with over eight years of experience, specializing in designing and implementing modular, scalable, and production-ready ML systems for startups worldwide. His central mission is to build an AI academy and create data-intensive AI/ML products that serve the world.

Since training his first neural network in 2017, he has been driven by two core passions: Designing and deploying production-grade AI/ML systems using software engineering and MLOps best practices. Teaching others about the process.

Currently, he develops vertical AI agent solutions for a U.S.-based startup. Previously, he built AI, ML, and MLOps products for Metaphysic, CoreAI, Everseen, and Continental.

He is also the founder of Decoding ML, a platform offering battle-tested content on designing, coding, and deploying production-grade AI systems. As a content creator, he co-authored the LLM Engineer’s Handbook, an Amazon bestseller and widely recognized industry reference on LLM, RAG, and LLMOps solutions, with over 15,000 copies sold

Maxime Labonne

Head of Post-Training, Liquid AI, London, United Kingdom

Maxime is a Machine Learning Scientist with a PhD from the Polytechnic Institute of Paris and an AI/ML Google Developer Expert. Since 2019, he has worked extensively with Large Language Models (LLMs) and Graph Neural Networks (GNNs), applying them across R&D, industry, finance, and academia.

He has created numerous popular LLMs on Hugging Face, including AlphaMonarch-7B, Beyonder-4x7B, Phixtral, and NeuralBeagle14, as well as widely used LLM tools such as LLM AutoEval, LazyMergekit, LazyAxolotl, and AutoGGUF.

As an educator and content creator, he authored the widely acclaimed LLM course on GitHub (with over 39k stars) and regularly publishes technical articles on his blog and Towards Data Science. He is also the author of the technical books LLM Engineer’s Handbook and Hands-On Graph Neural Networks Using Python.

Learnings

  • AI engineering is broadly about applying strong software skills to AI projects. One needs skills in LLMOps and scaling data pipelines. It is interdisciplinary, and you have to know a bit of everything.
  • Data scientists struggle to write scalable code, which leads to POC purgatory
  • Most people write prompts well enough
  • Data over the model: usually people focus too much on the code and too much on the LLM
  • The biggest impediment is the lack of clarity on what to build. Teams pivot too much and add too much complexity without the needed business case. We started using RAG, but it is very complex; introducing MCP and LiteLLM increased the complexity further.
  • A useful mindset / design pattern: durable workflows. Distinguish AI frameworks from orchestrator frameworks such as DBOS (see the sketch after this list)
  • AI framework = LangChain, LlamaIndex, CrewAI, Guardrails, DSPy: defines logic, steps, prompts, RAG flow, agent patterns.
  • Orchestrator = DBOS, Ray, Prefect, Flyte: runs tasks, schedules jobs, handles concurrency, queues, retries, state
  • Together they give you: deterministic logic (AI framework) and reliable execution (orchestrator)
  • DBOS provides all of these as part of the orchestrator, so you don’t need RabbitMQ/Kafka/Celery/Redis.
  • The LiteLLM + LangChain route is something you can start with
  • There is also a lower-level API from LangChain that can work
  • OpenRouter is a unified API platform that lets you access many different LLMs (OpenAI, Anthropic, Google, Meta, Mistral, Groq, etc.) through a single API, without needing separate keys, billing setups, or vendor-specific code.
  • Small models are hard to work with as you need more guard rails. Usually there is not much testing done in the downstream apps using small models and hence one might have to do more work. On the flip side, there are a lot of advantages to working with small models
    • Performance of small models is getting better and better
  • When a new trend comes in, we go centralized first. We started with big models because that was the only way to access them. But as time goes on, models will be decentralized and become smaller
  • AI is pretty powerful and it can do a lot of things now. The best way to learn is to build something. If you don’t know what to build, you run the risk of becoming too specific. You need to be a system designer and keep the big picture of AI in view
  • Build something end to end. Build it, deploy it, and deploy everything; that is the only way to learn. Unless you deploy and learn end to end, you will never learn. Have the product mindset, with human layers on top of that, thinking about how to have a flexible AI architecture
  • What should an AI engineer focus on?
    • Breadth and Depth
    • When you focus on one specific area (data cleaning, vector DBs, evals), you become depth-focused
    • Try end-to-end, small-scale POCs. Out of 10 POCs you build, maybe one becomes powerful, and then you can decide whether you want to be a depth person or a breadth person
    • It is valuable to have different skills. Looking from the perspective of an AI company, they need specific skill sets, and hence it might make sense to specialize. If you are learning about AI, try to find your niche after getting a good idea of the field.
      • You provide value to the company; you become very valuable if you have specific skill sets
    • Startups appreciate breadth. As the company grows, the company hires specialist roles.
    • If you go to a startup world, you will have to do all the things. If you go to a bigger company, you will have to specialize
  • How do you recommend AI engineers think about systems thinking?
    • You actually need to build an end-to-end system to develop systems thinking
    • Draw diagrams and then connect the pieces
    • Try to separate the concerns between components.
    • Build an end-to-end project: start with a data source that is as dirty as possible and build on top of it
    • How do we teach AI to teach systematically?
    • Reimplementing everything from scratch is important
    • You need to be confronted with a lot of problems before you can come up with an elegant solution
    • https://github.com/mlabonne/llm-course
      • Have an understanding of everything. You cannot focus on everything on the list. You will have to work on the stuff that appeals to you
    • Getting the bigger picture will give you the flexibility to know what can be automated
  • What non-technical skills matter for an AI engineer in production?
    • Communication and teamwork.
    • Get out of notebooks. Get used to writing code outside notebooks
    • Start writing in Go, Rust, and other languages
    • Start deploying stuff. What you see on paper and deploying it are completely different
  • One piece of advice to youngsters
    • Get a broad overview of the field and pick up something important
    • Start building stuff. Learn from people and iterate.
  • None of the frameworks are great; you will have to build things from scratch. If you want to tweak things, you will have to work through the framework. Graphs make things complicated; the LangGraph functional API is better. DBOS is a durable-workflow tool specialized for agents and gives you the ability to automate pipelines, etc.
  • Are no-code or low-code tools useful?
    • It is a trade-off between ease of use and customization. No-code/low-code is useful for quick experimentation, but once things grow in complexity you will need to know more. The speaker is not too much of a fan of low-code tools
  • Maxime Labonne has a course on LLM that is worth checking out
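
As a concrete illustration of the OpenRouter point above, here is a minimal sketch using the OpenAI-compatible client; the model name and environment variable are illustrative assumptions, not from the talk.

```python
# Minimal sketch: calling many different providers through OpenRouter's single,
# OpenAI-compatible endpoint. Assumes the `openai` package is installed and an
# OPENROUTER_API_KEY environment variable is set; the model id is illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # one endpoint for many vendors
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",        # swap providers by changing only this string
    messages=[{"role": "user", "content": "Explain in one sentence what an AI agent is."}],
)
print(response.choices[0].message.content)
```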

Multimodal AI

Context

A forward-looking talk about agents that don’t just read and write text, but also see and act in visual or physical environments. The speaker will present the latest research on integrating computer vision and robotics with language-model agents

Speaker

Max Tschochohei AI Engineering, Google Cloud Munich

Learnings

  • Meta: help any business create an ad
  • Agentic AI
    • Accelerate Creative
    • Process automation
    • Placement Optimization
  • There are infinite modalities: audio, text, video, code, heart rate, DNA, X-ray, protein, traffic data
  • Regardless of modality, everything goes through an encoder-decoder
    • encoder → latent space → decoder
  • One limitation of early transformers was that they processed only text. For images, one had to translate them into text and then use transformers
  • Image models use different generative approaches: Variational Autoencoders (VAEs), GANs, and diffusion models
  • VAE: these models generate images similar to the images they were trained on; they are fast and efficient, but the results are smooth and unnatural
  • GAN: these models generate sharp and realistic images, but they suffer from mode collapse
  • Diffusion models: work by adding noise and then restoring the original image through step-wise removal of that noise
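
To make the encoder → latent space → decoder idea concrete, here is a minimal autoencoder sketch in PyTorch (not from the talk); the image size and layer widths are arbitrary assumptions, and a real VAE would additionally model a latent distribution.

```python
# Minimal encoder -> latent space -> decoder sketch (a plain autoencoder, not a full VAE).
# Assumes PyTorch is installed; the 28x28 input size and layer widths are illustrative.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(            # compress the input into a latent vector
            nn.Flatten(),
            nn.Linear(28 * 28, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(            # reconstruct the input from the latent vector
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 28 * 28), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)                      # latent representation
        return self.decoder(z).view(-1, 1, 28, 28)

model = AutoEncoder()
reconstruction = model(torch.rand(4, 1, 28, 28))  # a batch of 4 random "images"
print(reconstruction.shape)                       # torch.Size([4, 1, 28, 28])
```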

Workshop: Context Engineering for LLMs

Context

This workshop will tackle the critical challenge of persistent memory in AI agents, a central point of frustration for developers. We will dive into practical strategies for context engineering, including advanced techniques like Retrieval-Augmented Generation (RAG) and document chunking. The session will also introduce GraphRAG, demonstrating how knowledge graphs can be used to provide agents with accurate, hallucination-free data, ensuring systems can handle context over extended interactions

Speaker

Denis believes that true understanding comes only when you can teach someone else how to do something.

A graduate of Sorbonne University and Paris Cité University, he later taught at Panthéon Sorbonne University, where he authored and registered one of the earliest patents for word tokenization and encoding systems, followed by a second patent for a conversational human-machine system. More than 30 years ago, he deployed some of the first large-scale AI software systems, moving early and decisively into real-world applications.

From the beginning of his career, he designed and implemented:

A cognitive NLP chatbot used as a language-learning system by major global companies.

AI systems for aerospace corporations, supporting high-stakes engineering and operations.

AI-driven resource optimization tools for industries ranging from luxury goods to automotive.

An advanced general-purpose planning and scheduling (APS) solution now used worldwide.

He became an early expert in explainable AI (XAI), ensuring that the enterprise AI systems he built included interpretable, acceptance-driven explanations—critical for adoption in sectors such as aerospace, apparel, and complex supply chains

Resources

From Prompt to Context Engineering

The entire process is a data pipeline. The output from one AI prompt is stored in a variable, and that variable is then used as the input for the next prompt. This allows you to break a complex analysis into a series of simpler, focused tasks.

Step 1: Filtering the Raw Data

Prompt: `prompt_g2` Goal: To clean the raw transcript and separate facts from noise. Input: The full, original `meeting_transcript` variable. Task: It asks the AI to “isolate the substantive content” (decisions, problems, updates) from the “conversational noise” (greetings, coffee talk). Output: The cleaned-up text is saved in the `substantive_content` variable. This variable is the “link” that feeds the next steps.

Step 2: Running Parallel Analyses on Clean Data

Now that you have clean data in `substantive_content`, you can run multiple different analyses on it. `prompt_g3` and `prompt_g4` run in parallel, both using the same clean input.

Step 2a: Identifying What’s New

Prompt: `prompt_g3` Goal: To compare the meeting to a previous summary and find only the new information. Input:

  1. `substantive_content` (the clean data from Step 1).
  2. `previous_summary` (a new piece of context representing “old” knowledge).

Task: “Identify and summarize ONLY the new developments… that have occurred since the last meeting”. Output: The list of new developments is saved in the `new_developments` variable.

Step 2b: Analyzing Subtext and Mood

Prompt: `prompt_g4` Goal: To analyze the human element of the meeting, not just the facts. Input: `substantive_content` (the same clean data from Step 1). Task: “Analyze… for implicit social dynamics and unstated feelings” (hesitation, tension, mood). Output: The analysis is saved in the `implicit_threads` variable. (This variable isn’t used later, but it’s part of the parallel analysis).

Step 3: Formatting Data for Action

Prompt: `prompt_g6` Goal: To take the list of new information and structure it into a clear, usable format. Input: `new_developments` (the output from Step 2a). Task: “Create a final, concise summary of the meeting in a markdown table”. Output: The resulting table is saved in the `final_summary_table` variable.

Step 4: Generating a Real-World Action

Prompt: `prompt_g7` Goal: To use the structured data to perform a final, actionable task. Input: `final_summary_table` (the formatted table from Step 3). Task: “Based on the following summary table, draft a polite and professional follow-up email”. Output: The final email text is saved in the `follow_up_email` variable.

This chain demonstrates a powerful workflow: Filter Data (`g2`) → Analyze Data (`g3`, `g4`) → Structure Data (`g6`) → Act on Data (`g7`).
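
A minimal sketch of this chain in Python follows; the `call_llm` helper is a stand-in for whatever LLM client you use, and the prompts and variable names mirror the steps above but are illustrative, not the workshop's actual code.

```python
# Minimal sketch of the Filter -> Analyze -> Structure -> Act chain described above.
# `call_llm` is a stand-in for a real LLM call; prompts and variable names are illustrative.
def call_llm(prompt: str) -> str:
    return f"[LLM output for: {prompt[:60]}...]"   # replace with a real chat-completion call

meeting_transcript = "...raw transcript pasted here..."
previous_summary = "...summary of the previous meeting..."

# Step 1: filter the raw data
substantive_content = call_llm(
    "Isolate the substantive content (decisions, problems, updates) from the "
    f"conversational noise in this transcript:\n{meeting_transcript}"
)

# Step 2a: identify only what is new relative to the previous summary
new_developments = call_llm(
    f"Given the previous summary:\n{previous_summary}\n\n"
    f"Identify and summarize ONLY the new developments in:\n{substantive_content}"
)

# Step 2b: analyze subtext and mood (a parallel analysis on the same clean data)
implicit_threads = call_llm(
    f"Analyze the following for implicit social dynamics and unstated feelings:\n{substantive_content}"
)

# Step 3: structure the new information
final_summary_table = call_llm(
    f"Create a concise summary of the meeting as a markdown table:\n{new_developments}"
)

# Step 4: act on the structured data
follow_up_email = call_llm(
    f"Based on the following summary table, draft a polite, professional follow-up email:\n{final_summary_table}"
)
print(follow_up_email)
```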

Learnings

  • Prompt Engineering is a subset of Context Engineering
  • Context engineering is about how you structure what goes inside the prompt so that it is structured and goal-driven, enables multi-turn, stateful conversation, and builds a robust, predictable framework
  • Prompting is brittle and shallow: typically single-turn, and it relies heavily on user phrasing
  • Look at the meeting-transcript use case: you can copy-paste the transcript, but you have to clean it.
    • Clean it
    • Extract what is interesting in it
  • If you give an instruction, the LLM will fulfill the instruction
  • Simulate RAG: give the previous summary and ask the model to analyze the current meeting in the context of the previous meeting
  • You can chain the context by providing the previous context and the current context
  • Identify the key trends
  • Different tasks with the same data
  • Determining the action based on output of various tasks
  • How do you build this kind of prompt chain across several meetings? How does one scale it?
  • Learn how to think in ideas
  • Semantic Role Labeling deconstructs a simple sentence into its deep structural components : “Who did what to whom, when, and where”. This process transforms language from a linear string of words into a structured “Semantic Blueprint”
  • Key concepts that you can use and reuse constantly
  • Thinking in terms of ideas
  • Breaking down into little concepts
  • No way to remember all the contexts
  • How am I going to remember the instructions?
  • The goal is to be a mechanic so that you can work on any car
  • I need an agent that is going to plan. The engine will have a planner, an executor, and an execution trace (so that users can see everything)
  • Agent registry: which agent should I use to get this done?
  • Helper functions to access the LLM and the vector store
  • Specialist agents such as a summarizer agent, researcher agent, extractor agent, librarian agent, and writer agent
  • Dual RAG: memory. I want the memory; why put the whole document in the vector store?
  • What happens when the data is poisoned? There is a moderator that works on the input
  • Gives all kinds of instructions as blueprints
  • Nice little agents are available
  • You need a registry: you have a whole company of agents, and their job descriptions are provided in the registry
  • I am giving all kinds of instructions to do stuff. Dependency injection
  • Each agent knows what is going on with the other agents
  • The trace makes sure that you know everything that is going on
  • Prompts alone are not enough. Even if you chain prompts, you will not get far. You are going to need agents and a vector store; test with poisoned data, go with platforms, use your custom context, and use standards
  • Is context engineering especially relevant for multi-agent solutions because of the complexity of passing information from one agent to another?
    • Of course. Imagine a meeting where everyone is wearing headphones and nobody is listening to anyone else. Context engineering asks them to put the headphones down and listen to one another.
    • Without context engineering, you cannot build a multi-agent system
  • 2 million items: there are 2 million products; store them in some form before going to the vector store. Extract the information from each product and put the metadata into the vector store; the metadata refers back to the text in the traditional database, so the sources are cited. You can run the program with the metadata stored in the vector store, perform whatever actions you need on it, and then pull the actual data out of your actual database (see the sketch after this list)
  • Model agnostic, database agnostic: the speaker is not affiliated with any database. Whatever works for your use case is the best option
  • For his book, the author chose OpenAI and Pinecone because they are educational choices
  • Model agnostic, platform agnostic - any platform that fits your need
  • The market will drive your responses. That’s the only thing that should matter
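
A minimal sketch of the two-store pattern from the notes above: embeddings and metadata live in a vector index, the full records stay in a traditional database, and every hit points back to its source row. The toy `embed` function and in-memory "stores" are illustrative assumptions standing in for a real embedding model, vector database, and SQL store.

```python
# Minimal sketch: metadata + embeddings in a vector index, authoritative records in a
# traditional database, with each retrieval hit resolving back to (and citing) its source.
# The character-frequency "embedding" and in-memory stores are toy stand-ins.
import math

def embed(text: str) -> list[float]:
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

product_db = {  # traditional database: the authoritative records
    "sku-001": {"name": "Trail shoe", "description": "Lightweight waterproof trail running shoe"},
    "sku-002": {"name": "City umbrella", "description": "Compact wind-resistant umbrella for commuters"},
}

vector_index = [  # vector store: only embedding + metadata, pointing back to the database
    {"id": sku, "vector": embed(rec["description"]), "metadata": {"name": rec["name"], "source": "product_db"}}
    for sku, rec in product_db.items()
]

def search(query: str, top_k: int = 1):
    q = embed(query)
    ranked = sorted(vector_index, key=lambda item: cosine(q, item["vector"]), reverse=True)
    # resolve each hit back to the full record so the answer can cite its source
    return [(hit["metadata"], product_db[hit["id"]]) for hit in ranked[:top_k]]

print(search("waterproof shoes for trail running"))
```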

Summary I

Context engineering goes far beyond prompt engineering and addresses one of the central challenges in modern AI systems: managing memory, structure, and meaning across long, multi-step interactions. While prompt engineering focuses narrowly on crafting single instructions for an LLM, context engineering establishes an entire operational framework for how AI agents ingest data, preserve state, share information, and take action over time. Prompting alone is brittle—overly sensitive to phrasing and insufficient for multi-turn reasoning. In contrast, context engineering makes interactions goal-driven, structured, and scalable.

A key example discussed was multi-step meeting analysis. Raw transcripts must first be cleaned to remove noise, then processed through parallel analytic tasks—detecting new developments, extracting subtext and sentiment, identifying trends, and generating structured summaries. These outputs eventually feed into downstream actions such as follow-up emails. This demonstrates a chain-of-thought pipeline: filter → analyze → structure → act. Each stage stores outputs in variables that become inputs to the next step. This simple but powerful pattern forms the foundation of context-driven multi-agent workflows.

Underlying these workflows is the need to represent language as structured ideas rather than linear text. Techniques like Semantic Role Labeling (“Who did what to whom, when, and where”) deconstruct sentences into semantic blueprints. Thinking in terms of reusable ideas, micro-concepts, and semantic roles is what enables agents to maintain coherence across multiple turns, documents, or meetings. Humans cannot remember all instructions or context, so systems must externalize memory through vector stores, document repositories, metadata, and per-agent state.

This naturally leads to multi-agent architectures. Modern AI systems require planner agents, executor agents, summarizers, researchers, extractors, librarians, and writers—each with a clear job description registered in an agent registry. Agents exchange information through shared context, and execution traces ensure transparency: the user should always be able to see how conclusions were reached. Dependency injection between agents allows them to coordinate without being tightly coupled. Without context engineering, multi-agent systems fail because each agent operates in isolation—like people in a meeting wearing headphones and not listening to one another.

The workshop also emphasized Retrieval-Augmented Generation, especially Dual RAG: (1) retrieval for authoritative documents, and (2) memory retrieval for persistent context across sessions. Vector stores are not simply embedding repositories; they serve as long-term memory mechanisms that allow agents to bring forward relevant history. The speaker stressed that the system must remain model-agnostic and platform-agnostic. Whether using OpenAI with Pinecone or any other stack, the correct choice is always what fits the real business need.

Finally, the talk addressed scalability: with millions of items or documents, metadata extraction becomes essential. Vector stores store semantic fingerprints and metadata, while raw data remains in a traditional database. This separation enables accurate citations, hallucination-free retrieval, and reliable downstream actions. Poisoned or corrupted inputs must be mitigated with moderators or validation layers—reinforcing that context engineering is not only about adding structure but also about enforcing safety.

In summary, context engineering transforms LLMs from clever responders into dependable systems. It provides memory, structure, coordination, and traceability—making multi-agent architectures feasible and enabling AI to behave less like a chatbot and more like a team of expert collaborators.

Summary II

In modern AI systems, the most serious limitations no longer come from model size or raw language ability—they come from the inability of systems to remember, organize, and meaningfully use information across time, tasks, and agents. Today, context engineering is the discipline that fills this gap. If I were speaking as an NLP expert on this topic, I would argue that context engineering is becoming the backbone of every high-performing multi-agent system, and that memory—rather than prompting—will define the next generation of intelligent applications.

At its core, context engineering is the practice of shaping how AI systems interpret, store, retrieve, and act upon information. Traditional prompt engineering treats a model like a stateless autocomplete engine: we give it a question, it gives us an answer, and the state disappears. But as soon as we attempt multi-step reasoning, multi-document synthesis, conversational planning, or cross-agent cooperation, stateless prompting collapses. Context must persist, evolve, and be selectively recalled, just like human working memory.

This is where the architecture matters more than the prompt itself. A well-designed system combines four components: (1) structured memory, (2) context retrieval, (3) role-based agent delegation, and (4) execution tracing. Memory can take several forms—vector stores for semantic recall, knowledge graphs for relational reasoning, document stores for authoritative references, and episodic logs for state tracking. Retrieval mechanisms use semantic similarity, symbolic rules, or hybrid methods to decide what the system should bring forward at each step. The goal is not simply to store information but to decide what matters right now.
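
As a minimal sketch of the "decide what matters right now" step, the snippet below scores stored memory items against the current query and packs only the best ones into a bounded context. The word-overlap scoring and token budget are illustrative assumptions; a real system would use embeddings, a knowledge graph, or hybrid retrieval.

```python
# Minimal sketch of context retrieval + shaping: rank stored memory items against the
# current query and pack only the most relevant ones into a bounded context window.
def score(query: str, item: str) -> float:
    q, i = set(query.lower().split()), set(item.lower().split())
    return len(q & i) / (len(q) or 1)             # toy relevance: word overlap

def build_context(query: str, memory: list[str], token_budget: int = 40) -> str:
    ranked = sorted(memory, key=lambda item: score(query, item), reverse=True)
    packed, used = [], 0
    for item in ranked:
        cost = len(item.split())                   # crude token estimate
        if used + cost > token_budget:
            break                                  # keep the semantic core, not the textual bulk
        packed.append(item)
        used += cost
    return "\n".join(packed)

memory = [
    "Decision: ship the billing migration by end of Q3.",
    "The team prefers weekly check-ins on Tuesdays.",
    "Open risk: vendor API rate limits during peak hours.",
]
print(build_context("what did we decide about the billing migration?", memory))
```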

Multi-agent systems elevate this idea by distributing responsibilities. Instead of one monolithic LLM doing everything, we design planners, solvers, researchers, critics, and explainers—each with instructions, tools, and knowledge scopes. Context engineering defines how they communicate: what information one agent passes to another, when they should rely on memory, when they should query an external source, and how they reassemble insights into a coherent output. Without this structured flow, agents either duplicate work or contradict each other.

Memory plays a critical role in grounding these agents. Imagine a team of human specialists who can perform expert tasks but forget what they decided five minutes ago. They cannot collaborate, cannot summarize decisions, and cannot plan long-term. This is exactly the challenge with naïve multi-agent LLM systems. Memory provides continuity—turning isolated tasks into an evolving narrative that agents can build upon. Whether implemented with vector embeddings, retrieval-augmented generation, or graph-based knowledge indexing, memory ensures that agents align with previous decisions and avoid hallucinating facts inconsistent with stored knowledge.

However, memory alone is insufficient. We also need *context shaping*—the ability to compress, rank, filter, and reframe information before injecting it into the model. Large contexts without structure often lead to degraded performance. Effective context engineering transforms raw data into distilled, role-relevant packets that an agent can actually use. It prioritizes the semantic core, not the textual bulk.

Ultimately, context engineering is not an add-on—it is the operating system of agentic AI. As models become more powerful, the systems built around them must become equally sophisticated. The future of AI will be defined by how well we engineer context, orchestrate memory, and enable agents to think, collaborate, and act with purpose. This is where true intelligence will emerge.

Resources

Below is a curated list of the strongest books, papers, and resources to deeply learn context engineering, retrieval-augmented generation, knowledge graphs, and multi-agent AI systems.

Each entry includes:

  • What the resource teaches
  • Why it is valuable
  • How it fits into an overall learning journey

Books

    1. Building AI Agents with LLMs, RAG, and Knowledge Graphs

    Authors: Salvatore Raieli, Gabriele Iuculano | Link: https://www.oreilly.com/library/view/building-ai-agents/9781835087060

    • End-to-end practical guide to designing and deploying AI agents.
    • Covers embeddings, RAG, knowledge graphs, memory architectures.
    • Strong emphasis on multi-agent orchestration and production-grade design.
    2. AI Engineering

    Author: Chip Huyen | Link: https://huyenchip.com/

    • Focuses on building real-world AI systems (not just models).
    • Teaches deployment, observability, data pipelines, evaluation.
    • Bridges the gap between theory and industry-scale AI infrastructure.
    3. The LLM Engineering Handbook

    Authors: Paul Iusztin & Maxime Labonne | Link: https://github.com/pinecone-io/llm-engineering-handbook

    • A practical handbook for prompt engineering, fine-tuning, evaluation.
    • Strong coverage of system design around LLMs.
    • Good for developers transitioning from “prompting” to “engineering.”
    4. Designing Intelligence: A Handbook for Building AI Systems

    Author: Doug Lenat

    • High-level conceptual framework from a symbolic AI pioneer.
    • Helps in understanding structured representations, concepts, and context.
    • Good complement to LLM-centric materials.
    5. Knowledge Graphs: Fundamentals, Techniques, and Applications

    Authors: Dieter Fensel et al.

    • Teaches formal graph-based knowledge representation.
    • Useful if you want deep grounding for graph-based RAG (GraphRAG).

Technical Guides & Papers

    1. Knowledge Graph-Based Retrieval-Augmented Generation (KG-RAG)

    Link: https://arxiv.org/abs/2405.12035

    • Introduces hybrid retrieval combining embeddings + graph reasoning.
    • Excellent for designing hallucination-free, traceable systems.
    2. ReAct: Synergizing Reasoning and Acting in Language Models

    Link: https://arxiv.org/abs/2210.03629

    • Foundational for connecting LLM reasoning with tool calling.
    • A key inspiration behind agent planning.
    3. Toolformer: Language Models Can Teach Themselves to Use Tools

    Link: https://arxiv.org/abs/2302.04761

    • Explains how LLMs learn to call APIs and tools.
    • Highly helpful for building agent toolboxes and registries.
    4. Self-RAG: Learning to Retrieve, Generate & Critique

    Link: https://arxiv.org/abs/2310.11511

    • Teaches how to build self-correcting agents with internal feedback loops.
    • Relevant for advanced multi-agent designs with critic/planner roles.

Courses & Tutorials

    1. Stanford CS224U – Natural Language Understanding

    Link: https://web.stanford.edu/class/cs224u/

    • Provides foundational NLP understanding.
    • Helpful for building intuition for semantic roles, embeddings, vectors.
    2. Deeplearning.ai – Building RAG Applications

    Link: https://www.deeplearning.ai/

    • Concrete exercises on retrieval pipelines, indexing, chunking, evaluation.
    • Excellent hands-on complement to the books above.

  1. Start with Building AI Agents with LLMs, RAG, and Knowledge Graphs.
  2. Add AI Engineering to understand how to deploy and maintain them.
  3. Study agent-specific books like Mastering Independent AI Agents.
  4. Strengthen fundamentals through CS224U (semantics, vectors, SRL).
  5. Learn hybrid retrieval (KG-RAG, Self-RAG) to reduce hallucinations.
  6. Build real prototypes and measure them end-to-end.


Keynote Session: Agentic AI Future Trends

Context

This keynote will explore the most critical trends and projections shaping the next wave of artificial intelligence. It will begin with the premise that 2025 is a significant turning point, a “Year of Agents,” where AI transitions from foundational automation to sophisticated, autonomous systems. The talk will examine the shift from LLMs that are “fundamentally passive” to proactive, agentic systems that can independently initiate actions and manage complex, multi-step tasks with minimal human intervention

Speaker

Gideon Mendels

Before founding Comet, he worked as a data scientist at Columbia University, Groupwize, and Google, focusing primarily on NLP applications such as document classification, language modeling, code-switch detection, and keyword search. In each role, he frequently encountered the same challenge: machine learning projects with no historical record of how earlier models were developed, what approaches had been tried, or why certain decisions were made. Every new iteration felt like starting from scratch.

This recurring pain point sparked the idea for Comet.

Today, Comet provides for machine learning what GitHub provides for code. It enables data scientists and ML teams to automatically track datasets, code changes, experiments, and models, ensuring efficiency, transparency, and full reproducibility—from training all the way to production

Resources

Learnings

  • Speaker worked on hate speech detection
  • CEO of Comet
  • In 90 percent of use cases, the LLM output has no impact on program flow
  • LLM output controls function execution. It is agentic in some sense. It has autonomy to select tools
  • LLM output controls iteration and has an impact on the program execution. Multi-step agent
  • There is a gap out there. It is really hard to create an agent. It is really challenging.
  • Few reasons why building agent is hard
    • Execution is nondeterministic and graph traversal is unpredictable
    • High fuzziness from LLM, vector lookups and tool calls
    • Hard to trace, debug or replay sessions
    • Difficult to measure quality or evaluate progress
    • Prompts stand in for clear conditional logic
    • Plus all the usual software issues like exceptions, bugs and failing APIs
  • Prompts are not as expressive as conditional logic
  • Typically one changes the system prompt, but that might break other test cases: vibe testing
  • Learned a new term: vibe testing
  • Instead of manually testing against a dataset, we run an evaluation: test the new prompt against data and a metric, e.g., the semantic similarity between the response and the expected output
  • Building agents is not pure software work. It is closer to ML
  • Where did the learning go ?
    • Once you build your agent, it’s static
    • No matter how much data you get from production, there’s no way to automatically learn from it
  • Missing Layers: Automatic optimization
    • System prompt learning
  • Opik : Open source GenAI Optimization and Observability
    • Provides tracing, observability and optimization
  • Opik HAPO - Automatic agent optimizer
    • Runs the task and collects failure cases instead of treating all errors equally. Clusters mistakes into failure modes (e.g., reasoning errors, formatting slips, hallucinations).
    • Generates targeted patches to fix each failure mode, not full prompt rewrites
    • Verifies that changes fix the error mode without harming overall behavior.
    • Mental model: A doctor diagnosing symptoms and prescribing targeted treatments.
    • No user intervention required
  • HAPO used to improve system prompts
  • Next Frontier in GenAI
    • Why tune manually when you can learn from data?
    • Collect production data and annotate it based on downstream actions
    • Run nightly or weekly optimization jobs
    • We can get better system prompts from production data
  • HAPO does not touch LLM weights
  • Evals: you do not need a massive dataset; 50 to 100 samples is good enough
  • A developer puts in some 10 samples; Opik can generate more samples, and you can collect samples from production. Optimization is not available in Opik
  • Optimizing System prompt is one of the best investments to improve AI agents
  • Two types of security issues: malicious users and authenticating agents
  • Cost per token tends to drop by about 90 percent
  • Focus on quality, cost will anyway come down
  • Opik optimizer can also optimize parameters used in retrieving data from vector database

Summary

Introduction

This keynote is centered around one central message: 2025 is widely being marketed as “The Year of Agents,” but the truth is more nuanced. Enterprises are not yet deploying fully autonomous agents at scale — but they are beginning to seriously build, test, evaluate, and optimize agentic systems for the first time.

Gideon Mendels, drawing from his experience building Comet ML and working with hundreds of ML teams, explains why agentic AI is not just the next hype cycle but a fundamentally different computing paradigm — and why building agents is dramatically harder than people assume.

Using the talk structure shown on the Agenda slide (page 3), the keynote moves through:

  1. The evolution of AI
  2. Why building agents is incredibly hard
  3. What is missing in current agent stacks
  4. The role of tracing and evaluation
  5. Automatic optimization (HAPO)
  6. The future of GenAI and continuous learning

The result is a practical and brutally honest explanation of where agentic AI stands today and what the next frontier will be.


1. The Evolution of AI

(Slide p.4–6 with growth curves and agentic intro)

The talk begins with the classic story: bigger models + bigger datasets = better results. On page 5, the slide shows a curve from AlexNet → GPT-1 → GPT-3 → GPT-5, demonstrating how scaling has historically been the driver of performance improvements.

But Gideon’s message is: Scale has given us intelligence, but not autonomy.

LLMs today are:

  • Brilliant at predicting the next token
  • Terrible at executing multi-step tasks
  • Passive respondents, not active problem-solvers

This sets up the shift toward agentic systems.

On page 6, the slide transitions to Agentic Systems — signaling that the next leap in AI performance will come not from even larger models but from systems that can act, reason, and adapt across steps.

The evolution moves through three agent levels (page 7):

  • Level 1 — LLM output has no impact on program flow
  • Level 2 — LLM output controls which function/tool executes
  • Level 3 — LLM output controls looping, continuation, and iteration: true multi-step agent behavior (e.g., `while llm_should_continue(): execute_next_step()`); see the sketch below

This matches your notes:

  • “90 percent of use cases LLM output has no impact on program flow”
  • “LLM output controls function execution”
  • “LLM output controls iteration and impacts program flow”

The deck formalizes this progression, showing that the majority of today’s “agents” are not truly autonomous.
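
A minimal sketch of these three levels in code, echoing the `while llm_should_continue(): execute_next_step()` pseudocode from the deck; the `llm` stub, tool set, and fixed return value are illustrative assumptions.

```python
# Minimal sketch of the three agent levels. `llm` stands in for any chat-completion call;
# the tools and the constant return value are toy stand-ins.
def llm(prompt: str) -> str:
    return "FINISH"   # a real model would return text, a tool name, or a continue/stop decision

# Level 1: the LLM output has no impact on program flow
summary = llm("Summarize this document: ...")

# Level 2: the LLM output selects which function/tool executes
tools = {
    "search": lambda q: f"search results for {q}",
    "calculator": lambda q: str(sum(int(x) for x in q.split("+"))),
}
choice = llm("Pick one tool (search or calculator) for: 2+2").strip()
result = tools.get(choice, tools["search"])("2+2")

# Level 3: the LLM output controls looping and continuation (true multi-step behaviour)
steps = 0
while llm("Should we continue? Answer CONTINUE or FINISH") == "CONTINUE" and steps < 10:
    llm("Execute the next step of the plan")
    steps += 1

print(summary, result, steps)
```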


2. What is an Agent, Really? (Slide p.7)

The grid on page 7 clarifies what qualifies as a real agent:

  • Simply using an LLM is not an agent.
  • An agent must affect program flow.
  • An agent selects tools, determines arguments, and drives next steps.

Your notes reflect this: “LLM output controls iteration and has an impact on program execution. Multi-step agent.”

Gideon emphasizes that “agentic” does not mean “LLM with a fancy prompt.” It means the LLM drives execution.

This distinction is important because:

  • 90% of GenAI apps today are just wrappers around single-turn LLMs.
  • Only a small subset of apps allow the LLM to actually control logic.

Thus, the “Year of Agents” is aspirational — not descriptive.


3. Why Building Agents Is Hard

(Slide p.9 — “Why building agents is hard”)

On page 9, Gideon lays out the harsh reality of agent development:

  1. Execution is nondeterministic

    • Each run can behave differently.
    • Graph traversal is unpredictable.
    • Steps change based on minor differences in LLM output.
  2. High fuzziness

    • LLMs produce ambiguous outputs.
    • Vector search returns approximate matches.
    • Tool calls may fail.
  3. Hard to trace, debug, or replay

    • Simple logs cannot reconstruct state.
    • Hard to reproduce failures.
  4. Difficult to measure quality or progress

    • What does “good” look like?
    • How do you track improvement?
  5. Prompts stand in for conditional logic

    • Instead of clear IF/ELSE logic, you have paragraphs of English.
    • “Prompt engineering” becomes a brittle replacement for programming.
  6. Normal software failures still apply

    • Exceptions
    • Bugs
    • Network issues
    • API timeouts

Your notes list the exact same challenges.

Gideon’s message: Agents are ML systems disguised as software systems — and they inherit the difficulties of both.


4. “Vibe Testing”: Why Prompt Iteration is Broken

(Slides p.10–12)

On pages 10–12, Gideon demonstrates the current reality of agent iteration:

  • Developer says: “Why is this agent wrong? Let’s rewrite the system prompt.”
  • They change the prompt.
  • They run the agent.
  • If the output “feels” right → they accept it.
  • This is basically vibe checking an agent, not engineering it.

Your notes call this out:

  • “Learned a new terminology: vibe testing”
  • “Prompts are not as expressive as conditional logic”

The slides emphasize:

  • This process gives no signal about what actually improved.
  • You have no idea whether the agent is now better or worse.
  • Rewriting the system prompt often breaks unrelated test cases.
  • Evaluations are missing.

The deck visualizes this with a flawed workflow cycle on pages 10–11.

On page 12, Gideon introduces the fix: Evaluate new prompts against datasets and metrics. He adds:

  • You don’t need huge datasets.
  • 50–100 samples is often enough.

This is exactly what your notes captured:

  • “You do not need a massive eval dataset. 50–100 samples is good enough.”
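
A minimal sketch of what such an evaluation could look like; the `run_agent` stub and the word-overlap similarity are illustrative assumptions standing in for a real agent call and a real semantic-similarity metric (e.g., embedding cosine similarity).

```python
# Minimal sketch of replacing "vibe testing" with a small evaluation: run the new prompt
# over a dataset of expected outputs and score each response with a similarity metric.
def run_agent(system_prompt: str, user_input: str) -> str:
    return "The meeting decided to ship the billing migration in Q3."   # stand-in

def similarity(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0             # toy metric

dataset = [   # 50-100 examples are often enough; two shown here for brevity
    {"input": "Summarize the meeting decision.", "expected": "Ship the billing migration in Q3."},
    {"input": "List the open risks.", "expected": "Vendor API rate limits during peak hours."},
]

def pass_rate(system_prompt: str, threshold: float = 0.5) -> float:
    scores = [similarity(run_agent(system_prompt, ex["input"]), ex["expected"]) for ex in dataset]
    return sum(s >= threshold for s in scores) / len(scores)

print(f"pass rate: {pass_rate('You are a meeting assistant.'):.0%}")
```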

5. The Missing Layer: Traces and Evaluation

(Slide p.13 — “Where did the learning go?”)

This is the philosophical core of the keynote.

On page 13, Gideon asks: Where did the learning go?

LLMs learn from data. But agents do not.

Once an agent is deployed:

  • It is static.
  • It doesn’t improve with experience.
  • Production data cannot automatically feed back into the system.
  • The only way to improve the agent is manually editing prompts.

Your notes: “There is no way to automatically learn from production data. Agent is static.”

He references a tweet from Karpathy on the slide:

  • “We’re missing a paradigm: system prompt learning.”

This is the missing layer:

  • We have training.
  • We have prompting.
  • We have workflows.

But we do not have: continuous agent improvement.

This leads to the next major section: automatic optimization.


6. Automatic Optimization (Slides p.14–21)

On page 14, Gideon introduces a title slide for: “The Missing Layer: Automatic Optimization.”

This is where Opik and HAPO come in.

6.1 Opik (Slide p.15) Open-source GenAI Optimization & Observability platform.

On page 15, the slide highlights:

  • Fully open source
  • Provides tracing
  • Observability for prompts, agents, sessions
  • Optimization capabilities
  • Supports 100+ LLMs
  • 50+ ecosystem integrations (LangGraph, ADK, etc.)

Your notes accurately captured:

  • “Opik: Open source GenAI optimization and observability”

This solves multiple problems:

  • Reproducibility
  • Debuggability
  • Tracking
  • Measurement
  • Evaluation
  • Sharing prompt versions
  • Visualizing agent graphs

6.2 HAPO (Slides p.16–17) The heart of the keynote.

On pages 16 and 17, Gideon introduces HAPO — Automatic Agent Optimizer.

It works like root-cause analysis:

  1. Run the agent.
  2. Collect failures.
  3. Cluster failures into modes:
    • Reasoning mistakes
    • Formatting errors
    • Hallucinations
    • Missing fields
  4. Generate targeted patches (not full rewrites)
  5. Test that patches fix the error mode without causing regressions
  6. Update the system prompt automatically

Your notes:

  • “Clusters mistakes into failure modes”
  • “Generates targeted patches”
  • “Mental model: doctor diagnosing symptoms”
  • “No user intervention required”

This is precisely what the slides show.

The core idea: Agents can automatically learn from failure cases triggered by production use.
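
The snippet below is a conceptual sketch of this diagnose-and-patch loop, not Opik's or HAPO's actual API: collect failures, group them into modes, apply a targeted patch per mode, and keep a patch only if it does not regress the rest of the suite. The evaluation, classifier, and patches are toy stand-ins.

```python
# Conceptual sketch of a HAPO-style loop (NOT the real Opik/HAPO API): cluster failures
# into modes, apply a targeted patch per mode, keep it only if overall quality does not drop.
def run_eval(system_prompt: str) -> list[dict]:
    # stand-in: a real run would execute the agent over an eval dataset
    return [{"passed": True}, {"passed": "valid JSON" in system_prompt, "mode_hint": "formatting"}]

def classify_failure(example: dict) -> str:
    return example.get("mode_hint", "reasoning")   # stand-in for clustering failures into modes

PATCHES = {   # targeted additions, not full prompt rewrites
    "formatting": " Always respond with valid JSON matching the schema.",
    "hallucination": " If the answer is not in the provided context, say you do not know.",
}

def optimize(system_prompt: str) -> str:
    results = run_eval(system_prompt)
    baseline = sum(bool(r["passed"]) for r in results) / len(results)
    failure_modes = {classify_failure(r) for r in results if not r["passed"]}
    for mode in failure_modes:
        candidate = system_prompt + PATCHES.get(mode, "")
        candidate_results = run_eval(candidate)
        score = sum(bool(r["passed"]) for r in candidate_results) / len(candidate_results)
        if score >= baseline:                       # accept only if there is no regression
            system_prompt, baseline = candidate, score
    return system_prompt

print(optimize("You are a careful extraction agent."))
```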

6.3 Real-Life Example: JSON Structured Output (Slides p.18–21)

On pages 18–21, Gideon shows a real LangChain example:

  • A system prompt that requires JSON output.
  • LLM fails consistently.
  • Developer brute-forces prompt changes — fails again.

HAPO instead:

  • Collects failure examples
  • Clusters modes (e.g., formatting errors)
  • Generates prompt patches
  • Achieves 708% improvement (pages 20–21)
  • Uses 21 cents of compute on GPT-4


This is one of the strongest demonstrations in the keynote.

Your notes reflect:

  • “708% improvement!”
  • “Overall cost 21 cents”
  • “Within 4 minutes we learned a new prompt”

This is the missing layer:

  • learning from data
  • prompt optimization at scale
  • automatic corrections

This is the bridge between:

  • handcrafted agent logic
  • and continuously improving agent systems

7. The Next Frontier in GenAI: Continuous System Prompt Learning

(Slide p.22)

On page 22, the slide title is: “The Next Frontier in GenAI: Why manually tune when you can learn from data?”

The speaker explains:

Today:

  • Agents do not learn from user interactions.
  • Prompts remain static forever.
  • Developers hand-edit prompts based on intuition.

Tomorrow:

  • Agents will gather production data.
  • Downstream actions (success/failure) become training labels.
  • Nightly or weekly optimization jobs update prompts automatically.
  • System prompt learning becomes as normal as retraining a model.
  • Tools like HAPO produce targeted improvements.

Your notes reflect all these ideas:

  • “Collect production data and annotate based on downstream actions”
  • “Run nightly or weekly optimization”
  • “Better system prompts from production data”
  • “Optimization is not touching LLM weights — only system prompt”
  • “System prompt optimization is one of the best investments”

This is the future: agentic fine-tuning without touching model weights.

It makes agent improvement:

  • Cheaper
  • Faster
  • Continuous
  • Automated

Gideon highlights that:

  • Cost per token will drop by 90%
  • So optimization jobs become extremely cheap
  • Quality will be the main differentiator

8. Additional Points from Your Notes

8.1 Security issues Your notes mention:

  • “Two types of security issues: malicious users, authenticated agents”

This fits into industry patterns:

  • Users may try to exploit the system
  • Agents must be authenticated when calling tools

This underscores why observability and tracing platforms are essential.

8.2 Vector DB parameter optimization Your notes:

  • “Opik optimizer can optimize parameters used in retrieving data from vector database”

This is a promising direction:

  • Retrieve too broadly → hallucinations
  • Retrieve too narrowly → missing context

HAPO can tune:

  • Top-k
  • Similarity thresholds
  • Chunk sizes
  • Metadata filters

This represents the next level: data retrieval optimization + system prompt optimization
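
A simple sketch of what such retrieval-parameter tuning can look like (not Opik's actual optimizer): sweep candidate settings and keep whichever scores best on an evaluation set. The `eval_score` function is a stub for running your RAG pipeline against the eval dataset.

```python
# Simple sketch of retrieval-parameter optimization: grid-search top_k and the similarity
# threshold and keep the best-scoring combination. `eval_score` is a stand-in for running
# the full retrieve-then-generate pipeline over an eval set and averaging the metric.
from itertools import product

def eval_score(top_k: int, min_similarity: float) -> float:
    # stand-in: pretend the pipeline does best around top_k=5 and min_similarity=0.3
    return 1.0 - abs(top_k - 5) * 0.05 - abs(min_similarity - 0.3)

candidates = list(product([3, 5, 10, 20], [0.2, 0.3, 0.5]))
best_top_k, best_threshold = max(candidates, key=lambda params: eval_score(*params))
print(f"best top_k={best_top_k}, best min_similarity={best_threshold}")
```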


9. Big Picture Summary

Here is a condensed narrative of the entire talk, integrating slides and your notes.

9.1 The Evolution

  • Scaling gave us intelligent LLMs.
  • LLMs are fundamentally passive.
  • Agents add autonomy.
  • The industry is moving from GenAI apps → agentic systems.

9.2 The Problem Building agents is extremely hard because:

  • Execution is non-deterministic.
  • LLMs are fuzzy and unpredictable.
  • Workflows are impossible to debug without tracing.
  • Evaluation is missing.
  • Prompts are brittle replacements for logic.
  • Agents do not learn from new data.

9.3 The Missing Layer Agents need:

  • Traces
  • Observability
  • Evaluation
  • Automatic optimization
  • Continuous system prompt learning

9.4 The Solution Opik + HAPO:

  • Trace agent behavior
  • Evaluate it
  • Cluster failures
  • Generate prompt patches
  • Verify improvement
  • Update system prompts automatically

9.5 The Future In the next few years:

  • Agents will continuously improve from production data
  • Developers will spend less time rewriting prompts
  • Agent optimization pipelines will run automatically
  • System prompt learning will become a standard ML practice
  • Agents will shift from static to adaptive

Gideon’s core claim: The next frontier in GenAI is not bigger models — it is agents that learn from their experience.


Final Takeaway

Gideon’s keynote, reinforced by the deck and your notes, argues that:

*2025 is not the year everyone deploys agents; 2025 is the year we finally learn how to build and optimize them.*

Agentic AI will require:

  • Proper tracing
  • Meaningful evaluation
  • Automatic prompt optimization
  • Continuous learning
  • Clear separation between LLM and agent logic

The era of static prompts is ending. The era of learning agents is just beginning.

The Anatomy of Action: How AI Agents Reason, Plan and Execute

Speaker

Imran Ahmad, PhD, Data Scientist | ML Algorithms Expert, Author of “50 Algorithms Every Programmer Should Know”, Ottawa

Learnings

  • Reason plan and execute are the main tasks of an agent
  • Three forces that are driving the rise of agents
    1. As the tech around AI becomes more stable, there is an explosion of task complexity
      • Agents are supposed to handle complexity
    2. There is demand for speed and scale: act reliably, re-evaluate, and learn. A continuous loop has become necessary
    3. LLMs are becoming a misnomer. Agents are no longer text generators; the model's value is tied up with the capabilities of agents
  • Books written by the speaker
    • 40 Algorithms Every Programmer Should Know
    • 50 Algorithms Every Programmer Should Know
    • Architecting AI Software Systems
    • 30 Agents Every AI Engineer Should Build
  • Dawn of autonomous agent
  • Transition from traditional software to autonomous agents represents a fundamental paradigm shift that transforms how digital systems operate and interact with their environments
  • Agents are close to how humans work
  • Author uses an example of generating a report
    • Discover the world around you
    • Orchestrate the plan
    • Execute the plan
    • Surprises along the way - adjust in real time
    • Throughout the execution, there is a goal in my mind and I am moving in that direction
  • Three characteristics that determine agentic behavior
    • Goal directed behavior
      • Agents don’t need step by step instructions
      • You give them an objective : schedule this, analyze that, generate options for this, and craft a plan
    • Persistent State
      • carry memory, past steps, previous failures, intermediate observations, user preferences
    • Observe and Adapt
      • This is the most transformative feature. Agents don’t execute blindly
  • Agents are becoming verticalized
  • Architecture of agents
    1. Perceive
      • Capturing data from the environment through user input, APIs, sensors, or external systems and converting it into structured formats.
    2. Reason
      • Contextualizing perceived information, applying pattern recognition and inference to extract meaning and relevance.
    3. Planning
      • Orchestrating insights into coherent action sequences, decomposing objectives into tasks and prioritizing steps.
    4. Action
      • Executing selected steps, interfacing with external tools, APIs, databases, or systems to operationalize decisions.
    5. Learning
      • Analyzing outcomes, measuring success, and updating internal models or memory stores to refine future behavior.
  • Communication among agents
    • CrewAI
    • LangChain LangGraph
    • Agent Development kit
  • Reactive agents
  • Deliberate agents
  • Agent Development Lifecycle
  • Five levels of agent maturity
  • Do not over complicate
  • It is not about the number of tools or how many interactions are needed (one agent with 10 tools, or one agent with 5 sub-agents that each have tools). It is the capability of the tools that matters. If the reasoning differs within the same use case, you need multiple brains
    • Agents can handle as many tools as needed
  • AGI is not achievable yet: if you debate with an agent 100 times and the agent wins all 100 times, then we would have achieved AGI

Summarized

Introduction

This talk presents a comprehensive and practical explanation of what AI agents really are, how they function, and why they represent a true architectural shift in computing. Imran Ahmad frames agents not as “fancy LLM wrappers,” but as autonomous systems modeled after human cognitive processes—systems that perceive, reason, plan, act, and learn within dynamic environments.

Agents are powerful precisely because they deal with complexity. They don’t need step-by-step instructions; they figure out pathways toward a goal, maintain memory, adapt their behavior, and work through unexpected conditions. Imran positions the rise of agents as the first fundamental shift since the move from procedural programming to object-oriented thinking.

Your notes mirror the same core message: agents = reason + plan + execute.

This summary integrates both the talk and the deck into a clear, unified narrative.


1. Why Agents, and Why Now?

(Slide: “The Dawn of Autonomous Agents”)

Imran begins by identifying the three forces pushing the rise of agentic AI:

1. Explosion of Task Complexity AI infrastructure and models have stabilized enough that organizations now want systems that can handle complex, multi-step, cross-application tasks—not just answer questions. Agents are meant to handle this complexity naturally.

2. Demand for Speed and Scale Businesses want systems that work continuously, adapt to new data, and learn from outcomes. This creates the need for continuous loops: observe → evaluate → adjust → act.

3. LLMs Are Becoming a Misnomer LLMs are no longer “text generators.” They are becoming *controllers*—engines that orchestrate tools, workflows, knowledge bases, APIs, and other agents.

Your notes reflect this shift beautifully:

  • “LLM output controls function execution”
  • “LLM output controls iteration”
  • “Agents are no longer text generators”

Agents turn passive foundation models into something that behaves more like a team member than a chatbot.


2. The Paradigm Shift: From Software to Autonomous Agents

(Slides 3–6)

Traditional software:

  • Executes predefined instructions
  • Cannot adapt without code changes
  • Has no sense of goals
  • Has no ongoing memory

Agents:

  • Understand objectives, not instructions
  • Work toward goals through planning
  • Keep persistent state
  • Adjust strategies based on incoming data
  • Learn from outcomes

Imran emphasizes that agents operate more like humans:

  1. Discover the world
  2. Build a plan
  3. Execute
  4. Handle surprises
  5. Keep walking toward a goal

Your notes explicitly reference this example, where Imran relates agent behavior to the process of preparing a report—perceiving inputs, planning a structure, executing steps, and adapting to surprises mid-way.

This comparison is intentionally used to anchor agent cognition in something intuitively human.


3. What Makes an Agent “Agentic”?

(Slide: “Key Traits of Intelligent Agents”)

Imran lists six defining traits of an intelligent agent. Your notes condense these into three pillars, but both maps align:

1. Goal-Directed Behavior The agent is given a goal (“analyze this,” “schedule that,” “research options”) —not instructions. It must figure out how to achieve it.

2. Persistent State (Memory) Agents remember:

  • Past actions
  • What worked / failed
  • User preferences
  • Intermediate progress

This enables continuity across steps.

3. Observe and Adapt The most transformative trait. Agents monitor the environment and adjust decisions instead of blindly following scripts.

The deck expands this into six traits:

  • Autonomy
  • Persistence
  • Reactivity
  • Proactiveness
  • Adaptability
  • Goal-orientation

These combine to create systems that behave strategically, not procedurally.


4. The Cognitive Architecture of Agents

(Slide: “Five Phases of the Cognitive Loop”)

At the heart of the talk is the “Cognitive Loop,” a 5-phase architecture:

1. Perception Gathering data from the world—via user queries, APIs, sensors, logs, or tools.

2. Reasoning Understanding context, recognizing patterns, extracting meaning.

3. Planning Breaking down large goals into sequences of actionable subtasks.

4. Action Calling external tools, invoking APIs, updating documents, triggering workflows.

5. Learning Evaluating results, identifying mistakes, updating memory or model parameters.

Imran explains that this loop turns passive LLM output into an ongoing, autonomous cognition cycle. The slides even include code snippets illustrating how Perception and Reasoning work in practice (pages 14–18).

Your own summarized version is exactly aligned:

  • Perceive
  • Reason
  • Plan
  • Act
  • Learn

This is the core “anatomy of action.”
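
A minimal sketch of the loop in code; every phase is a stub, and a real agent would call an LLM, tools, and a memory store at the points indicated in the comments.

```python
# Minimal sketch of the five-phase cognitive loop: perceive -> reason -> plan -> act -> learn.
# All phases are illustrative stubs.
class Agent:
    def __init__(self, goal: str):
        self.goal, self.memory = goal, []

    def perceive(self) -> dict:
        return {"observation": "new data from the environment"}      # e.g. user input, API, sensor

    def reason(self, perception: dict) -> str:
        return f"meaning of '{perception['observation']}' relative to goal '{self.goal}'"

    def plan(self, interpretation: str) -> list[str]:
        return ["gather sources", "draft the report"]                 # decompose the goal into steps

    def act(self, step: str) -> str:
        return f"result of {step}"                                    # call a tool, API, or workflow

    def learn(self, outcomes: list[str]) -> None:
        self.memory.extend(outcomes)                                  # refine future behaviour

    def run_once(self) -> None:
        interpretation = self.reason(self.perceive())
        outcomes = [self.act(step) for step in self.plan(interpretation)]
        self.learn(outcomes)

agent = Agent(goal="produce the weekly report")
agent.run_once()
print(agent.memory)
```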


5. Agent Communication & Collaboration

(Slide: “Communication Patterns”)

Imran stresses that agents are not isolated systems—they often collaborate or build upon each other’s actions.

Communication happens at two levels:

1. Vertical communication (Agent → Tools) Frameworks:

  • CrewAI
  • LangChain / LangGraph
  • Google’s Agent Development Kit (ADK)

These provide:

  • APIs
  • Tool registries
  • Execution graphs
  • Sub-agent orchestration

2. Horizontal communication (Agent → Agent) Agents pass:

  • State
  • Roles
  • Subtasks
  • Intermediate results
  • Signals (success/failure/ready)

Your notes capture this high-level idea:

  • “Communication among agents”
  • “Reactive agents”
  • “Deliberate agents”

The deck dives deeper by introducing two key interoperability protocols:

MCP (Model Context Protocol) – governs vertical interactions.
A2A (Agent-to-Agent) protocols – govern horizontal collaboration.

These protocols define how intelligence flows across an ecosystem of agents.


6. Choosing an Agent Brain: Reactive, Deliberative, or Hybrid

(Slides: “Reactive Agents,” “Deliberate Agents,” “Hybrid Agents”)

Imran introduces three classical agent design patterns:

1. Reactive Agents (Fast Reflexes)

  • Stateless
  • Rule-based
  • Extremely predictable
  • Handles urgent scenarios

Examples: anti-lock braking systems, fire alarms, motion sensors.

2. Deliberative Agents (Strategic Thinkers)

  • Pause → analyze → plan → act
  • Maintain internal models
  • Handle multi-step goals

Example: a travel planner that checks visas, flights, hotels, constraints.

3. Hybrid Agents (Best of Both Worlds)

  • Reactive layer handles time-critical actions
  • Deliberative layer handles planning
  • They communicate and share parameters

Examples: self-driving cars, robotics, cybersecurity systems.

Your notes reflect the same categorization, though simplified.
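
A minimal sketch of the hybrid pattern: a reactive layer answers time-critical events with fixed rules, and everything else falls through to a deliberative planner. The rules and the planner stub are illustrative assumptions.

```python
# Minimal sketch of a hybrid agent: reactive rules for time-critical events,
# a deliberative planner (stubbed here) for everything else.
REACTIVE_RULES = {
    "obstacle_detected": lambda event: "brake immediately",
    "fire_alarm": lambda event: "trigger sprinklers and alert occupants",
}

def deliberative_plan(event: str) -> str:
    # stand-in for an LLM-backed planner that analyzes, plans, and then executes
    return f"analyze '{event}', build a multi-step plan, then execute it"

def handle(event: str) -> str:
    rule = REACTIVE_RULES.get(event)
    return rule(event) if rule else deliberative_plan(event)

print(handle("obstacle_detected"))                                   # reactive path
print(handle("plan a three-city trip with visa constraints"))        # deliberative path
```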


7. Cognitive Infrastructure: The Cognition Core

(Slides: “Cognition Core” and “Five Foundational Communication Layers”)

At the center of an agent system is the Cognition Core:

  • Synthesizes input
  • Resolves conflicts
  • Coordinates actions
  • Maintains coherence across state

In production systems, this often includes:

  • Redundant processes
  • Distributed coordination services
  • System health checks
  • Recovery logic

Then Imran introduces five foundational communication layers that connect the core to the rest of the system:

  1. Profile/Persona
  2. Tool Use / Action Interface
  3. Planning / Feedback
  4. Knowledge / Memory
  5. Reasoning / Evaluation

This model is deeply architectural and reflects how cloud-scale agent platforms (ADK, LangGraph, etc.) internally manage complexity.


8. Interoperability Protocols

(Slides: “Model Context Protocol,” “A2A Protocols”)

Imran highlights that agents only become powerful when:

  • They can plug into new tools easily
  • They can talk to each other without brittle integrations

MCP (Model Context Protocol) handles vertical integration:

  • Tools describe their capabilities
  • Agents discover appropriate tools
  • Invocation is standardized through schemas
  • Example: flight search with input/output JSON schemas (slide page 48); see the sketch below
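
As a sketch of what such a schema-described tool might look like, the dictionary below mirrors the flight-search example; the field names are illustrative assumptions, not the actual MCP specification.

```python
# Illustrative tool description with input/output JSON schemas, in the spirit of the
# flight-search example above. Field names are assumptions, not the real MCP spec.
flight_search_tool = {
    "name": "search_flights",
    "description": "Find flights between two airports on a given date.",
    "input_schema": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "IATA code, e.g. MUC"},
            "destination": {"type": "string", "description": "IATA code, e.g. JFK"},
            "date": {"type": "string", "format": "date"},
        },
        "required": ["origin", "destination", "date"],
    },
    "output_schema": {
        "type": "object",
        "properties": {
            "flights": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {"carrier": {"type": "string"}, "price_eur": {"type": "number"}},
                },
            }
        },
    },
}
print(flight_search_tool["input_schema"]["required"])
```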

A2A Protocols handle horizontal collaboration:

  • Shared state
  • Status messages
  • Role assignments
  • Division of labor among agents

Your notes capture this with:

  • “CrewAI”
  • “LangChain LangGraph”
  • “Agent Development Kit”

But the deck expands this into a formalized, standards-based architecture for large-scale agent ecosystems.


9. Agent Development Lifecycle

(Slides: “Agent Development Lifecycle”)

Imran provides a 5-phase development cycle:

  1. Conceptualization

    • Define problem
    • Define goals
    • Define success metrics
  2. Architecture & Design

    • Choose agent pattern
    • Define memory and cognition
    • Understand tool set
  3. Implementation

    • Pick frameworks
    • Implement modules
    • Build workflow graphs
    • Run simulations
  4. Evaluation

    • Measure task completion
    • Assess decision quality
    • Validate edge cases
    • Run regression tests
  5. Governance

    • Monitor logs
    • Audit decisions
    • Handle failures
    • Ensure compliance
    • Manage lifecycle

The process mirrors MLOps but extends it to tool usage, planning, memory, and multi-step workflows.


10. Five Interaction Levels of Agentic Systems

(Slides: “Five Agent Interaction Paradigms”)

Imran introduces a five-level framework to classify agent capabilities:

  1. Level 1 – Direct LLM: stateless, no autonomy

  2. Level 2 – Proxy Agent: converts natural language → structured actions

  3. Level 3 – Assistant System: tool-augmented, with session memory

  4. Level 4 – Autonomous Agent: multi-step planning, persistent state

  5. Level 5 – Multi-Agent System: distributed decision-making (e.g., smart homes, fleets of robots, trading systems)

Your notes reflect the idea:

  • “Five levels of agent maturity”
  • “Do not overcomplicate”

This framework helps developers know where their system realistically sits.


11. Real-World Business Impact

(Slides: “Real-World Business Impact”)

Imran provides real metrics showing that advanced agentic systems already produce enormous value:

Quandri – Insurance automation

  • 1000s of policies daily
  • 99.9% accuracy
  • $30K+ MRR

My AskAI – Customer support

  • 30-second automated handling
  • 99% satisfaction
  • $25K MRR

Enterprise Bot – Agentic sales

  • Tripled qualified leads
  • 50% reduction in acquisition cost
  • >$2M ARR

These examples show that agentic systems are no longer experimental—they are profitable.

The “gap widens” slide highlights that businesses adopting agents gain massive efficiency, while others fall behind.


12. Five Levels of Agent Maturity

(Slide: “Agentic AI Progression Framework”)

Another five-level model maps where companies fall:

  1. Manual operations
  2. Reactive agents
  3. Tool-using agents
  4. Planning agents
  5. Learning agents

Your notes capture similar ideas:

  • “Five levels of agent maturity”
  • “Agents are getting verticalized”
  • “Don’t overcomplicate”

This framework helps enterprises roadmap their agent evolution.


13. A Critical Insight: It’s Not About Tools

Your notes captured an important insight from Imran:

“It is not about the number of tools. It is the capability of the tool. If reasoning differs in a use case, you need multiple brains.”

This is architecturally profound.

It means:

  • A single agent with 10 tools may be simpler and more efficient
  • But if you need distinct reasoning styles, you need multiple agents

This reinforces why multi-agent systems exist—not for complexity, but for specialization.


14. AGI Discussion

Your notes mention:

“AGI is not achievable: If you debate the agent 100 times and it wins 100 times, then AGI exists.”

This reflects Imran’s position:

  • AGI is not imminent
  • Agents may appear intelligent
  • But reasoning quality remains inconsistent, brittle, and context-dependent

This discussion is included near the end of the talk to ground expectations: agents are powerful, but they are not replacements for human-level cognition.


Final Takeaways

Imran’s core message can be summarized as:

Agents are the next major computing paradigm.

They:

  • perceive
  • reason
  • plan
  • act
  • learn

They are modeled after human cognitive loops and are capable of operating in dynamic, uncertain environments with persistence and adaptation. Modern frameworks, interoperability protocols (MCP, A2A), cognitive architectures, and well-defined development lifecycles are turning agent engineering into a structured discipline.

Agents are not about tool count—they are about capabilities, goals, and cognition. Businesses adopting these systems are already seeing transformative results.

Agent engineering is not hype—it is the next era of AI-enabled software engineering.

Fireside Chat: Agentic AI Architectural Patterns

Context

“Join our Fireside Chat on Agentic AI Design Patterns to explore how autonomous AI agents are shaping next-generation applications. Learn about the core design principles, real-world use cases, and emerging best practices behind building intelligent, self-directed systems.”

Speakers

Juan Bustos Lead AI Partner Engineer, Google Miami-Fort Lauderdale Area

Juan Bustos is a seasoned technology professional specializing in Artificial Intelligence and Machine Learning. With a background in computer science, Juan has held leadership positions at major tech companies including Google, Stripe, and Amazon Web Services. His expertise spans AI services, solution architecture, and cloud computing. Juan is passionate about helping organizations leverage cutting-edge technologies to drive innovation and deliver value

Dr. Ali Arsanjani Director, Applied Engineering, Google San Diego

Dr. Ali Arsanjani is the Director of Applied AI Engineering at Google Cloud and Head of the Global AI Center of Excellence, where he leads GenAI strategic co-innovation, thought leadership, and alliances.

Previously he led all GenAI initiatives in the Google Partner Ecosystem, developing strategic co-innovation assets & partnerships in the fields of Generative AI, Data/Analytics & Predictive AI/ML. His team specializes in co-innovation with ISV and GSI partners as they run, integrate and build on GCP across the ML Lifecycle. Ali also works closely with product management to shape the direction of Google’s AI and analytics offerings from a cloud perspective.

Ali is an Adjunct Prof at San Jose State University & the University of California, San Diego, and advises students in the Masters in Data program and the Data Science Institute, respectively.

Resources

  • Book will be released in 2026
    • Agentic Architectural Patterns for Building Multi-Agent Systems: Proven design patterns & practices for GenAI, Agents, RAG, LLMOps & enterprise-scale AI systems
  • Table of Contents
    • GenAI in the Enterprise: Landscape, Maturity, and Agent Focus
    • Agent-Ready LLMs: Selection, Deployment, and Adaptation
    • The Spectrum of LLM Adaptation for Agents: RAG to Fine-tuning
    • Agentic AI Architecture: Components and Interactions
    • Multi-Agent Coordination Patterns
    • Explainability and Compliance Agentic Patterns
    • Robustness and Fault Tolerance Patterns
    • Human-Agent Interaction Patterns
    • Agent-Level Patterns
    • System-Level Patterns
    • Agent Frameworks: Tools for Building Agentic Applications

Learnings

  • The talk was very generic; its main point was that businesses have to think about creating business value rather than jumping on the AI bandwagon without thinking

Fireside Chat: LangChain and Frameworks

Context

Join a fireside chat with LangChain CEO Harrison Chase and Google Engineer Leonid Kuligin as they dive into the evolution of LangChain, its role in shaping the LLM ecosystem, and how it compares with other AI frameworks. Gain insights into the future of AI application development and best practices for building with cutting-edge language model tools.

Speakers

Leonid Kuligin AI Engineer at Google Cloud, Google Greater Munich Metropolitan Area

Leonid Kuligin is a staff AI engineer at Google Cloud, working on generative AI and classical machine learning solutions (such as demand forecasting or optimization problems). Leonid is one of the key maintainers of Google Cloud integrations on LangChain, and a visiting lecturer at CDTM (TUM and LMU). Prior to Google, Leonid gained more than 20 years of experience in building B2C and B2B applications based on complex machine learning and data processing solutions such as search, maps, and investment management in German, Russian, and US technological, financial, and retail companies.

Harrison Chase Co-founder and CEO, LangChain San Francisco

Harrison Chase is the co-founder and CEO of LangChain, the open-source framework that has become a cornerstone for building applications powered by large-language models (LLMs). Before founding LangChain in late 2022, Chase earned a degree in statistics and computer science from Harvard University. He then worked at Kensho Technologies, where he led the entity-linking team, and at Robust Intelligence, where he directed machine-learning efforts focused on model testing and validation. Under his leadership, LangChain has grown rapidly—gaining millions of installations and widespread adoption across the AI developer community.

Learnings

  • Back when Harrison Chase wanted to start a company, a lot of people in the space were working with images. He created a Python package as a way to learn about the space.
  • The first version came out around the same time as ChatGPT, which created a lot of excitement.
  • The questions at the time were: what paper should I read next? Can I implement it in LangChain?
  • All of AI is on Twitter.
  • Integrations with Google were one component; the other was playbooks for the various papers and techniques.
  • What do the agents of the future look like, and what tools are needed to make them a reality? Early on, LangChain was more like putting templates into a library. LangGraph is a lower-level framework: it has no prebuilt context engineering and is meant to be low level. LangSmith was built for observability and evals. Recently the company has built deep agents, an agent harness: long-running agentic tasks are put into the harness. Some things change rapidly; there are 200 different frameworks out there.
  • Observability and evals are always needed as you want to understand how the agents work and what they are working on
  • Yearly plan: observability and evals are something the company is focusing on big time. The other thing the company is working on is scalability.
    • Observability and evals
    • Deep agents
  • A lot of what will be built will be based on what deep agents need
  • Deep agents are long-running and stateful
  • The company has a deployment platform. It has also launched a no-code platform; it is not a drag-and-drop tool, it is deep agents via a UI
  • Deep agents use code sandboxes; the company is integrating with third parties
  • Deep agents need memory, and the company is probing those aspects
  • Two things the company is focusing on:
    • Observability and Evals
    • Deep research
  • Why are evals difficult?
    • Macro reason: a big part of evaluation is really about product requirements, i.e. what the agent should do. You have to gather a bunch of requirements, and that helps a lot. Writing evals is closely related to writing a product requirements document (PRD).
    • Micro reason: even if you have a good sense of what to do, there are two parts: the data you are going to eval on, and the metrics.
    • How do you build the datasets you are going to eval on? Deep agents in particular are pretty hard to eval.
    • Agents for code are working well partly because code can be evaluated; evals can be done for code.
    • Especially for complex agents, coming up with evals is difficult.
  • Building an LLM-as-judge is the same as building an LLM application
  • LangSmith: create an LLM-as-judge with aligned evals. Before you write anything, you write evals and then benchmark; it forces you to build patterns for the app
  • Binary classification works well for LLM-as-a-judge (see the judge sketch after this list)
  • Costs can become a factor; a smaller LLM can be used as the judge
  • Treat the LLM judge like any other app: you must spend time on the prompts
  • You can do online evals or offline evals
  • LangSmith supports online evals
  • Take traces and run LLM as a judge.
  • Conversational evals in LangSmith
  • If a thread has not been updated in 30 minutes, assume it is finished and run the eval
  • Thread evals are supported in LangSmith
  • Voice evals will also be supported in LangSmith
  • Trajectory evals are also being worked on
  • LangSmith lets you build datasets, run experiments, and track them over time. It is not pass/fail like unit testing
  • The LangGraph course is worth checking out and is useful
  • From a technical perspective, an agent is an LLM running in a loop, where the loop helps it decide what to do next (see the agent-loop sketch after this list)
  • Andrew Ng says it is better to say “agentic”: it depends on how much the LLM is allowed to decide
  • How much can you let the LLM decide? In LangGraph you set the state of the graph as a typed dictionary. Deep agents work on a file system: the state is a file system. Once you extend state into a file system, agents can not just manage state but also create and manage different states
  • Multi-agent systems: internal behaviour becomes harder to predict
    • The biggest issue is communication; it is really hard. Human-to-agent, agent-to-agent, or whatever kind of communication you can think of, there are problems
    • The least buggy pattern is one agent calling another agent as a tool: it calls the tool and gets back a response (illustrated in the agent-loop sketch after this list)
    • The main agent might not call the sub-agent with enough information, so prompt the main agent on how to call the sub-agent, and prompt the sub-agent to respond properly
    • You have to tell agents how to communicate. There are other ways to do it: routers, hand-offs. Practically speaking, treating one agent as the main agent and the others as tool calls is the easiest way to get things done
  • Evaluating multi-agent systems
    • Evaluate each agent
    • Evaluate whether the agent calls other agents properly
    • Evaluate end to end
  • What is difficult about developing an agent?
    • It is really all context engineering; at the end of the day it is all communication, i.e. bringing the right context to the LLM. The ways problems crop up:
      • Context is too long
      • Context is not correct
      • No tools to get the context
      • No instructions to access tools
    • LangGraph is very low level: no hidden prompts, no built-in context engineering; it provides low-level flexibility
    • Deep agents are an agent harness; LangChain and LangGraph are agent abstractions. Deep agents come with tools, compaction, etc. It is like Claude Code; they are all about best practices for context engineering
    • LangChain - LangGraph - Deep Agents: that is how to think about the decision of which level of abstraction you want
    • The hard part of building agents is context engineering; you need to know how the context is working
  • You have a good prototype that shows OK performance; how do you go to production?
    • Make the performance measurable.
      • There are stages to it.
      • One pitfall is writing evals too early; there is a risk of missing the forest for the trees. Being able to measure performance is nice, but maybe 5 evals are enough.
      • If you write evals for a tool and then completely get rid of that tool, the time spent building those evals is wasted
      • Tracing is very important
    • 5 evals and 5 to 10 data points are enough to start building an agent: take a small number of data points and start testing
    • The immediate blocker is usually the quality of the agent; updating the prompt with observability and evals at hand is what helps
    • Build the prototype, then iterate on quality: evals and measuring the quality
    • Long-running agents are problematic when you want to eval them
    • Build-test-deploy-manage
  • Three things contributed to the general notion that “LangChain is not production ready”
    • It is built on top of integrations and hence was changing quickly; the 1.0 release is now done
    • A lot more infra was slowly added to the original LangChain: streaming support, chat message history, hit
    • In LangGraph the company wrote everything with all of that in place at once
    • In LangChain there were a bunch of hidden prompts and context engineering that were not obvious
  • Why is observability for GenAI different?
    • From a tech perspective, you are sending different types of payloads: large text and multi-modal
    • There is a data flywheel in GenAI that doesn't exist in classical software applications. You have traces that you can open up; there are evals and human annotation. What your logs are and what you do with them is different
    • The profile of the users is different: in LangSmith, product people, data scientists, everyone is on the platform
  • Taking LangGraph to production: the cost is on the LLM side rather than the infra side. Some lambdas don't support long-running tasks, so for longer-running work you want persistence and may need a database. The main cost is usually the LLM cost
  • LangChain 1.0 has the same interface across 70 models; content strings vary by provider
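
To make the “LLM running in a loop” definition and the “agent calling another agent as a tool” pattern above concrete, here is a minimal, hypothetical sketch in plain Python. It is not LangChain or LangGraph code; `chat()` stands in for any chat-completion call that returns either a final answer or a tool request.

```python
# Hypothetical sketch: an agent is an LLM running in a loop that decides,
# at each step, whether to call a tool or to finish.
import json


def chat(messages: list[dict]) -> dict:
    """Placeholder for a chat-completion call. Assumed to return either
    {"final": "..."} or {"tool": "<name>", "args": {...}}."""
    raise NotImplementedError


def research_subagent(question: str) -> str:
    """A sub-agent exposed to the main agent as just another tool:
    the 'least buggy' multi-agent pattern mentioned in the chat."""
    reply = chat([
        {"role": "system", "content": "You are a research specialist. Answer concisely."},
        {"role": "user", "content": question},
    ])
    return reply.get("final", "")


TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy example only; never eval untrusted input
    "research": research_subagent,
}


def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):            # the loop *is* the agent
        decision = chat(messages)
        if "final" in decision:           # the model decided to stop
            return decision["final"]
        tool = TOOLS[decision["tool"]]    # the model decided to act
        result = tool(**decision["args"])
        messages.append({
            "role": "tool",
            "content": json.dumps({"tool": decision["tool"], "result": result}),
        })
    return "Stopped: step budget exhausted"
```

Everything that makes real agents hard (context engineering, eval datasets, persistence for long-running work) sits around this loop rather than inside it, which is what the LangGraph and deep-agents discussion above is about.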

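The binary LLM-as-a-judge pattern discussed in the evals part of this chat can be sketched as follows. This is hypothetical illustrative code, not the LangSmith API; `judge_llm` and `my_agent` are placeholders for a (possibly smaller, cheaper) judge model and for the application under test, and the tiny dataset reflects the "5 to 10 data points is enough to start" advice above.

```python
# Hypothetical sketch of a binary (PASS/FAIL) LLM-as-a-judge eval.

def judge_llm(prompt: str) -> str:
    """Placeholder for the judge model; a smaller model can keep costs down."""
    raise NotImplementedError


def my_agent(question: str) -> str:
    """Placeholder for the agent/application you are evaluating."""
    raise NotImplementedError


JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Agent answer: {answer}
Reference notes: {reference}
Reply with exactly one word: PASS or FAIL."""

# A tiny hand-built dataset: a handful of examples is enough to start iterating.
DATASET = [
    {"question": "What is the refund window?", "reference": "30 days"},
    {"question": "Which plan includes SSO?", "reference": "Enterprise plan only"},
]


def run_evals(dataset: list[dict]) -> float:
    """Run the agent on each example and let the judge do binary classification.
    The result is a pass rate to track over time, not a unit-test style gate."""
    passes = 0
    for example in dataset:
        answer = my_agent(example["question"])
        verdict = judge_llm(JUDGE_PROMPT.format(
            question=example["question"],
            answer=answer,
            reference=example["reference"],
        ))
        passes += verdict.strip().upper().startswith("PASS")
    return passes / len(dataset)
```

Treating the judge as its own small LLM app, with its own prompt and its own alignment checks against human labels, is exactly the "building an LLM-as-judge is the same as building an LLM application" point from the chat.
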
Power Talk: Ethics and Security

Context

Join our Tech Talk on Ethics and Security in AI, where experts explore the responsible use of large language models, data privacy, and bias mitigation. Discover how to build trustworthy, transparent, and secure AI systems that align with global ethical standards.

Speaker

Heather Dawe Chief Data Scientist, UST Leeds

A Data and AI Leader with 25 years experience working across industry, innovating with data and the ways in which it can be used to improve outcomes, quality and efficiency.

A recognised AI Thought Leader, Heather has appeared on the BBC, Sky News and in numerous national and international news publications including The Guardian, Financial Times and the Economic Times. With her UST colleague Dr. Adnan Masood, she is co-author of the 2023 book ‘Responsible AI in the Enterprise’, a guide to how to succeed with AI in business safely, fairly and ethically. In January 2025 she spoke on a panel discussing AI Ethics at World Woman Davos, part of the World Economic Forum.

Heather has worked at a senior level as a Statistician and Data Scientist within government, the wider public sector and industry. With expertise in the health, retail, telecommunications, insurance and finance domains, she loves to problem solve and innovate with real world challenges, and is as comfortable doing this within large Enterprises and academia as she is within start-ups and incubators.

An effective team builder, Heather pioneered the development of multi-disciplinary data science teams within the UK public sector.

With a love for her own learning, Heather is also passionate about helping others to develop their skills and expertise. She is an advocate for democratising AI as well as achieving greater diversity in those who develop it.

Learnings

  • General talk on responsible AI.
  • Learned about the attack on Claude Code.